1. Decision Tree
Splitting Indices, Splitting Criteria, and Decision Tree Construction Algorithms
Data Mining Dr. Iram Naim, Dept. of CSIT, MJPRU
2. Constructing decision trees
Strategy: top down
Recursive divide-and-conquer fashion
First: select attribute for root node
Create branch for each possible attribute value
Then: split instances into subsets
One for each branch extending from the node
Finally: repeat recursively for each branch, using
only instances that reach the branch
Stop if all instances have the same class
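To make the recursion concrete, here is a minimal sketch (not from the slides) of this top-down, divide-and-conquer procedure, assuming the data set is given as a list of attribute-value dictionaries, a parallel list of class labels, and a caller-supplied select_attribute function:

```python
# Minimal sketch of top-down, recursive tree construction.
# Assumptions: rows is a list of dicts mapping attribute name -> value,
# labels is a parallel list of class labels, and select_attribute is a
# caller-supplied function (e.g. one that maximises information gain).
from collections import Counter

def build_tree(rows, labels, attributes, select_attribute):
    # Stop if all instances have the same class: return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: return the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute for this node.
    best = select_attribute(rows, labels, attributes)
    node = {best: {}}
    # Create one branch per attribute value and recurse on the
    # instances that reach that branch.
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       remaining, select_attribute)
    return node
```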
3. Play or not?
• The weather dataset
5. Best Split
1. Evaluation of the candidate splits for each attribute and selection of the best one, i.e. determination of the splitting attribute
2. Determination of the splitting condition on the selected splitting attribute
3. Partitioning of the data using the best split
6. Splitting Indices
Determining the goodness of a split
1. Information Gain
(from information theory, based on entropy)
2. Gini Index
(from economics, a measure of diversity)
7. Computing purity: the information
measure
• Information is a measure of the reduction of uncertainty.
• It represents the expected amount of information that would be needed to "place" a new instance in a branch.
8. Which attribute to select?
9. Final decision tree
Splitting stops when data can’t be split any further
10. Criterion for attribute selection
Which is the best attribute?
Want to get the smallest tree
Heuristic: choose the attribute that produces the
“purest” nodes
11. Information gain: increases with the average purity of the subsets
Strategy: choose the attribute that gives the greatest information gain
12. How to compute Information Gain: Entropy
1. When the number of either yes or no instances is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no instances are equal, the information reaches its maximum, because we are maximally uncertain about the outcome.
3. Complex scenarios: the measure should be applicable to a multiclass situation, where a multi-staged decision must be made.
14. Entropy: Outlook, sunny
Formula for computing the entropy:
entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97095 ≈ 0.971 bits
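As a quick check of this number, here is a small illustrative Python helper (not part of the slides) that computes the entropy of a class-count distribution:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Outlook = sunny covers 2 "yes" and 3 "no" instances:
print(entropy([2, 3]))  # ~0.971 bits
```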
15. Measures: Information &
Entropy
• Entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of the reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
17. Computing Information Gain
Information gain: information before splitting –
information after splitting
gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2])
= 0.940 - 0.693
= 0.247 bits
Information gain for the attributes of the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
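An illustrative sketch of the same computation in Python; info() is a hypothetical helper name for the weighted average entropy after a split:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info(*partitions):
    """Weighted average entropy of the partitions produced by a split,
    each partition given as a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

before = entropy([9, 5])              # info([9,5]) = 0.940 bits
after = info([2, 3], [4, 0], [3, 2])  # Outlook's three branches = 0.693 bits
print(before - after)                 # gain(Outlook) = 0.247 bits
```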
18. Information Gain Drawbacks
Problematic: attributes with a large number
of values (extreme case: ID code)
19. Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
20. Tree stump for ID code attribute
Entropy of the split is zero: each ID code value identifies a single instance, so every branch is pure (see Weka book 2011: 105-108).
Information gain is therefore maximal for ID code (namely 0.940 bits)
21. Information Gain
Limitations
Problematic: attributes with a large number
of values (extreme case: ID code)
Subsets are more likely to be pure if there is
a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
(Another problem: fragmentation)
22. Gain ratio
Gain ratio: a modification of the information gain
that reduces its bias
Gain ratio takes number and size of branches into
account when choosing an attribute
It corrects the information gain by taking the intrinsic
information of a split into account
Intrinsic information: the entropy of the distribution of instances into branches (information about the class is disregarded)
23. Gain ratios for weather
data
Outlook:
  Info: 0.693
  Gain: 0.940 - 0.693 = 0.247
  Split info: info([5,4,5]) = 1.577
  Gain ratio: 0.247/1.577 = 0.157
Temperature:
  Info: 0.911
  Gain: 0.940 - 0.911 = 0.029
  Split info: info([4,6,4]) = 1.557
  Gain ratio: 0.029/1.557 = 0.019
Humidity:
  Info: 0.788
  Gain: 0.940 - 0.788 = 0.152
  Split info: info([7,7]) = 1.000
  Gain ratio: 0.152/1.000 = 0.152
Windy:
  Info: 0.892
  Gain: 0.940 - 0.892 = 0.048
  Split info: info([8,6]) = 0.985
  Gain ratio: 0.048/0.985 = 0.049
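An illustrative Python sketch of the gain-ratio computation for Outlook; gain_ratio() is a hypothetical helper, and the split info is computed as the entropy of the branch sizes:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Gain ratio = information gain / split info, where split info is the
    entropy of the branch sizes themselves (class labels are disregarded)."""
    return gain / entropy(branch_sizes)

# Outlook splits the 14 instances into branches of size 5, 4 and 5:
print(gain_ratio(0.247, [5, 4, 5]))  # ~0.157
```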
24. More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
Standard fix: ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix: only consider attributes with greater
than average information gain
25. Gini index
All attributes are assumed continuous-
valued
Assume there exist several possible split
values for each attribute
May need other tools, such as
clustering, to get the possible split
values
Can be modified for categorical attributes
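For reference, the Gini index of a node with class proportions p_i is 1 - sum of p_i squared; a small illustrative helper (not from the slides):

```python
def gini(counts):
    """Gini index of a class distribution given as counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))  # impure node (9 yes, 5 no), ~0.459
print(gini([4, 0]))  # pure node, 0.0
```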
28. Splitting Criteria
Let attribute A be a numerical-valued attribute. We must determine the best split point for A (binary split).
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values ai and ai+1
The point with the minimum expected information requirement
for A is selected as the split point
Split
D1 is the set of tuples in D satisfying A ≤ split-point
D2 is the set of tuples in D satisfying A > split-point
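A rough sketch of this midpoint search in Python, using the Gini index as the impurity measure; the humidity values and labels in the example are made up for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Sort the attribute values, try the midpoint between each pair of
    adjacent values, and return the point with the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values give no usable midpoint
        point = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= point]    # D1: A <= split-point
        right = [l for v, l in pairs if v > point]    # D2: A >  split-point
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score

# Hypothetical humidity values with play / don't-play labels:
print(best_numeric_split([65, 70, 75, 80, 85, 90],
                         ["yes", "yes", "yes", "no", "no", "no"]))
```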
29. Binary Split
Numerical-Valued Attributes
Examine each possible split point. The midpoint between each pair
of (sorted) adjacent values is taken as a possible split-point
For each split-point, compute the weighted sum of the impurity of each of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point)
The point that gives the minimum Gini index for attribute A is
selected as its split-point
30. Class Histogram
Two class histograms, one for the tuples below the candidate split point and one for the tuples above it, are used to store the class distribution of a numerical attribute.
31. Binary Split
Categorical Attributes
Examine the partitions resulting from all possible subsets of {a1, ..., av}
Each subset SA defines a binary test on attribute A of the form "A ∈ SA?"
There are 2^v possible subsets. Excluding the full set and the empty set leaves 2^v - 2 candidate subsets
The subset that gives the minimum Gini index for attribute
A is selected as its splitting subset
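A small illustrative generator for these candidate subsets; the attribute values used in the example are just the Outlook values from the weather data:

```python
from itertools import combinations

def candidate_subsets(values):
    """All non-empty proper subsets of an attribute's values: 2^v - 2 of them.
    Each subset S defines a binary test of the form "A in S?"."""
    values = list(values)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            yield set(subset)

print(list(candidate_subsets(["sunny", "overcast", "rainy"])))  # 2^3 - 2 = 6 subsets
```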
32. Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
33. Decision tree construction algorithms
1. Information Gain
• ID3
• C4.5
• C5.0
• J48
2. Gini Index
• SPRINT
• SLIQ
34. Iterative Dichotomizer (ID3)
Quinlan (1986)
Each node corresponds to a splitting attribute
Each arc is a possible value of that attribute.
At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from
the root.
Entropy is used to measure how informative a node is.
The algorithm uses the criterion of information gain to determine the
goodness of a split.
The attribute with the greatest information gain is taken as
the splitting attribute, and the data set is split for all distinct
values of the attribute.
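The following illustrative sketch selects the attribute with the greatest information gain; it is written to plug into the build_tree sketch given earlier (information_gain and select_attribute are hypothetical helper names, not part of the original ID3 pseudocode):

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(rows, labels, attribute):
    """Information before splitting minus information after splitting on attribute."""
    base = entropy(list(Counter(labels).values()))
    total, after = len(labels), 0.0
    for value in set(row[attribute] for row in rows):
        branch = [l for r, l in zip(rows, labels) if r[attribute] == value]
        after += len(branch) / total * entropy(list(Counter(branch).values()))
    return base - after

def select_attribute(rows, labels, attributes):
    """Pick the attribute with the greatest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Tiny illustration with a single attribute:
rows = [{"Outlook": "Sunny"}, {"Outlook": "Overcast"},
        {"Outlook": "Rainy"}, {"Outlook": "Rainy"}]
labels = ["No", "Yes", "Yes", "No"]
print(select_attribute(rows, labels, ["Outlook"]))
```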
36. CART
A Classification and Regression Tree (CART) is a predictive algorithm used in machine learning.
It explains how a target variable's values can be
predicted based on other values.
It is a decision tree where each fork is a split in a
predictor variable and each node at the end has a
prediction for the target variable.
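As an illustration only, scikit-learn's DecisionTreeClassifier implements an optimised CART variant; the toy feature encoding below is made up for the example:

```python
# Illustrative CART-style learner using scikit-learn (Gini impurity by default).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical numeric encoding of (outlook, temperature) for a few instances:
X = [[0, 85], [0, 80], [1, 83], [2, 70]]
y = ["no", "no", "yes", "yes"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X, y)
print(tree.predict([[1, 75]]))
```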
37. Decision Tree Induction Methods
SLIQ (1996 — Mehta et al.)
Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (1996 — J. Shafer et al.)
Constructs an attribute list data structure.
Both algorithms:
Pre-sort and use attribute lists
Recursively construct the decision tree
Use the Gini index
Re-write the dataset – expensive!
CLOUDS: Approximate version of SPRINT.
38. PUBLIC (1998 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stop growing the
tree earlier
RainForest (1998 — Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
BOAT (1999 — Gehrke, Ganti, Ramakrishnan & Loh)
Uses bootstrapping to create several small samples
39. Random Forest
Random Forest is an example of ensemble learning, in which multiple models are combined to obtain better predictive performance.
Two key concepts that give it the name random:
A random sampling of training data set when building trees.
Random subsets of features considered when splitting nodes.
A technique known as bagging is used to create the ensemble of trees: multiple training sets are generated by sampling with replacement.
In bagging, N bootstrap samples are drawn from the data set using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined in parallel using voting or averaging.
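A minimal illustrative usage sketch with scikit-learn's RandomForestClassifier on synthetic data; the parameter values are arbitrary choices, not from the slides:

```python
# Bagging over bootstrap samples plus a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # sample the training data with replacement
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy of the ensemble
```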