1. Decision Tree
Splitting Indices, Splitting Criteria, and Decision Tree Construction Algorithms
Data Mining Dr. Iram Naim, Dept. of CSIT, MJPRU
2. Constructing decision trees
Strategy: top down
Recursive divide-and-conquer fashion
First: select attribute for root node
Create branch for each possible attribute value
Then: split instances into subsets
One for each branch extending from the node
Finally: repeat recursively for each branch, using
only instances that reach the branch
Stop if all instances have the same class
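To make the recursion concrete, here is a minimal sketch (not from the slides) of this top-down, divide-and-conquer procedure, assuming the data set is given as a list of attribute-value dictionaries, a parallel list of class labels, and a caller-supplied select_attribute function:

```python
# Minimal sketch of top-down, recursive tree construction.
# Assumptions: rows is a list of dicts mapping attribute name -> value,
# labels is a parallel list of class labels, and select_attribute is a
# caller-supplied function (e.g. one that maximises information gain).
from collections import Counter

def build_tree(rows, labels, attributes, select_attribute):
    # Stop if all instances have the same class: return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: return the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute for this node.
    best = select_attribute(rows, labels, attributes)
    node = {best: {}}
    # Create one branch per attribute value and recurse on the
    # instances that reach that branch.
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       remaining, select_attribute)
    return node
```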
3. Play or not?
• The weather dataset
5. Best Split
1. Evaluation of the candidate splits for each attribute and selection of the best one, i.e. determination of the splitting attribute
2. Determination of the splitting condition on the selected splitting attribute
3. Partitioning of the data using the best split
6. Splitting Indices
Determining the goodness of a split
1. Information Gain
(from information theory, based on entropy)
2. Gini Index
(from economics, a measure of diversity)
7. Computing purity: the information
measure
• Information is a measure of the reduction of uncertainty.
• It represents the expected amount of information that would be needed to "place" a new instance in a branch.
8. Which attribute to select?
9. Final decision tree
Splitting stops when data can’t be split any further
10. Criterion for attribute selection
Which is the best attribute?
Want to get the smallest tree
Heuristic: choose the attribute that produces the
“purest” nodes
11. Information gain: increases with the average purity of the subsets
Strategy: choose the attribute that gives the greatest information gain
12. How to compute Information Gain: Entropy
1. When the number of either yes or no instances is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no instances are equal, the information reaches its maximum, because we are maximally uncertain about the outcome.
3. Complex scenarios: the measure should be applicable to a multiclass situation, where a multi-staged decision must be made.
14. Entropy: Outlook, sunny
Formula for computing the entropy:
entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97095 ≈ 0.971 bits
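As a quick check of this number, here is a small illustrative Python helper (not part of the slides) that computes the entropy of a class-count distribution:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Outlook = sunny covers 2 "yes" and 3 "no" instances:
print(entropy([2, 3]))  # ~0.971 bits
```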
15. Measures: Information &
Entropy
• Entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of the reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
17. Computing Information Gain
Information gain: information before splitting –
information after splitting
gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2])
= 0.940 - 0.693
= 0.247 bits
Information gain for the attributes of the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
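An illustrative sketch of the same computation in Python; info() is a hypothetical helper name for the weighted average entropy after a split:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info(*partitions):
    """Weighted average entropy of the partitions produced by a split,
    each partition given as a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

before = entropy([9, 5])              # info([9,5]) = 0.940 bits
after = info([2, 3], [4, 0], [3, 2])  # Outlook's three branches = 0.693 bits
print(before - after)                 # gain(Outlook) = 0.247 bits
```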
18. Information Gain Drawbacks
Problematic: attributes with a large number
of values (extreme case: ID code)
19. Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
20. Tree stump for ID code attribute
Entropy of the split is zero: each ID code value identifies a single instance, so every branch is pure (see Weka book 2011: 105-108).
Information gain is therefore maximal for ID code (namely 0.940 bits)
21. Information Gain
Limitations
Problematic: attributes with a large number
of values (extreme case: ID code)
Subsets are more likely to be pure if there is
a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
(Another problem: fragmentation)
22. Gain ratio
Gain ratio: a modification of the information gain
that reduces its bias
Gain ratio takes number and size of branches into
account when choosing an attribute
It corrects the information gain by taking the intrinsic
information of a split into account
Intrinsic information: the entropy of the distribution of instances into branches (information about the class is disregarded)
23. Gain ratios for weather
data
Outlook:
  Info: 0.693
  Gain: 0.940 - 0.693 = 0.247
  Split info: info([5,4,5]) = 1.577
  Gain ratio: 0.247/1.577 = 0.157
Temperature:
  Info: 0.911
  Gain: 0.940 - 0.911 = 0.029
  Split info: info([4,6,4]) = 1.557
  Gain ratio: 0.029/1.557 = 0.019
Humidity:
  Info: 0.788
  Gain: 0.940 - 0.788 = 0.152
  Split info: info([7,7]) = 1.000
  Gain ratio: 0.152/1.000 = 0.152
Windy:
  Info: 0.892
  Gain: 0.940 - 0.892 = 0.048
  Split info: info([8,6]) = 0.985
  Gain ratio: 0.048/0.985 = 0.049
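An illustrative Python sketch of the gain-ratio computation for Outlook; gain_ratio() is a hypothetical helper, and the split info is computed as the entropy of the branch sizes:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Gain ratio = information gain / split info, where split info is the
    entropy of the branch sizes themselves (class labels are disregarded)."""
    return gain / entropy(branch_sizes)

# Outlook splits the 14 instances into branches of size 5, 4 and 5:
print(gain_ratio(0.247, [5, 4, 5]))  # ~0.157
```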
24. More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
Standard fix: ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix: only consider attributes with greater
than average information gain
25. Gini index
All attributes are assumed continuous-
valued
Assume there exist several possible split
values for each attribute
May need other tools, such as
clustering, to get the possible split
values
Can be modified for categorical attributes
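For reference, the Gini index of a node with class proportions p_i is 1 - sum of p_i squared; a small illustrative helper (not from the slides):

```python
def gini(counts):
    """Gini index of a class distribution given as counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))  # impure node (9 yes, 5 no), ~0.459
print(gini([4, 0]))  # pure node, 0.0
```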
28. Splitting Criteria
Let attribute A be a numerical-valued attribute. We must determine the best split point for A (binary split).
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values ai and ai+1
The point with the minimum expected information requirement
for A is selected as the split point
Split
D1 is the set of tuples in D satisfying A ≤ split-point
D2 is the set of tuples in D satisfying A > split-point
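A rough sketch of this midpoint search in Python, using the Gini index as the impurity measure; the humidity values and labels in the example are made up for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Sort the attribute values, try the midpoint between each pair of
    adjacent values, and return the point with the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values give no usable midpoint
        point = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= point]    # D1: A <= split-point
        right = [l for v, l in pairs if v > point]    # D2: A >  split-point
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score

# Hypothetical humidity values with play / don't-play labels:
print(best_numeric_split([65, 70, 75, 80, 85, 90],
                         ["yes", "yes", "yes", "no", "no", "no"]))
```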
29. Binary Split
Numerical-Valued Attributes
Examine each possible split point. The midpoint between each pair
of (sorted) adjacent values is taken as a possible split-point
For each split-point, compute the weighted sum of the impurity of each of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point)
The point that gives the minimum Gini index for attribute A is
selected as its split-point
30. Class Histogram
Two class histograms, one for the tuples below the candidate split point and one for the tuples above it, are used to store the class distribution of a numerical attribute.
31. Binary Split
Categorical Attributes
Examine the partitions resulting from all possible subsets of {a1, ..., av}
Each subset SA defines a binary test on attribute A of the form "A ∈ SA?"
There are 2^v possible subsets. Excluding the full set and the empty set leaves 2^v - 2 candidate subsets
The subset that gives the minimum Gini index for attribute
A is selected as its splitting subset
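A small illustrative generator for these candidate subsets; the attribute values used in the example are just the Outlook values from the weather data:

```python
from itertools import combinations

def candidate_subsets(values):
    """All non-empty proper subsets of an attribute's values: 2^v - 2 of them.
    Each subset S defines a binary test of the form "A in S?"."""
    values = list(values)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            yield set(subset)

print(list(candidate_subsets(["sunny", "overcast", "rainy"])))  # 2^3 - 2 = 6 subsets
```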
32. Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
33. Decision tree construction algorithms
1. Information Gain
• ID3
• C4.5
• C5.0
• J48
2. Gini Index
• SPRINT
• SLIQ
34. Iterative Dichotomizer (ID3)
Quinlan (1986)
Each node corresponds to a splitting attribute
Each arc is a possible value of that attribute.
At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from
the root.
Entropy is used to measure how informative a node is.
The algorithm uses the criterion of information gain to determine the
goodness of a split.
The attribute with the greatest information gain is taken as
the splitting attribute, and the data set is split for all distinct
values of the attribute.
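The following illustrative sketch selects the attribute with the greatest information gain; it is written to plug into the build_tree sketch given earlier (information_gain and select_attribute are hypothetical helper names, not part of the original ID3 pseudocode):

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(rows, labels, attribute):
    """Information before splitting minus information after splitting on attribute."""
    base = entropy(list(Counter(labels).values()))
    total, after = len(labels), 0.0
    for value in set(row[attribute] for row in rows):
        branch = [l for r, l in zip(rows, labels) if r[attribute] == value]
        after += len(branch) / total * entropy(list(Counter(branch).values()))
    return base - after

def select_attribute(rows, labels, attributes):
    """Pick the attribute with the greatest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Tiny illustration with a single attribute:
rows = [{"Outlook": "Sunny"}, {"Outlook": "Overcast"},
        {"Outlook": "Rainy"}, {"Outlook": "Rainy"}]
labels = ["No", "Yes", "Yes", "No"]
print(select_attribute(rows, labels, ["Outlook"]))
```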
36. CART
A Classification and Regression Tree (CART) is a predictive algorithm used in machine learning.
It explains how a target variable's values can be
predicted based on other values.
It is a decision tree where each fork is a split in a
predictor variable and each node at the end has a
prediction for the target variable.
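As an illustration only, scikit-learn's DecisionTreeClassifier implements an optimised CART variant; the toy feature encoding below is made up for the example:

```python
# Illustrative CART-style learner using scikit-learn (Gini impurity by default).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical numeric encoding of (outlook, temperature) for a few instances:
X = [[0, 85], [0, 80], [1, 83], [2, 70]]
y = ["no", "no", "yes", "yes"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X, y)
print(tree.predict([[1, 75]]))
```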
37. Decision Tree Induction Methods
SLIQ (1996 — Mehta et al.)
Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (1996 — J. Shafer et al.)
Constructs an attribute list data structure.
Both algorithms:
Pre-sort and use attribute lists
Recursively construct the decision tree
Use the Gini index
Re-write the dataset – expensive!
CLOUDS: Approximate version of SPRINT.
38. PUBLIC (1998 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stop growing the
tree earlier
RainForest (1998 — Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
BOAT (1999 — Gehrke, Ganti, Ramakrishnan & Loh)
Uses bootstrapping to create several small samples
39. Random Forest
Random Forest is an example of ensemble learning, in which multiple models are combined to obtain better predictive performance.
Two key concepts that give it the name random:
A random sampling of training data set when building trees.
Random subsets of features considered when splitting nodes.
A technique known as bagging is used to create the ensemble of trees: multiple training sets are generated by sampling with replacement.
In bagging, N bootstrap samples are drawn from the data set using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined in parallel using voting or averaging.
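A minimal illustrative usage sketch with scikit-learn's RandomForestClassifier on synthetic data; the parameter values are arbitrary choices, not from the slides:

```python
# Bagging over bootstrap samples plus a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees
    max_features="sqrt",  # random feature subset considered at each split
    bootstrap=True,       # sample the training data with replacement
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy of the ensemble
```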