Decision Tree
Splitting Indices, Splitting Criteria,
Decision tree construction algorithm
Data Mining Dr. Iram Naim, Dept. of CSIT, MJPRU
Constructing decision trees
 Strategy: top-down
Recursive divide-and-conquer fashion
 First: select attribute for root node
Create branch for each possible attribute value
 Then: split instances into subsets
One for each branch extending from the node
 Finally: repeat recursively for each branch, using
only instances that reach the branch
 Stop if all instances have the same class
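A minimal sketch of this recursion, assuming instances are represented as dicts and a select_attribute scoring helper (a concrete information-gain version is sketched after the ID3 slide below); names are illustrative, not from the slides:

```python
from collections import Counter

def build_tree(instances, attributes, target="Play"):
    """Top-down, recursive divide-and-conquer tree construction."""
    classes = [inst[target] for inst in instances]
    # Stop if all instances that reach this node share one class
    if len(set(classes)) == 1:
        return classes[0]
    # No attributes left: fall back to a majority-class leaf
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Select the attribute for this node, one branch per value
    best = select_attribute(instances, attributes, target)
    node = {best: {}}
    for value in sorted({inst[best] for inst in instances}):
        subset = [inst for inst in instances if inst[best] == value]
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, target)
    return node
```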
Play or not?
• The weather dataset
Which attribute to select?
Best Split
1. Evaluation of the splits for each attribute and
selection of the best split (determination of the
splitting attribute)
2. Determination of the splitting condition on
the selected splitting attribute
3. Partitioning of the data using the best split
Splitting Indices
 Determining the goodness of a split:
1. Information Gain
(from information theory; based on entropy)
2. Gini Index
(from economics; a measure of diversity)
Computing purity: the information measure
• Information is a measure of a
reduction of uncertainty.
• It represents the expected amount of
information that would be needed to
“place” a new instance in a branch.
Which attribute to select?
Final decision tree
 Splitting stops when data can’t be split any further
Criterion for attribute selection
 Which is the best attribute?
 Want to get the smallest tree
 Heuristic: choose the attribute that produces the
“purest” nodes
 Information gain: increases with the average purity of the
subsets
 Strategy: choose the attribute that gives the greatest
information gain
How to compute Information Gain: Entropy
1. When the number of either yes or no is zero (that is,
the node is pure), the information is zero.
2. When the numbers of yes and no are equal, the
information reaches its maximum, because we are very
uncertain about the outcome.
3. Complex scenarios: the measure should be
applicable to a multiclass situation, where a multi-
staged decision must be made.
Entropy: Formulas
 Formula for computing the entropy of a node with class
proportions p1, p2, …, pn:
entropy(p1, p2, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
Entropy: Outlook = sunny
 Of the five sunny instances, two are yes and three are no:
info([2,3]) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.97095059445 ≈ 0.971 bits
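As a minimal sketch, the same number can be computed from raw class counts in plain Python:

```python
from math import log2

def info(counts):
    """Entropy of a class distribution, in bits, e.g. info([2, 3])
    for a branch with 2 yes and 3 no instances."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(info([2, 3]))  # ~0.971 bits (Outlook = sunny)
print(info([9, 5]))  # ~0.940 bits (the whole weather dataset)
```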
Measures: Information & Entropy
• Entropy is a probabilistic measure of
uncertainty or ignorance, and
information is a measure of a reduction
of uncertainty.
• However, in our context we use entropy (i.e.,
the quantity of uncertainty) to measure the
purity of a node.
Example: Outlook
Computing Information Gain
 Information gain: information before splitting minus
information after splitting
gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2])
= 0.940 − 0.693
= 0.247 bits
 Information gain for the attributes of the weather data
(a sketch of the computation follows):
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
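Reusing the info helper sketched above, the Outlook numbers can be reproduced as follows; the branch class counts [2,3], [4,0], [3,2] come from the weather data:

```python
def info_after_split(partitions):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

before = info([9, 5])                               # ~0.940 bits
after = info_after_split([[2, 3], [4, 0], [3, 2]])  # ~0.693 bits
print(before - after)                               # gain(Outlook) ~0.247 bits
```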
Information Gain Drawbacks
 Problematic: attributes with a large number
of values (extreme case: ID code)
Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
Tree stump for ID code attribute
 Entropy of the split (see the Weka book, 2011: 105-108): each of the
14 ID-code branches contains a single instance, so every branch is
pure and the information of the split is 0 bits
 Information gain is therefore maximal for ID code (namely
0.940 − 0 = 0.940 bits)
Information Gain Limitations
 Problematic: attributes with a large number
of values (extreme case: ID code)
 Subsets are more likely to be pure if there is
a large number of values
 Information gain is biased towards choosing
attributes with a large number of values
 This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
 (Another problem: fragmentation)
Gain ratio
 Gain ratio: a modification of the information gain
that reduces its bias
 Gain ratio takes number and size of branches into
account when choosing an attribute
 It corrects the information gain by taking the intrinsic
information of a split into account
 Intrinsic information: the entropy of the distribution of instances
into branches (information about the class is disregarded)
Gain ratios for weather data
Outlook:     Info: 0.693  Gain: 0.940 − 0.693 = 0.247  Split info: info([5,4,5]) = 1.577  Gain ratio: 0.247/1.577 = 0.157
Temperature: Info: 0.911  Gain: 0.940 − 0.911 = 0.029  Split info: info([4,6,4]) = 1.557  Gain ratio: 0.029/1.557 = 0.019
Humidity:    Info: 0.788  Gain: 0.940 − 0.788 = 0.152  Split info: info([7,7]) = 1.000   Gain ratio: 0.152/1.000 = 0.152
Windy:       Info: 0.892  Gain: 0.940 − 0.892 = 0.048  Split info: info([8,6]) = 0.985   Gain ratio: 0.048/0.985 = 0.049
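Continuing the sketch from above, the Outlook row reduces to a few lines; info([5,4,5]) is the intrinsic (split) information of the branch sizes 5, 4, and 5:

```python
gain = info([9, 5]) - info_after_split([[2, 3], [4, 0], [3, 2]])
split_info = info([5, 4, 5])  # intrinsic information, ~1.577 bits
print(gain / split_info)      # gain ratio for Outlook, ~0.157
```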
More on the gain ratio
 “Outlook” still comes out top
 However: “ID code” has greater gain ratio
 Standard fix: ad hoc test to prevent splitting on that
type of attribute
 Problem with gain ratio: it may overcompensate
 May choose an attribute just because its intrinsic
information is very low
 Standard fix: only consider attributes with greater
than average information gain
Gini index
 All attributes are assumed continuous-valued
 Assume there exist several possible split values for each attribute
 May need other tools, such as clustering, to get the possible split
values
 Can be modified for categorical attributes
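For a node with class proportions p1, …, pn, the Gini index is 1 − Σj pj² (the formula slides themselves carry no text); a minimal sketch:

```python
def gini(counts):
    """Gini index of a class distribution: 1 minus the sum of
    squared class proportions (0 for a pure node)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))  # ~0.459 for the whole weather dataset
print(gini([4, 0]))  # 0.0 for a pure node
```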
Splitting Criteria
 Let attribute A be a numerical-valued attribute. We must determine
the best split point for A (binary split)
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is
considered as a possible split point: (ai + ai+1)/2 is the midpoint
between the values ai and ai+1
 The point with the minimum expected information requirement
for A is selected as the split point
Split
 D1 is the set of tuples in D satisfying A ≤ split-point
 D2 is the set of tuples in D satisfying A > split-point
Binary Split
 Numerical-valued attributes
 Examine each possible split point. The midpoint between each pair
of (sorted) adjacent values is taken as a possible split point
 For each split point, compute the weighted sum of the impurity of
the two resulting partitions (D1: A ≤ split-point, D2: A > split-point)
 The point that gives the minimum Gini index for attribute A is
selected as its split point
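A sketch of this midpoint scan, assuming the gini helper above and parallel lists of attribute values and class labels (written for clarity, not efficiency):

```python
from collections import Counter

def best_numeric_split(values, labels):
    """Return the split point with the minimum weighted Gini index,
    scanning midpoints between sorted adjacent values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_score = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal adjacent values yield no midpoint
        point = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = Counter(lab for v, lab in pairs if v <= point)
        right = Counter(lab for v, lab in pairs if v > point)
        # Weighted sum of the impurity of the two partitions
        score = (sum(left.values()) / n) * gini(list(left.values())) + \
                (sum(right.values()) / n) * gini(list(right.values()))
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score
```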
Class Histogram
Two class histograms are used to store the class
distribution for numerical attributes.
Binary Split
 Categorical attributes
 Examine the partitions resulting from all possible subsets of
{a1, …, av}
 Each subset SA defines a binary test on attribute A of the form
“A ∈ SA?”
 There are 2^v possible subsets; excluding the full set and the
empty set leaves 2^v − 2 candidate subsets
 The subset that gives the minimum Gini index for attribute
A is selected as its splitting subset
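A sketch of the subset enumeration, yielding the 2^v − 2 non-trivial subsets with itertools.combinations:

```python
from itertools import combinations

def candidate_subsets(values):
    """Yield every non-empty proper subset of an attribute's values:
    all subset sizes from 1 to v - 1, i.e. 2^v - 2 candidates."""
    values = sorted(values)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            yield set(subset)

print(list(candidate_subsets({"Sunny", "Overcast", "Rainy"})))  # 6 = 2^3 - 2
```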
Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
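As an illustration, such a count matrix for Outlook on the weather data above can be built with pandas.crosstab:

```python
import pandas as pd

# Outlook and Play columns of the 14-row weather dataset above
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy",
           "Overcast", "Sunny", "Sunny", "Rainy", "Sunny", "Overcast",
           "Overcast", "Rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Rows: attribute values; columns: class labels; cells: counts
print(pd.crosstab(pd.Series(outlook, name="Outlook"),
                  pd.Series(play, name="Play")))
```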
Decision tree construction algorithm
1. Information Gain
 • ID3
 • C4.5
 • C5.0
 • J48
2. Gini Index
 • SPRINT
 • SLIQ
Iterative Dichotomizer (ID3)
 Quinlan (1986)
 Each node corresponds to a splitting attribute
 Each arc is a possible value of that attribute.
 At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from
the root.
 Entropy is used to measure how informative a node is.
 The algorithm uses the criterion of information gain to determine the
goodness of a split.
 The attribute with the greatest information gain is taken as
the splitting attribute, and the data set is split for all distinct
values of the attribute.
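To complete the earlier build_tree sketch, a select_attribute helper implementing ID3's information-gain criterion might look like this (reusing the info helper defined above; names are illustrative):

```python
from collections import Counter

def class_counts(instances, target):
    return list(Counter(inst[target] for inst in instances).values())

def select_attribute(instances, attributes, target="Play"):
    """ID3 criterion: pick the attribute with the greatest information gain."""
    base = info(class_counts(instances, target))

    def gain(attr):
        values = {inst[attr] for inst in instances}
        # Weighted entropy of the subsets created by splitting on attr
        after = sum(
            len(subset) / len(instances) * info(class_counts(subset, target))
            for subset in ([i for i in instances if i[attr] == v]
                           for v in values))
        return base - after

    return max(attributes, key=gain)
```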
C4.5
 Quinlan's successor to ID3: selects splits by the gain ratio rather
than raw information gain, and adds support for numeric attributes,
missing values, and pruning.
CART
 A Classification and Regression Tree (CART) is a
predictive algorithm used in machine learning.
 It explains how a target variable's values can be
predicted based on other values.
 It is a decision tree where each fork is a split in a
predictor variable and each node at the end has a
prediction for the target variable.
Decision Tree Induction Methods
 SLIQ (1996, Mehta et al.)
Builds an index for each attribute; only the class list and the current
attribute list reside in memory
 SPRINT (1996, J. Shafer et al.)
Constructs an attribute-list data structure
Both algorithms:
Pre-sort and use attribute lists
Recursively construct the decision tree
Use the Gini index
Rewrite the dataset (expensive!)
 CLOUDS: an approximate version of SPRINT
 PUBLIC (1998 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stop growing the
tree earlier
 RainForest (1998 — Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
 BOAT (1999 — Gehrke, Ganti, Ramakrishnan & Loh)
Uses bootstrapping to create several small samples
Random Forest
 Random Forest is an example of ensemble learning, in which
we combine multiple machine learning models to obtain
better predictive performance.
Two key concepts give it the name “random”:
 Random sampling of the training data set when building each tree.
 Random subsets of the features considered when splitting nodes.
A technique known as bagging is used to create the ensemble of
trees: multiple training sets are generated by sampling with
replacement.
In bagging, the data set is divided into N samples using
randomized sampling. Then, using a single learning algorithm,
a model is built on each sample. Finally, the resulting predictions
are combined in parallel, by voting or averaging.
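As an illustration (not from the original slides), scikit-learn's RandomForestClassifier exposes both sources of randomness described above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# bootstrap=True: each tree is grown on a random sample of the training
# set drawn with replacement (bagging);
# max_features="sqrt": a random feature subset is considered at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # predictions combined by voting
```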
The End