MV PADMAVATI
BHILAI INSTITUTE OF TECHNOLOGY, DURG, INDIA
MACHINE LEARNING
“Learning denotes changes in a system that ... enable a system to do the
same task … more efficiently the next time.” - Herbert Simon
WHAT IS MACHINE LEARNING
Arthur Samuel described it as: “The field of study that gives
computers the ability to learn from data without being
explicitly programmed.”
KIND OF PROBLEMS WHERE MACHINE LEARNING IS APPLICABLE
Traditional Programming: Data + Program → Output
Machine Learning (training): Data → ML algorithm → Model (or Hypothesis)
Machine Learning (prediction): Data whose output is to be predicted → Model → Predicted Output
Data will be divided into training and testing data.
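A minimal sketch of dividing data into training and testing parts, using only the Python standard library. The helper name `split_train_test`, the 30% test fraction, and the fixed seed are assumptions made for illustration; they do not come from the slides.

```python
import random

def split_train_test(rows, test_fraction=0.5, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed: repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))                      # stand-in for 10 labelled examples
train, test = split_train_test(rows, test_fraction=0.3)
print(len(train), len(test))                # 7 3
```

The model is fitted on `train` only; `test` is held back so that accuracy is measured on examples the model has not seen.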
Iris Dataset for Classification
Sepal length  Sepal width  Petal length  Petal width  Class label
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3            1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
4.6           3.1          1.5           0.2          Iris-setosa
5             3.6          1.4           0.2          Iris-setosa
7             3.2          4.7           1.4          Iris-versicolor
6.4           3.2          4.5           1.5          Iris-versicolor
6.9           3.1          4.9           1.5          Iris-versicolor
5.5           2.3          4             1.3          Iris-versicolor
6.5           2.8          4.6           1.5          Iris-versicolor
6.1           2.8          4             1.3          Iris-versicolor
6.3           2.5          4.9           1.5          Iris-versicolor
6.1           2.8          4.7           1.2          Iris-versicolor
6.4           2.9          4.3           1.3          Iris-versicolor
6.3           3.3          6             2.5          Iris-virginica
5.8           2.7          5.1           1.9          Iris-virginica
7.1           3            5.9           2.1          Iris-virginica
6.3           2.9          5.6           1.8          Iris-virginica
Data may come as .csv, .json, .xml or Excel files, among other formats.
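As a small illustration of reading one of these formats, the sketch below parses CSV text with Python's standard `csv` module. The column names and the in-memory string are assumptions chosen for the example; the three data rows are taken from the Iris table above.

```python
import csv
import io

# Assumed column names; the rows come from the Iris table in the slides.
CSV_TEXT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def load_rows(text, label_column="species"):
    """Parse CSV text into a list of (features, label) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    feature_names = [name for name in reader.fieldnames if name != label_column]
    return [([float(row[name]) for name in feature_names], row[label_column])
            for row in reader]

rows = load_rows(CSV_TEXT)
print(rows[0])
```

For a file on disk, the same `csv.DictReader` can be given an open file object instead of the `io.StringIO` wrapper.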
Where can I get datasets?
Prepare your own datasets, OR get data from:
• Kaggle Datasets - https://www.kaggle.com/datasets
• Amazon data sets - https://registry.opendata.aws/
• UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets.html
• Many more…
Machine Learning Steps
Machine Learning Tools
• Git and GitHub
• Python
• Jupyter Notebooks
• NumPy - mostly used for math-based operations during the machine learning process
• Pandas - used to import datasets and manage them
• Matplotlib - used to plot charts in Python
• scikit-learn - an open source Python machine learning library
• Many other Python APIs
Python for Machine Learning
Types of Machine Learning
• Supervised (labeled examples)
• Unsupervised (unlabeled examples)
• Reinforcement (reward)- Selects actions and observes consequences.
Supervised learning
• Machine learning takes data as input. Let's call this data the training data.
• The training data includes both inputs and labels (targets).
• We first train the model with lots of training data (inputs and targets).
Types of Supervised learning
Classification separates the data, Regression fits the data
Basic Problem: Induce a representation of a function (a systematic relationship between inputs and outputs) from examples.
• target function f: X → Y
• example (x, f(x))
• hypothesis g: X → Y such that g(x) = f(x)
x = set of attribute values (attribute-value representation)
Y = set of discrete labels (classification)
Y = continuous values ℜ (regression)
Inductive (Supervised) Learning
Classification
This is a type of problem where we predict a categorical response value: the data can be separated into specific “classes” (i.e. we predict one value out of a fixed set of values).
Some examples are:
1. Is this mail spam or not?
2. Will it rain today or not?
3. Is this picture a cat or not?
These ‘Yes/No’ questions are called binary classification.
Other examples are:
1. Is this mail spam, important, or a promotion?
2. Is this picture a cat, a dog, or a tiger?
This type is called multi-class classification.
Iris Flower - 3 Variety Details
Let us first understand the dataset.
The data set consists of:
• 150 samples
• 3 class labels: species of Iris (Iris setosa, Iris virginica and Iris versicolor)
• 4 features: sepal length, sepal width, petal length and petal width, in cm
Regression
This is a type of problem where we need to predict a continuous response value (e.g. we predict a number that can vary from −infinity to +infinity).
Some examples are:
1. What is the price of a house in Opava?
2. What is the value of the stock?
3. What will the temperature be tomorrow?
etc. There are tons of things we can predict if we wish.
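To make "predicting a continuous value" concrete, here is a minimal sketch that fits a straight line by ordinary least squares. Least squares is one common regression method and is an assumption here, since the slides do not name one. The floor areas and prices are hypothetical toy numbers, constructed to lie exactly on price = 2 × area + 10.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical toy data: floor area -> price (arbitrary units).
areas = [50, 60, 80, 100]
prices = [110, 130, 170, 210]

slope, intercept = fit_line(areas, prices)
predicted = slope * 70 + intercept      # prediction for an unseen area of 70
print(slope, intercept, predicted)
```

Unlike a classifier, the output is not one of a fixed set of labels: any area produces some numeric price.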
Predicting Age- Regression Problem
Unsupervised Learning
The training data does not include targets here, so we don’t tell the system where to go; the system has to work out the structure itself from the data we give.
Clustering
This is a type of problem where we group similar things together. It is similar to multi-class classification, but here we don’t provide the labels; the system works out the groups from the data itself and clusters the data.
Some examples are :
1. Given news articles, cluster into different types of news
2. Given a set of tweets, cluster based on content of tweet
3. Given a set of images, cluster them into different objects
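The sketch below shows grouping without labels on a toy problem. It uses k-means, which is one common clustering algorithm; the slides do not name a specific one, so this choice, the one-dimensional points, and the starting centroids are all assumptions for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 6.0])
print(clusters)
```

No labels are passed in: the two groups emerge only from how close the points are to each other.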
You’re running a company, and you want to develop learning algorithms to
address each of two problems.
Problem 1: You have a large inventory of identical items. You want to predict
how many of these items will sell over the next 3 months.
Problem 2: You’d like software to examine individual customer accounts, and for
each account decide if it has been hacked or not.
Should you treat these as classification or as regression problems?
Treat both as classification problems.
Treat problem 1 as a classification problem, problem 2 as a regression
problem.
Treat problem 1 as a regression problem, problem 2 as a classification
problem.
Treat both as regression problems.
For each of the following examples, which type of learning would you use?
1. Given email labeled as spam/not spam, learn a spam filter.
2. Given a set of news articles found on the web, group them into sets of articles about the same story.
3. Given a database of customer data, automatically discover market segments and group customers into different market segments.
4. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.
Ans 1: Supervised Learning - Classification
Ans 2: Unsupervised Learning - Clustering
Ans 3: Unsupervised Learning - Clustering
Ans 4: Supervised Learning - Classification
Different Classifiers (Algorithms)
• Logistic Regression
• Decision Tree Classifier
• Support Vector Machines
• K-Nearest Neighbors
• Linear discriminant analysis
• Gaussian Naive Bayes
Decision Tree
• The decision tree algorithm falls under the category of supervised learning.
• Decision trees can be used to solve both regression and classification problems.
• Learned functions are represented as decision trees (or if-then-else rules).
Decision Tree Representation (Play Tennis or not)
Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong → No
Decision Trees Expressivity
• Decision trees represent a disjunction of conjunctions (sum of products, SOP) of constraints on the values of attributes:
(Outlook = Sunny ∧ Humidity = Normal) ∨
(Outlook = Overcast) ∨
(Outlook = Rain ∧ Wind = Weak) ⇒ Yes (tennis will be played)
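The same rule can be written directly as a Boolean expression. This is only a sketch: the function name `play_tennis` and the string encoding of attribute values are assumptions; the three conjunctions are the ones listed above.

```python
def play_tennis(outlook, humidity, wind):
    """Disjunction of the three conjunctions read off the tree."""
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )

# The instance Outlook=Sunny, Humidity=High, Wind=Strong is classified "No".
print(play_tennis("Sunny", "High", "Strong"))    # False
print(play_tennis("Overcast", "High", "Strong")) # True
```

Each `and` group corresponds to one root-to-leaf path ending in Yes; the `or` joins the paths.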
Python Code for Classification using Decision Tree
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score

# load the dataset
iris = datasets.load_iris()
x = iris.data    # predictors
y = iris.target  # output labels

# divide the data set into a training set and a testing set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# use a decision tree classifier
classifier = tree.DecisionTreeClassifier()
classifier.fit(x_train, y_train)

# predict on the test data and print the accuracy score
predictions = classifier.predict(x_test)
print(accuracy_score(y_test, predictions))
When to use Decision Trees
• Problem characteristics:
  • Instances can be described by attribute-value pairs
  • Disjunctive hypothesis may be required
  • Possibly noisy training data samples
    • Robust to errors in training data
    • Missing attribute values
• Different classification problems:
  • Equipment or medical diagnosis
  • Credit risk analysis
Top-down induction of Decision Trees
• The construction of the tree is top-down. The algorithm is greedy.
• The fundamental question is “which attribute should be tested next? Which question gives us more information?”
• Select the best attribute.
• A descendant node is then created for each possible value of this attribute, and examples are partitioned according to this value.
• The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left.
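The steps above can be sketched as a short recursive function, in the style of the ID3 algorithm (the slides do not name the algorithm, so that label is an assumption). The helper names, the nested-dict tree representation, and the six-row toy dataset are hypothetical; "best attribute" is taken to mean highest information gain.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    """Return a label (leaf) or a dict {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:              # all examples classified correctly
        return labels[0]
    if not attributes:                     # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep], remaining)
    return {best: branches}

# Hypothetical toy data, not the table used in the slides.
rows = [
    {"Outlook": "Sunny", "Wind": "Weak"}, {"Outlook": "Sunny", "Wind": "Strong"},
    {"Outlook": "Overcast", "Wind": "Weak"}, {"Outlook": "Overcast", "Wind": "Strong"},
    {"Outlook": "Rain", "Wind": "Weak"}, {"Outlook": "Rain", "Wind": "Strong"},
]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
tree = build_tree(rows, labels, ["Outlook", "Wind"])
print(tree)
```

The choice at each node is greedy: once an attribute is picked it is never reconsidered, which is why the result is not guaranteed to be the smallest possible tree.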
Which attribute is the best classifier?
• A statistical property called information gain measures how well a given attribute separates the training examples.
• Information gain uses the notion of entropy, commonly used in information theory.
• Information gain = expected reduction of entropy
Decision Tree Induction
(Recursively) partition examples according to the best
attribute.
Key Concepts
• entropy
  • impurity of a set of examples (entropy = 0 if perfectly homogeneous)
  • (number of bits needed to encode the class of an arbitrary example)
• information gain
  • expected reduction in entropy caused by partitioning
Entropy
• In the machine learning sense, and especially in this case, entropy is a measure of the impurity (lack of homogeneity) of the data.
• For binary classification its value ranges from 0 to 1.
• Its value is 0 if all the examples belong to the same class, and is close to 1 if there is an almost equal split of the data between the two classes.
Entropy
Entropy controls how a Decision Tree decides to split the data. It actually affects how a Decision Tree draws its boundaries.
Entropy in binary classification
Entropy measures the impurity of a collection of examples.
It depends on the distribution of the random variable p.
• S is a collection of training examples
• $p_{+}$ is the proportion of positive examples in S
• $p_{-}$ is the proportion of negative examples in S
$\mathrm{Entropy}(S) \equiv -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$
Entropy([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 (0) = 0 (taking 0 log2 0 = 0)
Entropy([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2 (7/14) – 7/14 log2 (7/14) = 1/2 + 1/2 = 1
Note: the log of a number < 1 is negative, 0 ≤ p ≤ 1, 0 ≤ entropy ≤ 1
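The three worked values above can be checked with a few lines of Python. The function below is a sketch of the binary entropy formula; its name and the count-based signature are assumptions, and it treats 0 log2 0 as 0.

```python
import math

def entropy(positives, negatives):
    """Binary entropy of a collection with the given class counts."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:                       # skip empty classes: 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(14, 0), 3))  # 0.0
print(round(entropy(9, 5), 3))   # 0.94
print(round(entropy(7, 7), 3))   # 1.0
```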
Information Gain as Entropy Reduction
• Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute.
• The higher the information gain, the more effective the attribute is in classifying the training data.
• Expected reduction in entropy knowing A:
$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$
Values(A): the possible values for A
$S_v$: the subset of S for which A has value v
Example
Example: expected information gain
• Let
  • Values(Wind) = {Weak, Strong}
  • S = [9+, 5−], |S| = 14
  • S_Weak = [6+, 2−], |S_Weak| = 8
  • S_Strong = [3+, 3−], |S_Strong| = 6
Entropy(S) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.94
Entropy(S_Weak) = −6/8 log2(6/8) − 2/8 log2(2/8) = 0.811
Entropy(S_Strong) = −1/2 log2(1/2) − 1/2 log2(1/2) = 1
• Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(S_Weak) − 6/14 Entropy(S_Strong)
= 0.94 − 8/14 × 0.811 − 6/14 × 1.00
= 0.048
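The Wind calculation can be reproduced from the class counts alone. This is a sketch: the function names and the representation of each subset as a (positives, negatives) pair are assumptions made for the example.

```python
import math

def entropy(positives, negatives):
    total = positives + negatives
    return -sum(c / total * math.log2(c / total)
                for c in (positives, negatives) if c)

def information_gain(parent, partitions):
    """parent and each partition are (positives, negatives) count pairs."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(*parent) - remainder

# S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
gain_wind = information_gain((9, 5), [(6, 2), (3, 3)])
print(round(gain_wind, 3))  # 0.048
```

The weights 8/14 and 6/14 in the hand calculation appear here as `(p + n) / total`.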
Which attribute is the best classifier?
Example
First step: which attribute to test at the root?
• Which attribute should be tested at the root?
  • Gain(S, Outlook) = 0.246
  • Gain(S, Humidity) = 0.151
  • Gain(S, Wind) = 0.048
  • Gain(S, Temperature) = 0.029
• Outlook provides the best prediction for the target
• Let's grow the tree:
  • add to the tree a successor for each possible value of Outlook
  • partition the training samples according to the value of Outlook
After first step
Second step
• Working on the Outlook=Sunny node:
Gain(S_Sunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(S_Sunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(S_Sunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
• Humidity provides the best prediction for the target
• Let's grow the tree:
  • add to the tree a successor for each possible value of Humidity
  • partition the training samples according to the value of Humidity
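Both selection steps can be reproduced in code. The training table itself is not present in the extracted text, so the sketch below assumes the standard 14-example PlayTennis table from Tom Mitchell's textbook, which matches the counts and example IDs quoted in the slides. Computed gains agree with the slide figures to within about 0.001, because the slides work from rounded intermediate values.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
DATA = [  # D1..D14, assumed from Mitchell's PlayTennis table
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, column):
    """Information gain of splitting rows on the named column."""
    index = COLUMNS.index(column)
    remainder = 0.0
    for value in set(row[index] for row in rows):
        subset = [row[-1] for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[-1] for row in rows]) - remainder

# First step: choose the root attribute.
root_gains = {c: gain(DATA, c) for c in COLUMNS[:-1]}
root = max(root_gains, key=root_gains.get)
print(root)                                   # Outlook

# Second step: choose the attribute for the Outlook=Sunny node.
sunny = [row for row in DATA if row[0] == "Sunny"]
sunny_gains = {c: gain(sunny, c) for c in ("Temperature", "Humidity", "Wind")}
print(max(sunny_gains, key=sunny_gains.get))  # Humidity
```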
Second and third steps
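[Figure: the resulting decision tree. Leaves and the training examples they cover:]
Outlook = Sunny, Humidity = High: {D1, D2, D8} → No
Outlook = Sunny, Humidity = Normal: {D9, D11} → Yes
Outlook = Rain, Wind = Weak: {D4, D5, D10} → Yes
Outlook = Rain, Wind = Strong: {D6, D14} → No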
Noisy Example - D15
D15: Sunny, Hot, Normal, Strong → No
Overfitting in decision trees
Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No
The new noisy example causes the second leaf node to be split.
Overfitting in decision tree learning
• Building trees that “adapt too much” to the training examples may lead to “overfitting”.
Avoid overfitting in Decision Trees
• Two strategies:
1. Stop growing the tree earlier, before perfect classification
2. Allow the tree to overfit the data, and then post-prune the tree
• Each node is a candidate for pruning
• Pruning consists of removing a subtree rooted in a node: the node becomes a leaf and is assigned the most common classification
• Nodes are pruned iteratively: at each iteration the node whose removal most increases accuracy on the validation set is pruned.
• Pruning stops when no pruning increases accuracy
Pruning
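A minimal sketch of the post-pruning loop described above, often called reduced-error pruning. Everything concrete here is an assumption for illustration: the nested-dict tree format with a stored `majority` label, the overfitted tree (the slides do not say which attribute the noisy example D15 causes the tree to split on; Temp is used here), and the four-example validation set.

```python
def classify(tree, example):
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attribute"]], tree["majority"])
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, x) == label for x, label in examples) / len(examples)

def internal_paths(tree, path=()):
    """Yield the branch-value path to every internal (non-leaf) node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["branches"].items():
            yield from internal_paths(child, path + (value,))

def replaced(tree, path):
    """Copy of tree with the node at path turned into its majority-label leaf."""
    if not path:
        return tree["majority"]
    branches = dict(tree["branches"])
    branches[path[0]] = replaced(branches[path[0]], path[1:])
    return {**tree, "branches": branches}

def prune(tree, validation):
    """Repeatedly apply the single prune that most increases validation accuracy."""
    while isinstance(tree, dict):
        best = max((replaced(tree, p) for p in internal_paths(tree)),
                   key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) <= accuracy(tree, validation):
            break                      # no pruning increases accuracy: stop
        tree = best
    return tree

# Hypothetical overfitted tree: the Humidity=Normal leaf was split on Temp.
overfit = {"attribute": "Outlook", "majority": "Yes", "branches": {
    "Sunny": {"attribute": "Humidity", "majority": "No", "branches": {
        "High": "No",
        "Normal": {"attribute": "Temp", "majority": "Yes",
                   "branches": {"Hot": "No", "Mild": "Yes", "Cool": "Yes"}}}},
    "Overcast": "Yes",
    "Rain": {"attribute": "Wind", "majority": "Yes",
             "branches": {"Weak": "Yes", "Strong": "No"}}}}

# Hypothetical validation examples.
validation = [
    ({"Outlook": "Sunny", "Humidity": "Normal", "Temp": "Hot", "Wind": "Weak"}, "Yes"),
    ({"Outlook": "Sunny", "Humidity": "High", "Temp": "Hot", "Wind": "Weak"}, "No"),
    ({"Outlook": "Rain", "Humidity": "High", "Temp": "Mild", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Humidity": "High", "Temp": "Hot", "Wind": "Strong"}, "Yes"),
]

pruned = prune(overfit, validation)
print(accuracy(overfit, validation), accuracy(pruned, validation))
```

On this toy input only the Temp subtree is collapsed back into a Yes leaf; the Humidity and Wind nodes survive because removing either would not raise validation accuracy.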

Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 

Machine learning and decision trees

  • 1. MV PADMAVATI BHILAI INSTITUTE OF TECHNOLOGY, DURG, INDIA MACHINE LEARNING “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon
  • 2. WHAT IS MACHINE LEARNING Arthur Samuel described it as: “The field of study that gives computers the ability to learn from data without being explicitly programmed.”
  • 3. KINDS OF PROBLEMS WHERE MACHINE LEARNING IS APPLICABLE
  • 4. Traditional Programming: Program + Data → Output. Machine Learning: Data → ML algorithm → Model (or Hypothesis); data whose output is to be predicted → Model → Predicted Output. The data will be divided into training and testing data.
  • 5. Iris Dataset for Classification
      Sepal length  Sepal width  Petal length  Petal width  Class label
      5.1           3.5          1.4           0.2          Iris-setosa
      4.9           3            1.4           0.2          Iris-setosa
      4.7           3.2          1.3           0.2          Iris-setosa
      4.6           3.1          1.5           0.2          Iris-setosa
      5             3.6          1.4           0.2          Iris-setosa
      7             3.2          4.7           1.4          Iris-versicolor
      6.4           3.2          4.5           1.5          Iris-versicolor
      6.9           3.1          4.9           1.5          Iris-versicolor
      5.5           2.3          4             1.3          Iris-versicolor
      6.5           2.8          4.6           1.5          Iris-versicolor
      6.1           2.8          4             1.3          Iris-versicolor
      6.3           2.5          4.9           1.5          Iris-versicolor
      6.1           2.8          4.7           1.2          Iris-versicolor
      6.4           2.9          4.3           1.3          Iris-versicolor
      6.3           3.3          6             2.5          Iris-virginica
      5.8           2.7          5.1           1.9          Iris-virginica
      7.1           3            5.9           2.1          Iris-virginica
      6.3           2.9          5.6           1.8          Iris-virginica
      Data will be in the form of .csv, .json, .xml, Excel files, etc.
  • 6. Where can I get datasets? Prepare your own datasets, or get data from:
      • Kaggle Datasets - https://www.kaggle.com/datasets
      • Amazon data sets - https://registry.opendata.aws/
      • UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets.html
      • Many more
  • 8. Machine Learning Tools
      • Git and GitHub
      • Python
      • Jupyter Notebooks
      • NumPy - mostly used to perform math-based operations during the machine learning process
      • Pandas - used to import datasets and manage them
      • Matplotlib - used to plot charts in Python
      • scikit-learn - an open source Python machine learning library
      • Many other Python APIs
  • 10. Types of Machine Learning • Supervised (labeled examples) • Unsupervised (unlabeled examples) • Reinforcement (reward)- Selects actions and observes consequences.
  • 11. Supervised learning • Machine learning takes data as input. Let's call this data the training data. • The training data includes both inputs and labels (targets). • We first train the model with a lot of training data (inputs and targets).
  • 12. Types of Supervised learning Classification separates the data, Regression fits the data
  • 13. Inductive (Supervised) Learning. Basic problem: induce a representation of a function (a systematic relationship between inputs and outputs) from examples. • target function f: X → Y • example (x, f(x)) • hypothesis g: X → Y such that g(x) = f(x). Here x = set of attribute values (attribute-value representation); Y = set of discrete labels (classification) or Y = continuous values (regression).
  • 14. Classification: a type of problem where we predict a categorical response value, i.e. the data can be separated into specific “classes” (we predict one value out of a fixed set of values). Some examples are: 1. Is this mail spam or not? 2. Will it rain today or not? 3. Is this picture a cat or not? Such ‘Yes/No’ questions are called binary classification. Other examples are: 1. Is this mail spam, important, or a promotion? 2. Is this picture a cat, a dog, or a tiger? This type is called multi-class classification.
  • 15. Iris Flower - Details of the 3 Varieties. Let us first understand the dataset. The data set consists of: 150 samples; 3 class labels: species of Iris (Iris setosa, Iris virginica and Iris versicolor); 4 features: sepal length, sepal width, petal length and petal width, in cm.
  • 16. Iris Dataset for Classification
      Sepal length  Sepal width  Petal length  Petal width  Species
      5.1           3.5          1.4           0.2          Iris-setosa
      4.9           3            1.4           0.2          Iris-setosa
      4.7           3.2          1.3           0.2          Iris-setosa
      4.6           3.1          1.5           0.2          Iris-setosa
      5             3.6          1.4           0.2          Iris-setosa
      7             3.2          4.7           1.4          Iris-versicolor
      6.4           3.2          4.5           1.5          Iris-versicolor
      6.9           3.1          4.9           1.5          Iris-versicolor
      5.5           2.3          4             1.3          Iris-versicolor
      6.5           2.8          4.6           1.5          Iris-versicolor
      6.1           2.8          4             1.3          Iris-versicolor
      6.3           2.5          4.9           1.5          Iris-versicolor
      6.1           2.8          4.7           1.2          Iris-versicolor
      6.4           2.9          4.3           1.3          Iris-versicolor
      6.3           3.3          6             2.5          Iris-virginica
      5.8           2.7          5.1           1.9          Iris-virginica
      7.1           3            5.9           2.1          Iris-virginica
      6.3           2.9          5.6           1.8          Iris-virginica
  • 17. Regression: a type of problem where we need to predict a continuous response value (we predict a number that can vary from −infinity to +infinity). Some examples are: 1. What is the price of a house in Opava? 2. What is the value of the stock? 3. What will the temperature be tomorrow? There are many more things we can predict if we wish.
  • 19. Unsupervised Learning: the training data does not include targets, so we do not tell the system where to go; the system has to find structure by itself in the data we give it.
  • 20. Clustering: a type of problem where we group similar things together. It is similar to multi-class classification, but here we do not provide the labels; the system finds the structure in the data itself and clusters the data. Some examples are: 1. Given news articles, cluster them into different types of news. 2. Given a set of tweets, cluster them based on their content. 3. Given a set of images, cluster them into different objects.
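As a rough illustration of clustering, the sketch below groups the Iris measurements without ever looking at the species labels. The choice of k-means, of 3 clusters, and of the `n_init` and `random_state` values are assumptions made for this example; the slides do not name a clustering algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Use only the four measurements; the species labels are deliberately ignored.
X = load_iris().data

# Assumption: ask for 3 groups, since we happen to know there are 3 species.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Each sample now carries a cluster id (0, 1 or 2) instead of a class label.
print(cluster_ids[:10])
```

The cluster ids are arbitrary numbers: nothing ties cluster 0 to a particular species, which is the practical difference from classification.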
  • 22. You’re running a company, and you want to develop learning algorithms to address each of two problems. Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months. Problem 2: You’d like software to examine individual customer accounts, and for each account decide if it has been hacked or not. Should you treat these as classification or as regression problems? (a) Treat both as classification problems. (b) Treat problem 1 as a classification problem, problem 2 as a regression problem. (c) Treat problem 1 as a regression problem, problem 2 as a classification problem. (d) Treat both as regression problems.
  • 23. For each of the following examples, which type of learning would you use? 1. Given email labeled as spam/not spam, learn a spam filter. 2. Given a set of news articles found on the web, group them into sets of articles about the same story. 3. Given a database of customer data, automatically discover market segments and group customers into different market segments. 4. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not. Ans 1: Supervised Learning - Classification. Ans 2: Unsupervised Learning - Clustering. Ans 3: Unsupervised Learning - Clustering. Ans 4: Supervised Learning - Classification.
  • 24. Different Classifiers (Algorithms) • Logistic Regression • Decision Tree Classifier • Support Vector Machines • K-Nearest Neighbors • Linear Discriminant Analysis • Gaussian Naive Bayes
  • 25. Decision Tree • The decision tree algorithm falls under the category of supervised learning. • Decision trees can be used to solve both regression and classification problems. • Learned functions are represented as decision trees (or if-then-else rules).
  • 26. Decision Tree Representation (Play Tennis or not): Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong → No
  • 27. Decision Trees Expressivity: decision trees represent a disjunction of conjunctions (sum of products, SOP) of constraints on the values of attributes: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak) → Yes (tennis will be played)
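The disjunction of conjunctions above can be written directly as a boolean expression. The function below is a minimal sketch; the name `play_tennis` and the string encoding of the attribute values are assumptions made for illustration.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Return True when the rule from the PlayTennis tree says 'Yes'."""
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )


# The instance from the representation slide: Sunny, High humidity, Strong wind.
print(play_tennis("Sunny", "High", "Strong"))  # False, i.e. "No"
```

Temperature does not appear in the rule, so it is not a parameter: the tree never tests it.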
  • 28. Python Code for Classification using Decision Tree
      from sklearn import datasets
      from sklearn.model_selection import train_test_split
      from sklearn import tree
      from sklearn.metrics import accuracy_score

      # load the dataset
      iris = datasets.load_iris()
      x = iris.data    # predictors
      y = iris.target  # output labels

      # divide the data set into a training set and a testing set (50% each)
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5)

      # train a decision tree classifier on the training set
      classifier = tree.DecisionTreeClassifier()
      classifier.fit(x_train, y_train)

      # predict on the test data and print the accuracy score
      predictions = classifier.predict(x_test)
      print(accuracy_score(y_test, predictions))
  • 29. When to use Decision Trees. Problem characteristics: • Instances can be described by attribute-value pairs • A disjunctive hypothesis may be required • Training data may be noisy (decision trees are robust to errors in training data) • Attribute values may be missing. Typical classification problems: • Equipment or medical diagnosis • Credit risk analysis
  • 30. Top-down induction of Decision Trees • The construction of the tree is top-down, and the algorithm is greedy. • The fundamental question is “Which attribute should be tested next? Which question gives us the most information?” • Select the best attribute. • A descendant node is then created for each possible value of this attribute, and the examples are partitioned according to this value. • The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left.
  • 31. Which attribute is the best classifier? • A statistical property called information gain measures how well a given attribute separates the training examples. • Information gain uses the notion of entropy, commonly used in information theory. • Information gain = expected reduction of entropy.
  • 32. Decision Tree Induction: (recursively) partition the examples according to the best attribute. Key concepts: • entropy: the impurity of a set of examples (entropy = 0 if perfectly homogeneous), i.e. the number of bits needed to encode the class of an arbitrary example • information gain: the expected reduction in entropy caused by partitioning
  • 33. Entropy • In the machine learning sense, and especially in this case, entropy is a measure of the impurity (lack of homogeneity) of the data. • For two classes, its value ranges from 0 to 1. • Its value is 0 if all the examples belong to the same class, and close to 1 if the data is split almost equally between the classes.
  • 34. Entropy controls how a decision tree decides to split the data. It affects how a decision tree draws its boundaries.
  • 35. Entropy in binary classification. Entropy measures the impurity of a collection of examples. It depends on the distribution of the class labels in the collection.
      • S is a collection of training examples
      • $p_+$ is the proportion of positive examples in S
      • $p_-$ is the proportion of negative examples in S
      $\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
      $\mathrm{Entropy}([14+, 0-]) = -\tfrac{14}{14} \log_2 \tfrac{14}{14} - 0 \log_2 0 = 0$ (taking $0 \log_2 0 = 0$)
      $\mathrm{Entropy}([9+, 5-]) = -\tfrac{9}{14} \log_2 \tfrac{9}{14} - \tfrac{5}{14} \log_2 \tfrac{5}{14} = 0.94$
      $\mathrm{Entropy}([7+, 7-]) = -\tfrac{7}{14} \log_2 \tfrac{7}{14} - \tfrac{7}{14} \log_2 \tfrac{7}{14} = \tfrac{1}{2} + \tfrac{1}{2} = 1$
      Note: the log of a number less than 1 is negative; since $0 \le p \le 1$, we have $0 \le \mathrm{Entropy}(S) \le 1$.
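The three entropy values above can be checked numerically. This is a minimal sketch; the function name `entropy` and its count-based signature are choices made for this example, and it uses the convention that a class with zero examples contributes nothing to the sum.

```python
import math


def entropy(pos: int, neg: int) -> float:
    """Binary entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # a class with no examples contributes 0 (0 * log2(0) is taken as 0)
            p = count / total
            result -= p * math.log2(p)
    return result


for pos, neg in [(14, 0), (9, 5), (7, 7)]:
    print(f"Entropy([{pos}+, {neg}-]) = {entropy(pos, neg):.3f}")
```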
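  • 36. Information Gain as Entropy Reduction • Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute. • The higher the information gain, the more effective the attribute is in classifying the training data. • Expected reduction in entropy from knowing A:
      $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \mathrm{Entropy}(S_v)$
      where $\mathrm{Values}(A)$ is the set of possible values for A and $S_v$ is the subset of S for which A has value v.
  • 38. Example: expected information gain. Let
      • $\mathrm{Values}(\mathrm{Wind}) = \{\mathrm{Weak}, \mathrm{Strong}\}$
      • $S = [9+, 5-]$, $|S| = 14$
      • $S_{\mathrm{Weak}} = [6+, 2-]$, $|S_{\mathrm{Weak}}| = 8$
      • $S_{\mathrm{Strong}} = [3+, 3-]$, $|S_{\mathrm{Strong}}| = 6$
      $\mathrm{Entropy}(S) = -\tfrac{9}{14} \log_2 \tfrac{9}{14} - \tfrac{5}{14} \log_2 \tfrac{5}{14} = 0.94$
      $\mathrm{Entropy}(S_{\mathrm{Weak}}) = -\tfrac{6}{8} \log_2 \tfrac{6}{8} - \tfrac{2}{8} \log_2 \tfrac{2}{8} = 0.811$
      $\mathrm{Entropy}(S_{\mathrm{Strong}}) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1$
      Information gain due to knowing Wind:
      $\mathrm{Gain}(S, \mathrm{Wind}) = \mathrm{Entropy}(S) - \tfrac{8}{14} \mathrm{Entropy}(S_{\mathrm{Weak}}) - \tfrac{6}{14} \mathrm{Entropy}(S_{\mathrm{Strong}}) = 0.94 - \tfrac{8}{14} \times 0.811 - \tfrac{6}{14} \times 1.00 = 0.048$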
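The Wind example can be reproduced from the class counts alone. The sketch below assumes each set of examples is given as a tuple of (positive, negative) counts; the helper names `entropy` and `information_gain` are illustrative choices, not part of any library.

```python
import math


def entropy(counts):
    """Entropy of a set described by its per-class example counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    total = sum(parent)
    remainder = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - remainder


# S = [9+, 5-], split by Wind into Weak = [6+, 2-] and Strong = [3+, 3-].
gain_wind = information_gain((9, 5), [(6, 2), (3, 3)])
print(round(gain_wind, 3))  # 0.048
```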
  • 39. Which attribute is the best classifier?
  • 41. First step: which attribute should be tested at the root? • Gain(S, Outlook) = 0.246 • Gain(S, Humidity) = 0.151 • Gain(S, Wind) = 0.048 • Gain(S, Temperature) = 0.029 • Outlook provides the best prediction for the target. • Let's grow the tree: add to the tree a successor for each possible value of Outlook, and partition the training samples according to the value of Outlook.
  • 43. Second step. Working on the Outlook=Sunny node:
      $\mathrm{Gain}(S_{\mathrm{Sunny}}, \mathrm{Humidity}) = 0.970 - \tfrac{3}{5} \times 0.0 - \tfrac{2}{5} \times 0.0 = 0.970$
      $\mathrm{Gain}(S_{\mathrm{Sunny}}, \mathrm{Wind}) = 0.970 - \tfrac{2}{5} \times 1.0 - \tfrac{3}{5} \times 0.918 = 0.019$
      $\mathrm{Gain}(S_{\mathrm{Sunny}}, \mathrm{Temp}) = 0.970 - \tfrac{2}{5} \times 0.0 - \tfrac{2}{5} \times 1.0 - \tfrac{1}{5} \times 0.0 = 0.570$
      • Humidity provides the best prediction for the target. • Let's grow the tree: add to the tree a successor for each possible value of Humidity, and partition the training samples according to the value of Humidity.
  • 44. Second and third steps. Outlook=Sunny: Humidity=High {D1, D2, D8} → No; Humidity=Normal {D9, D11} → Yes. Outlook=Rain: Wind=Weak {D4, D5, D10} → Yes; Wind=Strong {D6, D14} → No.
  • 45. Noisy example D15: Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No
  • 46. Overfitting in decision trees: Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No → the new noisy example causes the second leaf node to be split.
  • 47. Overfitting in decision tree learning: building trees that “adapt too much” to the training examples may lead to “overfitting”.
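The three induction steps can be sketched as a small recursive ID3-style procedure. The transcript does not include the training table, so the 14 rows below are an assumption: they are the standard PlayTennis examples D1 to D14 from Mitchell's textbook, which match the counts and leaf sets quoted in the slides. The function names are illustrative.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Wind"]
# Assumed data: the standard PlayTennis table, rows D1..D14 in order.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(ATTRIBUTES + ["Play"], row)) for row in ROWS]


def entropy(examples):
    """Entropy of the PlayTennis label over a list of examples."""
    counts = Counter(e["Play"] for e in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def gain(examples, attr):
    """Information gain obtained by partitioning the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder


def id3(examples, attrs):
    """Greedy top-down induction: returns a label or a nested dict."""
    labels = [e["Play"] for e in examples]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(attrs, key=lambda a: gain(examples, a))
    rest = [a for a in attrs if a != best]
    return {best: {value: id3([e for e in examples if e[best] == value], rest)
                   for value in sorted({e[best] for e in examples})}}


tree = id3(DATA, ATTRIBUTES)
print(tree)
```

On this table the procedure picks Outlook at the root, then Humidity under Sunny and Wind under Rain, which agrees with the leaf sets listed above.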
  • 48. Avoid overfitting in Decision Trees. Two strategies: 1. Stop growing the tree earlier, before perfect classification. 2. Allow the tree to overfit the data, and then post-prune the tree. Pruning: • Each node is a candidate for pruning. • Pruning consists of removing the subtree rooted at a node: the node becomes a leaf and is assigned the most common classification. • Nodes are pruned iteratively: at each iteration, the node whose removal most increases accuracy on the validation set is pruned. • Pruning stops when no pruning increases accuracy.
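Post-pruning with a validation set can be sketched with scikit-learn. Note the swap: scikit-learn does not provide the node-by-node reduced-error pruning described above, so this example uses its cost-complexity pruning (`ccp_alpha`) instead and only borrows the idea of choosing the pruned tree by validation accuracy. The 50/50 split and the `random_state` values are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Strategy 2: first let the tree grow fully on the training set.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning strengths, from no pruning up to a single-node tree.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_tree, best_acc = full_tree, full_tree.score(X_val, y_val)
for alpha in path.ccp_alphas:
    # Guard against tiny negative alphas caused by floating-point rounding.
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=max(alpha, 0.0))
    candidate.fit(X_train, y_train)
    acc = candidate.score(X_val, y_val)
    if acc >= best_acc:  # on ties, prefer the more heavily pruned tree
        best_tree, best_acc = candidate, acc

print(full_tree.tree_.node_count, best_tree.tree_.node_count, best_acc)
```

Strategy 1 (stopping early) corresponds to constructor parameters such as `max_depth` or `min_samples_leaf` on the same class.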