Data Mining
Classification
Prithwis Mukerjee, Ph.D.
Prithwis
Mukerjee 2
Classification
Definition
  - The separation or ordering of objects (or things) into classes
A Priori Classification
  - When the classification is done before you have looked at the data
A Posteriori Classification
  - When the classification is done after you have looked at the data
General approach
You decide on the classes without looking at the data
  - For example : high risk, medium risk and low risk classes
You “train” the system
  - Take a small set of objects, the training set
  - Each object has a set of attributes
  - Classify the objects in this small (“training”) set into the three classes, without looking at the attributes
  - You will need human expertise here to classify the objects
  - Now find a set of rules based on the attributes such that the system classifies the objects just as you have done without looking at the attributes
Use these rules to classify the full set of objects
Prithwis
Mukerjee 4
If we have this data ...
Name        Eggs  Pouch  Flies  Feathers  Class
Cockatoo    Yes   No     Yes    Yes       Bird
Dugong      No    No     No     No        Mammal
Echidna     Yes   Yes    No     No        Marsupial
Emu         Yes   No     No     Yes       Bird
Kangaroo    No    Yes    No     No        Marsupial
Koala       No    Yes    No     No        Marsupial
Kookaburra  Yes   No     Yes    Yes       Bird
Owl         Yes   No     Yes    Yes       Bird
Penguin     Yes   No     No     Yes       Bird
Platypus    Yes   No     No     No        Mammal
Possum      No    Yes    No     No        Marsupial
Wombat      No    Yes    No     No        Marsupial
Prithwis
Mukerjee 5
We need to build a decision tree like ....
Pouch ?
  - YES : Marsupial
  - NO : Feathers ?
      - YES : Bird
      - NO : Mammal
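As a minimal sketch, the two-question tree (Pouch first, then Feathers) can be written as nested tests and checked against the training table. The Python names `ANIMALS` and `classify` are illustrative and not from the slides, and the assignment of Dugong, Echidna and Kookaburra to their rows is an assumption based on the table's alphabetical order.

```python
# Training records: (name, eggs, pouch, flies, feathers, class).
# Row assignment for Dugong, Echidna and Kookaburra is assumed (alphabetical order).
ANIMALS = [
    ("Cockatoo", True, False, True, True, "Bird"),
    ("Dugong", False, False, False, False, "Mammal"),
    ("Echidna", True, True, False, False, "Marsupial"),
    ("Emu", True, False, False, True, "Bird"),
    ("Kangaroo", False, True, False, False, "Marsupial"),
    ("Koala", False, True, False, False, "Marsupial"),
    ("Kookaburra", True, False, True, True, "Bird"),
    ("Owl", True, False, True, True, "Bird"),
    ("Penguin", True, False, False, True, "Bird"),
    ("Platypus", True, False, False, False, "Mammal"),
    ("Possum", False, True, False, False, "Marsupial"),
    ("Wombat", False, True, False, False, "Marsupial"),
]


def classify(pouch: bool, feathers: bool) -> str:
    """Test Pouch first, then Feathers; Eggs and Flies are never consulted."""
    if pouch:
        return "Marsupial"
    return "Bird" if feathers else "Mammal"


# Names of training animals whose predicted class differs from the given class.
mismatches = [
    name
    for name, eggs, pouch, flies, feathers, label in ANIMALS
    if classify(pouch, feathers) != label
]
print(mismatches)  # []
```

An empty list means the two attributes are enough to reproduce every class in the training set.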
Question is ...
Why did we ignore two attributes ?
  - Eggs ?
  - Flies ?
Why did we use the attribute called POUCH first ?
  - And then we used the attribute called FEATHERS
A rigorous classification process should tell us
  - If there are lots of attributes to be looked at, which are the important ones ?
  - In which order should we look at the attributes ?
So that the classification arrived at is very similar to the classification done with the training set
Decision Tree : Tree Induction Algorithm
Step 1 : Place all members into one node
  - If all members belong to the same class
      - Stop : there is nothing to be done
Step 2 : Else
  - Choose one attribute and, based on its value, split the node into two nodes
  - For each of the two nodes
      - If all members belong to the same class
          - Stop
      - Else : recursively go to Step 1
Big question : How do you choose which attribute to split a node on ?
  - Information Theory
  - GINI Index
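The steps above can be sketched as a short recursive function. This is an illustration under stated assumptions, not the deck's own code: `induce_tree`, `first_attribute` and the toy rows are hypothetical, the node is split once per distinct attribute value (two nodes for a Yes/No attribute), and the majority-class fallback for a node with no attributes left is an addition the slide does not cover. The attribute chooser is passed in, since how to choose is the open question.

```python
from collections import Counter


def induce_tree(rows, attributes, target, choose_attribute):
    """Recursively split `rows` (a list of dicts) until each node holds one class."""
    classes = [row[target] for row in rows]
    # Stop: all members belong to the same class, so this node is a leaf.
    if len(set(classes)) == 1:
        return classes[0]
    # Assumption beyond the slide: with no attributes left, use the majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    best = choose_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != best]
    children = {}
    # One child node per distinct value of the chosen attribute.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        children[value] = induce_tree(subset, remaining, target, choose_attribute)
    return {best: children}


def first_attribute(rows, attributes, target):
    """Placeholder chooser: takes attributes in the order given."""
    return attributes[0]


rows = [
    {"pouch": "yes", "feathers": "no", "class": "Marsupial"},
    {"pouch": "no", "feathers": "yes", "class": "Bird"},
    {"pouch": "no", "feathers": "no", "class": "Mammal"},
]
tree = induce_tree(rows, ["pouch", "feathers"], "class", first_attribute)
print(tree)
```

On these toy rows the result nests the Feathers test under Pouch = no. Swapping `first_attribute` for a chooser based on information gain or the GINI index changes only which attribute is tested at each node.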
Information Theory : Recapitulate
Information Content I
  - Of an event E
  - That has n possible outcomes
  - Where outcome i happens with probability p_i
  - Is defined as I = Σ_i ( -p_i log2(p_i) )
Example :
  - Event E_A has two possible outcomes
      - p_1 = 1, p_2 = 0 : outcome 1 is a certainty
      - I = 0 because there is NO information in the outcome ( taking 0 x log2(0) as 0 )
  - Event E_B has two possible outcomes
      - p_1 = 0.5, p_2 = 0.5 : both outcomes are equally likely
      - I = -0.5 log2(0.5) - 0.5 log2(0.5) = 1
      - This is the maximum information possible for an event with two outcomes
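A minimal sketch of this definition in Python, assuming the usual convention that an outcome with probability 0 contributes nothing to the sum. The function name `information` is illustrative.

```python
import math


def information(probabilities):
    """Sum of -p * log2(p) over all outcomes; terms with p == 0 are skipped."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(information([1.0, 0.0]))  # certain outcome: 0.0
print(information([0.5, 0.5]))  # two equally likely outcomes: 1.0
```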
Information in the roll of a dice
Fair dice
  - All numbers 1 to 6 equally probable ( p_i = 1/6 )
  - I = 6 x (-1/6) log2(1/6) = 2.585
Loaded dice, case 1
  - p_6 = 0.5 ; p_1 = p_2 = p_3 = p_4 = p_5 = 0.1
  - I = 5 x (-0.1) log2(0.1) - 0.5 x log2(0.5) = 2.16
Loaded dice, case 2
  - p_6 = 0.75 ; p_1 = p_2 = p_3 = p_4 = p_5 = 0.05
  - I = 5 x (-0.05) log2(0.05) - 0.75 x log2(0.75) = 1.39
Point to note ...
  - We can change the information in the roll of the dice by changing the probabilities of the various outcomes !
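The three dice cases can be checked numerically with a short sketch. The helper `information` and the variable names are illustrative, not from the slides.

```python
import math


def information(probabilities):
    """Sum of -p * log2(p) over all outcomes; terms with p == 0 are skipped."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Probabilities are listed for faces 1..6, with face 6 last.
fair = [1 / 6] * 6
loaded_case_1 = [0.1] * 5 + [0.5]
loaded_case_2 = [0.05] * 5 + [0.75]

for name, dist in [("fair", fair), ("case 1", loaded_case_1), ("case 2", loaded_case_2)]:
    print(name, round(information(dist), 3))
```

The printed values agree with 2.585, 2.16 and 1.39 to the precision quoted: the more lopsided the probabilities, the lower the information.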
How do we change the information ?
In a dice
  - We make mechanical modifications so that the probability of each outcome changes
  - This is highly illegal
In a set of individuals
  - We regroup the individuals into classes so that the probability of each class changes
  - This is highly permitted in our algorithm
Consider the following scenario ..
Probability of each outcome ( or class )
  - P(A) = 3/10 , P(B) = 3/10 , P(C) = 4/10
Total Information Content of Set S
  - I = -(3/10) log2(3/10) - (3/10) log2(3/10) - (4/10) log2(4/10) = 1.57
ID Home Married Gender Employed Credit Class
1 Yes Yes Male Yes A B
2 No No Female Yes A A
3 Yes Yes Female Yes B C
4 Yes No Male No B B
5 No Yes Female Yes B C
6 No No Female Yes B A
7 No No Male No B B
8 Yes No Female Yes A A
9 No Yes Female Yes A C
10 Yes Yes Female Yes A C
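A small sketch, with illustrative names, that derives the class probabilities from the Class column of the table and evaluates the total information content of S.

```python
import math
from collections import Counter


def information_of(labels):
    """Information content of a set, from the relative frequency of each class."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


# Class column of the table, for IDs 1 to 10 in order.
classes = ["B", "A", "C", "B", "C", "A", "B", "A", "C", "C"]

print(sorted(Counter(classes).items()))    # [('A', 3), ('B', 3), ('C', 4)]
print(round(information_of(classes), 2))   # 1.57
```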
Suppose we split this set on HOME
S1 : members with HOME = No
ID Home Married Gender Employed Credit Class
2 No No Female Yes A A
5 No Yes Female Yes B C
6 No No Female Yes B A
7 No No Male No B B
9 No Yes Female Yes A C
  - P1(A) = 2/5 , P1(B) = 1/5 , P1(C) = 2/5
I1 : Information in set S1
  - I1 = -(2/5) log2(2/5) - (1/5) log2(1/5) - (2/5) log2(2/5) = 1.52
S2 : members with HOME = Yes
ID Home Married Gender Employed Credit Class
1 Yes Yes Male Yes A B
3 Yes Yes Female Yes B C
4 Yes No Male No B B
8 Yes No Female Yes A A
10 Yes Yes Female Yes A C
  - P2(A) = 1/5 , P2(B) = 2/5 , P2(C) = 2/5
I2 : Information in set S2
  - I2 = -(1/5) log2(1/5) - (2/5) log2(2/5) - (2/5) log2(2/5) = 1.52
Total Information in S1 and S2
  - 0.5 I1 + 0.5 I2 = 0.5 x 1.52 + 0.5 x 1.52 = 1.52
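The split on HOME can be reproduced with a short sketch: group the class labels by the HOME value, then weight each subset's information by its share of the members. The function names are illustrative, not from the slides.

```python
import math
from collections import Counter, defaultdict


def information_of(labels):
    """Information content of a set, from the relative frequency of each class."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_after_split(pairs):
    """pairs: (attribute value, class). Weighted average of the subsets' information."""
    groups = defaultdict(list)
    for value, label in pairs:
        groups[value].append(label)
    total = len(pairs)
    return sum(len(g) / total * information_of(g) for g in groups.values())


# (Home, Class) for IDs 1 to 10, copied from the table of set S.
home_and_class = [
    ("Yes", "B"), ("No", "A"), ("Yes", "C"), ("Yes", "B"), ("No", "C"),
    ("No", "A"), ("No", "B"), ("Yes", "A"), ("No", "C"), ("Yes", "C"),
]
print(round(information_after_split(home_and_class), 2))  # 1.52
```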
Impact of HOME attribute
In sets S1 and S2, the attribute HOME was the same for every member
But in set S the attribute HOME is not the same and so is of some significance
What is the significance of the HOME attribute ?
By adding the HOME attribute we have increased the information content
  - FROM : 1.52
  - TO : 1.57
So the HOME attribute adds 0.05 to the overall information content
  - Or the HOME attribute reduces uncertainty by 0.05
Let us go back to the original set S ..
Probability of each outcome ( or class )
  - P(A) = 3/10 , P(B) = 3/10 , P(C) = 4/10
Total Information Content of Set S
  - I = -(3/10) log2(3/10) - (3/10) log2(3/10) - (4/10) log2(4/10) = 1.57
ID Home Married Gender Employed Credit Class
1 Yes Yes Male Yes A B
2 No No Female Yes A A
3 Yes Yes Female Yes B C
4 Yes No Male No B B
5 No Yes Female Yes B C
6 No No Female Yes B A
7 No No Male No B B
8 Yes No Female Yes A A
9 No Yes Female Yes A C
10 Yes Yes Female Yes A C
This time we split on GENDER
S1 : members with GENDER = Female
ID Home Married Gender Employed Credit Class
2 No No Female Yes A A
3 Yes Yes Female Yes B C
5 No Yes Female Yes B C
6 No No Female Yes B A
8 Yes No Female Yes A A
9 No Yes Female Yes A C
10 Yes Yes Female Yes A C
  - P1(A) = 3/7 , P1(B) = 0/7 , P1(C) = 4/7
I1 : Information in set S1
  - I1 = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
S2 : members with GENDER = Male
ID Home Married Gender Employed Credit Class
1 Yes Yes Male Yes A B
4 Yes No Male No B B
7 No No Male No B B
  - P2(A) = 0/3 , P2(B) = 3/3 , P2(C) = 0/3
I2 : Information in set S2
  - I2 = 0
Total Information in S1 and S2
  - (7/10) I1 + (3/10) I2 = 7/10 x 0.985 + 3/10 x 0 = 0.69
Impact of GENDER attribute
In sets S1 and S2, the attribute GENDER was the same for every member
But in set S the attribute GENDER is not the same and so is of some significance
What is the significance of the GENDER attribute ?
By adding the GENDER attribute we have increased the information content
  - FROM : 0.69
  - TO : 1.57
So the GENDER attribute adds 0.88 to the overall information content
  - Or the GENDER attribute reduces uncertainty by 0.88
If we were to do this for all attributes ...
We would observe that GENDER is the best candidate for the split
Attribute  Information before Split  Information after Split  Information Gain
Home       1.57                      1.52                     0.05
Married    1.57                      0.85                     0.72
Gender     1.57                      0.69                     0.88
Employed   1.57                      1.12                     0.45
Credit     1.57                      1.52                     0.05
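The whole table can be recomputed from the ten records with a short sketch. The names `ROWS`, `COLUMNS` and `information_gain` are illustrative; the data is copied from the table of set S.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Home", "Married", "Gender", "Employed", "Credit"]
# One tuple per ID 1..10: the five attributes above, then the class.
ROWS = [
    ("Yes", "Yes", "Male", "Yes", "A", "B"),
    ("No", "No", "Female", "Yes", "A", "A"),
    ("Yes", "Yes", "Female", "Yes", "B", "C"),
    ("Yes", "No", "Male", "No", "B", "B"),
    ("No", "Yes", "Female", "Yes", "B", "C"),
    ("No", "No", "Female", "Yes", "B", "A"),
    ("No", "No", "Male", "No", "B", "B"),
    ("Yes", "No", "Female", "Yes", "A", "A"),
    ("No", "Yes", "Female", "Yes", "A", "C"),
    ("Yes", "Yes", "Female", "Yes", "A", "C"),
]


def information_of(labels):
    """Information content of a set, from the relative frequency of each class."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Information before the split minus the weighted information after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    after = sum(len(g) / len(rows) * information_of(g) for g in groups.values())
    return information_of([row[-1] for row in rows]) - after


gains = {name: round(information_gain(ROWS, i), 2) for i, name in enumerate(COLUMNS)}
print(gains)
print(max(gains, key=gains.get))  # Gender
```

Choosing the attribute with the largest gain at each node is one way to answer the "which attribute to split on" question.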
And the first part of our tree would be ...
Gender
  - Male : Class B
  - Female : What Next ?
Remove GENDER and Class B and continue
ID Home Married Employed Credit Class
2 No No Yes A A
3 Yes Yes Yes B C
5 No Yes Yes B C
6 No No Yes B A
8 Yes No Yes A A
9 No Yes Yes A C
10 Yes Yes Yes A C
Probability of each outcome ( or class )
  - P(A) = 3/7 , P(C) = 4/7
Total Information Content of Set S
  - I = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
We split this set on HOME ...
S1 : members with HOME = No
ID Home Married Employed Credit Class
2 No No Yes A A
5 No Yes Yes B C
6 No No Yes B A
9 No Yes Yes A C
  - P1(A) = 2/4 , P1(C) = 2/4
I1 : Information in set S1
  - I1 = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1.00
S2 : members with HOME = Yes
ID Home Married Employed Credit Class
3 Yes Yes Yes B C
8 Yes No Yes A A
10 Yes Yes Yes A C
  - P2(A) = 1/3 , P2(C) = 2/3
I2 : Information in set S2
  - I2 = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.918
Total Information in S1 and S2
  - (4/7) I1 + (3/7) I2 = 4/7 x 1.00 + 3/7 x 0.918 = 0.965
Gain
  - 0.985 - 0.965 = 0.02
But if we were to split on MARRIED
S1 : members with MARRIED = No
ID Home Married Employed Credit Class
2 No No Yes A A
8 Yes No Yes A A
6 No No Yes B A
  - P1(A) = 3/3 , P1(C) = 0/3
I1 : Information in set S1
  - I1 = 0.0
S2 : members with MARRIED = Yes
ID Home Married Employed Credit Class
3 Yes Yes Yes B C
9 No Yes Yes A C
10 Yes Yes Yes A C
5 No Yes Yes B C
  - P2(A) = 0/4 , P2(C) = 4/4
I2 : Information in set S2
  - I2 = 0.0
Total Information in S1 and S2
  - 0.0
Gain
  - 0.985 - 0 = 0.985
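The second-level split can be recomputed on the seven Female records with the same kind of sketch. The names are illustrative and the data is copied from the table with GENDER removed. Evaluated directly, -(3/7) log2(3/7) - (4/7) log2(4/7) is about 0.985, the same value obtained for the Female subset in the GENDER split, and that is the figure the gains below are measured against.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Home", "Married", "Employed", "Credit"]
# Female records only (IDs 2, 3, 5, 6, 8, 9, 10); the last field is the class.
FEMALE_ROWS = [
    ("No", "No", "Yes", "A", "A"),
    ("Yes", "Yes", "Yes", "B", "C"),
    ("No", "Yes", "Yes", "B", "C"),
    ("No", "No", "Yes", "B", "A"),
    ("Yes", "No", "Yes", "A", "A"),
    ("No", "Yes", "Yes", "A", "C"),
    ("Yes", "Yes", "Yes", "A", "C"),
]


def information_of(labels):
    """Information content of a set, from the relative frequency of each class."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Information before the split minus the weighted information after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    after = sum(len(g) / len(rows) * information_of(g) for g in groups.values())
    return information_of([row[-1] for row in rows]) - after


before = information_of([row[-1] for row in FEMALE_ROWS])
gains = {name: round(information_gain(FEMALE_ROWS, i), 3) for i, name in enumerate(COLUMNS)}
print(round(before, 3))  # 0.985
print(gains)
```

MARRIED removes all of the remaining information, so no other attribute can gain more on this subset.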
Two things have happened
With MARRIED
  - We have hit the upper limit of information gain
  - No other attribute can do any better than this
In the TWO subsets
  - All members belong to the same class
  - Either A or C
Hence we STOP here and observe ...
That our DECISION TREE looks like
Gender
  - Male : Class B
  - Female : Married
      - YES : Class C
      - NO : Class A
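As a closing check, the finished tree can be written as two nested tests and run over the ten training records. The names `classify` and `RECORDS` are illustrative, not from the slides.

```python
def classify(gender: str, married: str) -> str:
    """Gender first; for Female members, Married decides between C and A."""
    if gender == "Male":
        return "B"
    return "C" if married == "Yes" else "A"


# (ID, Gender, Married, Class), copied from the table of set S.
RECORDS = [
    (1, "Male", "Yes", "B"), (2, "Female", "No", "A"), (3, "Female", "Yes", "C"),
    (4, "Male", "No", "B"), (5, "Female", "Yes", "C"), (6, "Female", "No", "A"),
    (7, "Male", "No", "B"), (8, "Female", "No", "A"), (9, "Female", "Yes", "C"),
    (10, "Female", "Yes", "C"),
]

# IDs whose predicted class differs from the class in the training set.
errors = [rid for rid, gender, married, label in RECORDS if classify(gender, married) != label]
print(errors)  # []
```

The tree reproduces the training classification using only two of the five attributes.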

Data mining classification-2009-v0

  • 2. Classification (Prithwis Mukerjee)
      Definition
        - The separation or ordering of objects (or things) into classes
      A priori classification
        - The classification is done before you have looked at the data
      A posteriori ("post priori") classification
        - The classification is done after you have looked at the data
  • 3. General approach
      You decide on the classes without looking at the data
        - For example: high risk, medium risk and low risk classes
      You "train" the system
        - Take a small set of objects: the training set
        - Each object has a set of attributes
        - Classify the objects in this small ("training") set into the three classes, without looking at the attributes
        - You will need human expertise here to classify the objects
        - Now find a set of rules based on the attributes such that the system classifies the objects just as you have done without looking at the attributes
      Use these rules to classify the full set of objects
  • 4. If we have this data ...
      Name        Eggs  Pouch  Flies  Feathers  Class
      Cockatoo    Yes   No     Yes    Yes       Bird
      Dugong      No    No     No     No        Mammal
      Echidna     Yes   Yes    No     No        Marsupial
      Emu         Yes   No     No     Yes       Bird
      Kangaroo    No    Yes    No     No        Marsupial
      Koala       No    Yes    No     No        Marsupial
      Kookaburra  Yes   No     Yes    Yes       Bird
      Owl         Yes   No     Yes    Yes       Bird
      Penguin     Yes   No     No     Yes       Bird
      Platypus    Yes   No     No     No        Mammal
      Possum      No    Yes    No     No        Marsupial
      Wombat      No    Yes    No     No        Marsupial
  • 5. We need to build a decision tree like ...
      Pouch ?
        YES: Marsupial
        NO:  Feathers ?
               YES: Bird
               NO:  Mammal
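The tree on this slide can be read as two nested tests. A minimal sketch, assuming each animal is a dictionary of Boolean attributes (the function name `classify` and the dictionary layout are illustrative choices, not part of the slides):

```python
def classify(animal):
    """Walk the two-question tree: Pouch first, then Feathers."""
    if animal["pouch"]:
        return "Marsupial"
    if animal["feathers"]:
        return "Bird"
    return "Mammal"


# Attribute values copied from the data table on slide 4.
kangaroo = {"eggs": False, "pouch": True, "flies": False, "feathers": False}
emu = {"eggs": True, "pouch": False, "flies": False, "feathers": True}
platypus = {"eggs": True, "pouch": False, "flies": False, "feathers": False}

for name, animal in [("Kangaroo", kangaroo), ("Emu", emu), ("Platypus", platypus)]:
    print(name, "->", classify(animal))
```

Note that the "eggs" and "flies" entries are never read, which is exactly the observation the next slide raises.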
  • 6. Question is ...
      Why did we ignore two attributes ?
        - Eggs ?
        - Flies ?
      Why did we use the attribute called POUCH first ?
        - And only then the attribute called FEATHERS
      A rigorous classification process should tell us
        - If there are lots of attributes to be looked at, which are the important ones ?
        - In which order should we look at the attributes ?
      So that the classification arrived at is very similar to the classification done with the training set
  • 7. Decision Tree : Tree Induction Algorithm
      Step 1 : Place all members into one node
        - If all members belong to the same class
        - Stop : there is nothing to be done
      Step 2 : Else
        - Choose one attribute and, based on its value, split the node into two nodes
        - For each of the two nodes
            - If all members belong to the same class : Stop
            - Else : recursively go to Step 1
      Big question : How do you choose which attribute to split a node on ?
        - Information Theory
        - GINI Index
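The two steps above can be sketched as a short recursive function. This is only an illustration under stated assumptions: rows are dictionaries with a "class" key, and the attribute chooser `first_usable_attribute` is a deliberately naive placeholder (it is not the information-theoretic choice the later slides develop). The sketch splits on every distinct value of the chosen attribute, which gives two child nodes when attributes are Yes/No.

```python
from collections import Counter


def first_usable_attribute(rows, attributes):
    """Placeholder chooser (an assumption, not the slides' method):
    return the first attribute that takes more than one value in rows."""
    for attr in attributes:
        if len({row[attr] for row in rows}) > 1:
            return attr
    return None


def induce_tree(rows, attributes, choose=first_usable_attribute, target="class"):
    classes = Counter(row[target] for row in rows)
    # Step 1: all members belong to the same class, so stop.
    if len(classes) == 1:
        return next(iter(classes))
    # Step 2: choose an attribute and split the node on its values.
    attr = choose(rows, attributes)
    if attr is None:
        # Nothing left to split on: fall back to the majority class.
        return classes.most_common(1)[0][0]
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        children[value] = induce_tree(subset, remaining, choose, target)
    return {attr: children}


# Four animals from slide 4, restricted to two attributes.
animals = [
    {"pouch": "Yes", "feathers": "No", "class": "Marsupial"},  # Kangaroo
    {"pouch": "No", "feathers": "Yes", "class": "Bird"},       # Emu
    {"pouch": "No", "feathers": "No", "class": "Mammal"},      # Platypus
    {"pouch": "No", "feathers": "Yes", "class": "Bird"},       # Owl
]
tree = induce_tree(animals, ["pouch", "feathers"])
print(tree)
```

Passing a different `choose` function is where an information-gain or GINI criterion would plug in.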
  • 8. Information Theory : Recapitulate
      Information content I
        - Of an event E
        - That has n possible outcomes
        - Where outcome i happens with probability p_i
        - Is defined as I = Σ_i ( - p_i log2 p_i )
      Example :
        - Event EA has two possible outcomes
            - P1 = 1, P2 = 0 : outcome 1 is a certainty
            - I = 0 because there is NO information in the outcome
        - Event EB has two possible outcomes
            - P1 = 0.5, P2 = 0.5 : both outcomes are equally likely
            - I = -0.5 log2 (0.5) - 0.5 log2 (0.5) = 1
            - This is the maximum possible information for an event with two outcomes
  • 9. Information in the roll of a dice
      Fair dice
        - All numbers 1 to 6 equally probable ( p_i = 1/6 )
        - I = 6 x (-1/6) log2 (1/6) = 2.585
      Loaded dice, case 1
        - P6 = 0.5; P1 = P2 = P3 = P4 = P5 = 0.1
        - I = 5 x (-0.1) log2 (0.1) - 0.5 x log2 (0.5) = 2.16
      Loaded dice, case 2
        - P6 = 0.75; P1 = P2 = P3 = P4 = P5 = 0.05
        - I = 5 x (-0.05) log2 (0.05) - 0.75 x log2 (0.75) = 1.39
      Point to note ...
        - We can change the information in the roll of the dice by changing the probabilities of the various outcomes !
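The dice figures can be checked with a few lines of code. A minimal sketch, assuming the definition I = Σ_i ( - p_i log2 p_i ) from slide 8; the function name `information` is an illustrative choice, and outcomes with probability 0 are skipped because they contribute nothing to the sum:

```python
import math


def information(probabilities):
    """Information content in bits: sum of -p * log2(p) over all outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair = [1 / 6] * 6
loaded_1 = [0.1] * 5 + [0.5]
loaded_2 = [0.05] * 5 + [0.75]

print(f"fair dice:     {information(fair):.3f}")      # about 2.585
print(f"loaded case 1: {information(loaded_1):.2f}")  # about 2.16
print(f"loaded case 2: {information(loaded_2):.2f}")  # about 1.39
```

The more lopsided the probabilities, the lower the information, down to 0 when one outcome is certain.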
  • 10. How do we change the information ?
      In a dice
        - We make mechanical modifications so that the probability of each outcome changes
        - This is highly illegal
      In a set of individuals
        - We regroup the individuals into classes so that the probability of each class changes
        - This is highly permitted in our algorithm
  • 11. Consider the following scenario ..
      ID  Home  Married  Gender  Employed  Credit  Class
      1   Yes   Yes      Male    Yes       A       B
      2   No    No       Female  Yes       A       A
      3   Yes   Yes      Female  Yes       B       C
      4   Yes   No       Male    No        B       B
      5   No    Yes      Female  Yes       B       C
      6   No    No       Female  Yes       B       A
      7   No    No       Male    No        B       B
      8   Yes   No       Female  Yes       A       A
      9   No    Yes      Female  Yes       A       C
      10  Yes   Yes      Female  Yes       A       C
      Probability of each outcome ( or class )
        - P(A) = 3/10 , P(B) = 3/10 , P(C) = 4/10
      Total information content of set S
        - -(3/10) log2 (3/10) - (3/10) log2 (3/10) - (4/10) log2 (4/10) = 1.57
  • 12. Suppose we split this set on HOME
      Set S1 ( Home = No ) : P1(A) = 2/5 , P1(B) = 1/5 , P1(C) = 2/5
      ID  Home  Married  Gender  Employed  Credit  Class
      2   No    No       Female  Yes       A       A
      5   No    Yes      Female  Yes       B       C
      6   No    No       Female  Yes       B       A
      7   No    No       Male    No        B       B
      9   No    Yes      Female  Yes       A       C
      Set S2 ( Home = Yes ) : P2(A) = 1/5 , P2(B) = 2/5 , P2(C) = 2/5
      ID  Home  Married  Gender  Employed  Credit  Class
      1   Yes   Yes      Male    Yes       A       B
      3   Yes   Yes      Female  Yes       B       C
      4   Yes   No       Male    No        B       B
      8   Yes   No       Female  Yes       A       A
      10  Yes   Yes      Female  Yes       A       C
      I1 : Information in set S1
        - -(2/5) log2 (2/5) - (1/5) log2 (1/5) - (2/5) log2 (2/5) = 1.52
      I2 : Information in set S2
        - -(1/5) log2 (1/5) - (2/5) log2 (2/5) - (2/5) log2 (2/5) = 1.52
      Total information in S1 and S2
        - 0.5 I1 + 0.5 I2 = 0.5 x 1.52 + 0.5 x 1.52 = 1.52
  • 13. Impact of HOME attribute
      Within each of the sets S1 and S2, every member has the same value of HOME
      But in set S the value of HOME varies, so the attribute is of some significance
      What is the significance of the HOME attribute ?
      Going from S1 and S2 ( HOME fixed ) back to S ( HOME varying ) raises the information content
        - FROM : 1.52
        - TO : 1.57
      So the HOME attribute adds 0.05 to the overall information content
        - Or : splitting on the HOME attribute reduces uncertainty by 0.05
  • 14. Let us go back to the original set S ..
      ( Same ten-row table as on slide 11 )
      Probability of each outcome ( or class )
        - P(A) = 3/10 , P(B) = 3/10 , P(C) = 4/10
      Total information content of set S
        - -(3/10) log2 (3/10) - (3/10) log2 (3/10) - (4/10) log2 (4/10) = 1.57
  • 15. This time we split on GENDER
      Set S1 ( Gender = Female ) : P1(A) = 3/7 , P1(B) = 0/7 , P1(C) = 4/7
      ID  Home  Married  Gender  Employed  Credit  Class
      2   No    No       Female  Yes       A       A
      3   Yes   Yes      Female  Yes       B       C
      5   No    Yes      Female  Yes       B       C
      6   No    No       Female  Yes       B       A
      8   Yes   No       Female  Yes       A       A
      9   No    Yes      Female  Yes       A       C
      10  Yes   Yes      Female  Yes       A       C
      Set S2 ( Gender = Male ) : P2(A) = 0/3 , P2(B) = 3/3 , P2(C) = 0/3
      ID  Home  Married  Gender  Employed  Credit  Class
      1   Yes   Yes      Male    Yes       A       B
      4   Yes   No       Male    No        B       B
      7   No    No       Male    No        B       B
      I1 : Information in set S1
        - -(3/7) log2 (3/7) - (4/7) log2 (4/7) = 0.985
      I2 : Information in set S2
        - = 0
      Total information in S1 and S2
        - (7/10) I1 + (3/10) I2 = 7/10 x 0.985 + 3/10 x 0 = 0.69
  • 16. Impact of GENDER attribute
      Within each of the sets S1 and S2, every member has the same value of GENDER
      But in set S the value of GENDER varies, so the attribute is of some significance
      What is the significance of the GENDER attribute ?
      Going from S1 and S2 ( GENDER fixed ) back to S ( GENDER varying ) raises the information content
        - FROM : 0.69
        - TO : 1.57
      So the GENDER attribute adds 0.88 to the overall information content
        - Or : splitting on the GENDER attribute reduces uncertainty by 0.88
  • 17. If we were to do this for all attributes ...
      Attribute  Information before split  Information after split  Information gain
      Home       1.57                      1.52                     0.05
      Married    1.57                      0.85                     0.72
      Gender     1.57                      0.69                     0.88
      Employed   1.57                      1.12                     0.45
      Credit     1.57                      1.52                     0.05
      We would observe that GENDER is the best candidate for the split
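The table on this slide can be recomputed from the ten rows of set S. A minimal sketch, assuming the information after a split is the size-weighted average of the information in each subset, as on slides 12 and 15; the helper names `information` and `information_after_split` are illustrative, not from the slides:

```python
import math
from collections import Counter

COLUMNS = ["Home", "Married", "Gender", "Employed", "Credit", "Class"]
# Rows 1 to 10 of set S (slide 11), in ID order.
RAW = """\
Yes Yes Male   Yes A B
No  No  Female Yes A A
Yes Yes Female Yes B C
Yes No  Male   No  B B
No  Yes Female Yes B C
No  No  Female Yes B A
No  No  Male   No  B B
Yes No  Female Yes A A
No  Yes Female Yes A C
Yes Yes Female Yes A C
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in RAW.splitlines()]


def information(rows):
    """Information content of the Class column, in bits."""
    n = len(rows)
    counts = Counter(row["Class"] for row in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def information_after_split(rows, attr):
    """Size-weighted information of the subsets produced by splitting on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return sum(len(g) / len(rows) * information(g) for g in groups.values())


before = information(ROWS)
gains = {}
for attr in COLUMNS[:-1]:
    after = information_after_split(ROWS, attr)
    gains[attr] = before - after
    print(f"{attr:<9} {before:.2f}  {after:.2f}  {gains[attr]:.2f}")

print("best split:", max(gains, key=gains.get))
```

Run on this data, the attribute with the largest gain is Gender, in agreement with the slide.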
  • 18. And the first part of our tree would be ...
      Gender
        Male:   Class B
        Female: What next ?
  • 19. Remove GENDER and Class B and continue
      ID  Home  Married  Employed  Credit  Class
      2   No    No       Yes       A       A
      3   Yes   Yes      Yes       B       C
      5   No    Yes      Yes       B       C
      6   No    No       Yes       B       A
      8   Yes   No       Yes       A       A
      9   No    Yes      Yes       A       C
      10  Yes   Yes      Yes       A       C
      Probability of each outcome ( or class )
        - P(A) = 3/7 , P(C) = 4/7
      Total information content of this set
        - -(3/7) log2 (3/7) - (4/7) log2 (4/7) = 0.985 ( the same value as I1 on slide 15 )
  • 20. We split this set on HOME ...
      Set S1 ( Home = No ) : P1(A) = 2/4 , P1(C) = 2/4
      ID  Home  Married  Employed  Credit  Class
      2   No    No       Yes       A       A
      5   No    Yes      Yes       B       C
      6   No    No       Yes       B       A
      9   No    Yes      Yes       A       C
      Set S2 ( Home = Yes ) : P2(A) = 1/3 , P2(C) = 2/3
      ID  Home  Married  Employed  Credit  Class
      3   Yes   Yes      Yes       B       C
      8   Yes   No       Yes       A       A
      10  Yes   Yes      Yes       A       C
      I1 : Information in set S1
        - -(2/4) log2 (2/4) - (2/4) log2 (2/4) = 1.00
      I2 : Information in set S2
        - -(1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.92
      Total information in S1 and S2
        - (4/7) I1 + (3/7) I2 = 4/7 x 1.00 + 3/7 x 0.92 = 0.96
      Gain ( information before the split is 0.985 ) = 0.985 - 0.96 = 0.02
  • 21. But if we were to split on MARRIED
      Set S1 ( Married = No ) : P1(A) = 3/3 , P1(C) = 0/3
      ID  Home  Married  Employed  Credit  Class
      2   No    No       Yes       A       A
      8   Yes   No       Yes       A       A
      6   No    No       Yes       B       A
      Set S2 ( Married = Yes ) : P2(A) = 0/4 , P2(C) = 4/4
      ID  Home  Married  Employed  Credit  Class
      3   Yes   Yes      Yes       B       C
      9   No    Yes      Yes       A       C
      10  Yes   Yes      Yes       A       C
      5   No    Yes      Yes       B       C
      I1 : Information in set S1
        - = 0.0
      I2 : Information in set S2
        - = 0.0
      Total information in S1 and S2
        - = 0.0
      Gain ( information before the split is 0.985 ) = 0.985 - 0 = 0.985
  • 22. Two things have happened
      With MARRIED
        - We have hit the upper limit of information gain
        - No other attribute can do any better than this
      In the TWO subsets
        - All members belong to the same class
        - Either A or C
      Hence we STOP here and observe ...
  • 23. That our DECISION TREE looks like
      Gender
        Male:   Class B
        Female: Married
                  YES: Class C
                  NO:  Class A
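Putting the pieces together, the whole walkthrough can be reproduced by combining the recursive induction of slide 7 with the information-gain criterion. A minimal sketch under stated assumptions: the function names (`information`, `split`, `gain`, `build_tree`) and the nested-dictionary tree representation are illustrative, each node splits on every value of the chosen attribute, and ties in gain are broken by attribute order.

```python
import math
from collections import Counter

COLUMNS = ["Home", "Married", "Gender", "Employed", "Credit", "Class"]
# Rows 1 to 10 of set S (slide 11), in ID order.
RAW = """\
Yes Yes Male   Yes A B
No  No  Female Yes A A
Yes Yes Female Yes B C
Yes No  Male   No  B B
No  Yes Female Yes B C
No  No  Female Yes B A
No  No  Male   No  B B
Yes No  Female Yes A A
No  Yes Female Yes A C
Yes Yes Female Yes A C
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in RAW.splitlines()]


def information(rows):
    """Information content of the Class column, in bits."""
    n = len(rows)
    counts = Counter(row["Class"] for row in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def split(rows, attr):
    """Group rows by their value of attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def gain(rows, attr):
    """Information before the split minus size-weighted information after it."""
    after = sum(len(g) / len(rows) * information(g) for g in split(rows, attr).values())
    return information(rows) - after


def build_tree(rows, attrs):
    classes = Counter(row["Class"] for row in rows)
    # Stop when the node is pure or no attributes remain (majority class).
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree(group, rest) for value, group in split(rows, best).items()}}


tree = build_tree(ROWS, COLUMNS[:-1])
print(tree)
```

On this data the sketch selects Gender at the root and Married on the Female branch, the same tree as on this slide.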