Data Mining

               Rajendra Akerkar



July 7, 2009      Data Mining: R. Akerkar   1
What Is Data Mining?
• Data mining (knowledge discovery from data)
      – Extraction of interesting (non-trivial, implicit,
        previously unknown and potentially useful) patterns or
        knowledge from huge amounts of data


• Is everything “data mining”?
      – (Deductive) query processing
      – Expert systems or small ML/statistical programs


Definition
   • Several Definitions
         – Non-trivial extraction of implicit, previously
           unknown and potentially useful information from data

         – Exploration & analysis, by automatic or
           semi-automatic means, of
           large quantities of data
           in order to discover
           meaningful patterns



From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


Classification




Classification: Definition
• Given a collection of records (training set )
     – Each record contains a set of attributes, one of the
       attributes is the class.
• Find a model for class attribute as a function of the
  values of other attributes.
• Goal: previously unseen records should be assigned
  a class as accurately as possible.
     – A test set is used to determine the accuracy of the model.
       Usually, the given data set is divided into training and
       test sets, with the training set used to build the model and
       the test set used to validate it.
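The train-and-test procedure described above can be sketched in a few lines of Python. This is a minimal sketch, assuming records are (attributes, class-label) pairs; the helper names are illustrative, not from the slides:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records and divide them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    hits = sum(1 for attributes, label in test_set if model(attributes) == label)
    return hits / len(test_set)
```

The model is built from the training set only; the held-out test set then gives an unbiased estimate of how accurately previously unseen records would be classified.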
Classification: Introduction
• A classification scheme which generates a tree
  and a set of rules from given data set.

• The attributes of the records are categorised into
  two types:
  – Attributes whose domain is numerical are called
    numerical attributes.
  – Attributes whose domain is not numerical are
    called categorical attributes.

Decision Tree

• A decision tree is a tree with the following properties:
   – An inner node represents an attribute.
   – An edge represents a test on the attribute of the father
     node.
   – A leaf represents one of the classes.

• Construction of a decision tree
   – Based on the training data
   – Top-Down strategy


Decision Tree: Example
• The data set has five attributes.
• There is a special attribute: the attribute class is the class
  label.
• The attributes temp (temperature) and humidity are
  numerical attributes.
• The other attributes are categorical, that is, they cannot be
  ordered.

• Based on the training data set, we want to find a set of
  rules to determine which values of outlook, temperature,
  humidity and wind determine whether or not to play golf.

Decision Tree: Example

• We have five leaf nodes.
• In a decision tree, each leaf node represents a rule.

• We have the following rules corresponding to the tree
  given in Figure.

•   RULE 1     If it is sunny and the humidity is not above 75%, then play.
•   RULE 2     If it is sunny and the humidity is above 75%, then do not play.
•   RULE 3     If it is overcast, then play.
•   RULE 4     If it is rainy and not windy, then play.
•   RULE 5     If it is rainy and windy, then don't play.


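The five rules can be encoded directly as a function. This is a minimal sketch, with illustrative value encodings (humidity as a percentage, windy as a boolean) that are assumptions, not from the slides:

```python
def play_golf(outlook, humidity, windy):
    """Decide whether to play, following the five leaf-node rules."""
    if outlook == "sunny":
        return humidity <= 75        # RULE 1 (play) / RULE 2 (do not play)
    if outlook == "overcast":
        return True                  # RULE 3
    if outlook == "rainy":
        return not windy             # RULE 4 (play) / RULE 5 (do not play)
    raise ValueError("unknown outlook: " + repr(outlook))
```

Each path from the root to a leaf becomes one branch of the function, which is exactly the rule-extraction idea developed later in these slides.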
Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute.
• Each arc is a possible value of that attribute.
• At each node the splitting attribute is selected to be the most
  informative among the attributes not yet considered in the path from
  the root.
• Entropy is used to measure how informative a node is.
• The algorithm uses the criterion of information gain to determine
  the goodness of a split.
     – The attribute with the greatest information gain is taken as
       the splitting attribute, and the data set is split for all distinct
       values of the attribute.


Training Dataset
                         This follows an example from Quinlan’s ID3

     age      income    student    credit_rating    buys_computer
     <=30     high      no         fair             no
     <=30     high      no         excellent        no
     31…40    high      no         fair             yes
     >40      medium    no         fair             yes
     >40      low       yes        fair             yes
     >40      low       yes        excellent        no
     31…40    low       yes        excellent        yes
     <=30     medium    no         fair             no
     <=30     low       yes        fair             yes
     >40      medium    yes        fair             yes
     <=30     medium    yes        excellent        yes
     31…40    medium    no         excellent        yes
     31…40    high      yes        fair             yes
     >40      medium    no         excellent        no
Extracting Classification Rules from Trees
•   Represent the knowledge in the
    form of IF-THEN rules
•   One rule is created for each path
    from the root to a leaf
•   Each attribute-value pair along a
    path forms a conjunction
•   The leaf node holds the class
    prediction
•   Rules are easier for humans to
    understand                                          What are the rules?


Attribute Selection Measure: Information Gain (ID3/C4.5)

   Select the attribute with the highest information gain.
   S contains si tuples of class Ci, for i = 1, …, m.
   Information (encoded in bits) measures the info required to
    classify any arbitrary tuple:

                 I(s1, s2, …, sm) = − Σ_{i=1..m} (si / s) log2(si / s)

   Entropy of attribute A with values {a1, a2, …, av}:

                 E(A) = Σ_{j=1..v} ((s1j + … + smj) / s) · I(s1j, …, smj)

   Information gained by branching on attribute A:

                 Gain(A) = I(s1, s2, …, sm) − E(A)
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

   age      pi    ni    I(pi, ni)
   <=30     2     3     0.971
   31…40    4     0     0
   >40      3     2     0.971

(computed over the training dataset shown earlier)
Attribute Selection by Information Gain Computation

       E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

       (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’s
       and 3 no’s. Hence

       Gain(age) = I(p, n) − E(age) = 0.246

                   Similarly,        Gain(income) = 0.029
                                     Gain(student) = 0.151
                                     Gain(credit_rating) = 0.048

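The computation above can be checked in a few lines of Python. The dataset is transcribed from the training table; the helper names info and gain are illustrative:

```python
from math import log2
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """I(s1, ..., sm): expected information of a class distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Gain(A) = I(S) - E(A) for the attribute in column attr."""
    labels = [r[-1] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr], []).append(r[-1])
    entropy = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(labels) - entropy

# Gain(age) ≈ 0.247 (the slides' 0.246 rounds I and E first), income ≈ 0.029,
# student ≈ 0.152, credit_rating ≈ 0.048, so age is chosen as the first split.
```

Because age has the greatest gain, it becomes the splitting attribute at the root, and the procedure recurses on each branch.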
Exercise 1
• The following table consists of training data from an employee
  database.




• Let status be the class attribute. Use the ID3 algorithm to construct
  a decision tree from the given data.

Clustering




Clustering: Definition
• Given a set of data points, each having a set of
  attributes, and a similarity measure among them, find
  clusters such that
    – Data points in one cluster are more similar to one another.
     – Data points in separate clusters are less similar to one
       another.
• Similarity Measures:
     – Euclidean distance, if attributes are continuous.
     – Other problem-specific measures.


The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in
  four steps:
   – Partition the objects into k nonempty subsets.
   – Compute seed points as the centroids of the clusters of
     the current partition (the centroid is the centre, i.e., the
     mean point, of the cluster).
   – Assign each object to the cluster with the nearest seed
     point.
   – Go back to Step 2; stop when there are no more new assignments.
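The four steps above can be sketched as a small, generic Python implementation (Lloyd's algorithm). The function name and the empty-cluster handling are illustrative choices, not from the slides:

```python
def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids as
    cluster means, and repeat until no assignment changes (Step 4)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(len(centroids)),
                              key=lambda k: dist2(p, centroids[k]))
                          for p in points]
        if new_assignment == assignment:      # no more new assignments: stop
            break
        assignment = new_assignment
        for k in range(len(centroids)):       # recompute the seed points
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                       # keep the old centroid if empty
                centroids[k] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment

points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
centroids, assignment = kmeans(points, [(1.0,), (2.0,)])
# converges to centroids (2.0,) and (11.0,)
```

The same function handles the 1-dimensional and 2-dimensional examples in the following slides, since points are just tuples of coordinates.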
Visualization of the k-means algorithm




Exercise 2
• Apply the k-means algorithm to the
  following 1-dimensional points (for k = 2):
  1, 2, 3, 4, 6, 7, 8, 9.
• Use 1 and 2 as the starting centroids.




K-Means for a 2-Dimensional Database
• Let us consider {x1, x2, x3, x4, x5} with following coordinates as
  two-dimensional sample for clustering:

• x1 = (0, 2), x2 = (0, 0), x3 = (1.5,0), x4 = (5,0), x5 = (5, 2)

• Suppose that the required number of clusters is 2.
• Initially, clusters are formed from random distribution of
  samples:
• C1 = {x1, x2, x4} and C2 = {x3, x5}.


Centroid Calculation
• Suppose that the given set of N samples in an n-dimensional space
  has somehow been partitioned into K clusters {C1, C2, …, CK}.
• Each Ck has nk samples, and each sample is in exactly one cluster.
• Therefore, Σ nk = N, where k = 1, …, K.
• The mean vector Mk of cluster Ck is defined as the centroid of the
  cluster:

                 Mk = (1/nk) Σ_{i=1..nk} xik

                 where xik is the ith sample belonging to cluster Ck.

• In our example, the centroids of the two clusters are
• M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
• M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}

The Square-Error of the Cluster
• The square-error for cluster Ck is the sum of the squared Euclidean
  distances between each sample in Ck and its centroid.
• This error is called the within-cluster variation.

                 ek² = Σ_{i=1..nk} (xik − Mk)²

• The within-cluster variations, after the initial random distribution of
  samples, are
• e1² = [(0 − 1.66)² + (2 − 0.66)²] + [(0 − 1.66)² + (0 − 0.66)²]
        + [(5 − 1.66)² + (0 − 0.66)²] = 19.36
• e2² = [(1.5 − 3.25)² + (0 − 1)²] + [(5 − 3.25)² + (2 − 1)²] = 8.12

Total Square-error
• The square-error for the entire clustering space
  containing K clusters is the sum of the within-cluster
  variations:

                 E² = Σ_{k=1..K} ek²




• The total square-error is
  E² = e1² + e2² = 19.36 + 8.12 = 27.48

• When we reassign all samples, depending on the minimum distance
  from centroids M1 and M2, the new redistribution of samples among the
  clusters is:
• d(M1, x1) = (1.66² + 1.34²)^1/2 = 2.14  and  d(M2, x1) = 3.40  ⇒  x1 ∈ C1
• d(M1, x2) = 1.79                        and  d(M2, x2) = 3.40  ⇒  x2 ∈ C1
  d(M1, x3) = 0.83                        and  d(M2, x3) = 2.01  ⇒  x3 ∈ C1
  d(M1, x4) = 3.41                        and  d(M2, x4) = 2.01  ⇒  x4 ∈ C2
  d(M1, x5) = 3.60                        and  d(M2, x5) = 2.01  ⇒  x5 ∈ C2

  The above calculations are based on the Euclidean distance formula,

                 d(xi, xj) = ( Σ_{k=1..m} (xik − xjk)² )^1/2




• New Clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new
  centroids
• M1 = {0.5, 0.67}
• M2 = {5.0, 1.0}

• The corresponding within-cluster variations and the total square
  error are
• e1² = 4.17
• e2² = 2.00
• E² = 6.17


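The centroid and within-cluster-variation computations in this worked example can be reproduced in code (the helper names are illustrative). Note that computing e1² with exact centroids gives ≈ 19.33; the slides' 19.36 reflects intermediate rounding:

```python
points = {"x1": (0, 2), "x2": (0, 0), "x3": (1.5, 0), "x4": (5, 0), "x5": (5, 2)}

def centroid(cluster):
    """Mean vector M_k of a cluster (a list of points)."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def within_cluster_variation(cluster):
    """e_k^2: sum of squared Euclidean distances to the cluster centroid."""
    m = centroid(cluster)
    return sum(sum((x - mi) ** 2 for x, mi in zip(p, m)) for p in cluster)

c1 = [points[n] for n in ("x1", "x2", "x4")]   # initial cluster C1
c2 = [points[n] for n in ("x3", "x5")]         # initial cluster C2
print(centroid(c1), centroid(c2))    # (1.666..., 0.666...) (3.25, 1.0)
print(within_cluster_variation(c1))  # ≈ 19.33
print(within_cluster_variation(c2))  # 8.125
```

Swapping in the new clusters C1 = {x1, x2, x3} and C2 = {x4, x5} reproduces the reduced variations above, confirming that the reassignment lowered the total square-error.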
Exercise 3
Let the set X consist of the following sample points in
  2-dimensional space:

X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5,-1), (0, 1.6), (-1,1.5)}

Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of
  centroids for X.
What are the revised values of c1 and c2 after 1 iteration of k-
  means clustering (k = 2)?

Association Rule Discovery




Association discovery

• Association discovery uncovers affinities
  amongst collections of items
• Affinities are represented by association rules
• Association discovery is an unsupervised
  approach to data mining.




Association discovery is one of the most common forms of
   data mining, and the one people most closely associate
   with the term: mining for gold in a vast database. The gold
   in this case is a rule that tells you something about your
   database that you did not already know, and
   probably could not have articulated explicitly.




Association discovery is done using rule induction, which
    tells a user how strong a pattern is and how
    likely it is to recur. For instance, a database of
    items scanned in consumer market baskets helps find
    interesting patterns such as: if bagels are purchased, then
    cream cheese is purchased 90% of the time, and this
    pattern occurs in 3% of all shopping baskets.

You tell the database to find the rules; the rules
    pulled from the database are extracted and ordered for
    presentation to the user according to the percentage of
    times they are correct and how often they apply. One often
    gets many rules, and the user almost needs a second pass
    to find his or her gold nugget.

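The two figures in the bagel example, 90% confidence and 3% support, can be computed as follows. This is a minimal sketch; the function names and toy baskets are illustrative assumptions:

```python
def support(baskets, itemset):
    """Fraction of all baskets that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    lhs, both = set(lhs), set(lhs) | set(rhs)
    containing_lhs = [b for b in baskets if lhs <= b]
    return sum(both <= b for b in containing_lhs) / len(containing_lhs)

baskets = [{"bagels", "cream cheese"}, {"bagels", "cream cheese"},
           {"bagels"}, {"milk"}, {"milk", "bread"}]
print(support(baskets, {"bagels", "cream cheese"}))          # 0.4
print(confidence(baskets, {"bagels"}, {"cream cheese"}))     # 0.666...
```

Ranking the discovered rules by confidence and support is exactly the ordering described above: how often a rule is correct, and how often it applies.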
Associations
• The problem of deriving associations from
  data
      – market-basket analysis
      – The popular algorithms are thus concerned with
        determining the set of frequent itemsets in a
        given set of transaction databases.
      – The problem is to compute the frequency of
        occurrences of each itemset in the database.

Definition




Association Rules
• Algorithms that obtain association rules
  from data usually divide the task into two
  parts:
      – find the frequent itemsets and
      – form the rules from them.




Association Rules
• The problem of mining association rules can be
  divided into two sub-problems:




Apriori Algorithm




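The two candidate-generation phases of Apriori, join and prune, can be sketched as follows, assuming frequent k-itemsets are represented as sorted tuples (the function name apriori_gen is conventional, not from the slides):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Join frequent k-itemsets sharing their first k-1 items, then prune
    candidates that have an infrequent k-subset (anti-monotonicity)."""
    Lk = [tuple(sorted(s)) for s in Lk]
    frequent = set(Lk)
    k = len(Lk[0])
    candidates = set()
    for a in Lk:                                   # join step
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # prune step: every k-subset of a (k+1)-candidate must be frequent
    return {c for c in candidates
            if all(s in frequent for s in combinations(c, k))}

L2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(apriori_gen(L2))   # {('a', 'b', 'c')}: abd and acd are pruned
```

The exercise below can be worked the same way, applying the join and prune steps to the given L3.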
Exercise 4
Suppose that L3 is the list
      {{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w},
        {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s},
        {q,r,s}}
Which itemsets are placed in C4 by the join
 step of the Apriori algorithm? Which are
 then removed by the prune step?

Exercise 5
• Given a dataset with four attributes w, x, y
  and z, each with three values, how many
  rules can be generated with one term on the
  right-hand side?




References
•   R. Akerkar and P. Lingras. Building an Intelligent Web: Theory &
    Practice. Jones & Bartlett, 2008 (in India: Narosa Publishing House, 2009)
•   U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy.
    Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
•   U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in
    Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
•   J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
    Kaufmann, 2001
•   D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT
    Press, 2001


Knowledge Organization SystemsR A Akerkar
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignR A Akerkar
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling LanguageR A Akerkar
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical PreliminariesR A Akerkar
 
Software project management
Software project managementSoftware project management
Software project managementR A Akerkar
 

Mais de R A Akerkar (20)

Rajendraakerkar lemoproject
Rajendraakerkar lemoprojectRajendraakerkar lemoproject
Rajendraakerkar lemoproject
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
 
Connecting and Exploiting Big Data
Connecting and Exploiting Big DataConnecting and Exploiting Big Data
Connecting and Exploiting Big Data
 
Linked open data
Linked open dataLinked open data
Linked open data
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data sets
 
Description logics
Description logicsDescription logics
Description logics
 
Link analysis
Link analysisLink analysis
Link analysis
 
artificial intelligence
artificial intelligenceartificial intelligence
artificial intelligence
 
Case Based Reasoning
Case Based ReasoningCase Based Reasoning
Case Based Reasoning
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language system
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface Design
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling Language
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical Preliminaries
 
Software project management
Software project managementSoftware project management
Software project management
 

Último

USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 

Último (20)

USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 

Data mining

  • 1. Data Mining Rajendra Akerkar July 7, 2009 Data Mining: R. Akerkar 1
  • 2. What Is Data Mining? • Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data • Is everything “data mining”? – (Deductive) query processing – Expert systems or small ML/statistical programs
  • 3. Definition • Several definitions – Non-trivial extraction of implicit, previously unknown and potentially useful information from data – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
  • 4. From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
  • 5. Classification
  • 6. Classification: Definition • Given a collection of records (training set) – Each record contains a set of attributes; one of the attributes is the class. • Find a model for the class attribute as a function of the values of the other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  • 7. Classification: Introduction • A classification scheme which generates a tree and a set of rules from a given data set. • The attributes of the records are categorised into two types: – Attributes whose domain is numerical are called numerical attributes. – Attributes whose domain is not numerical are called categorical attributes.
  • 8. Decision Tree • A decision tree is a tree with the following properties: – An inner node represents an attribute. – An edge represents a test on the attribute of the father node. – A leaf represents one of the classes. • Construction of a decision tree – Based on the training data – Top-down strategy
  • 9. (slide contains a figure only)
  • 10. Decision Tree Example • The data set has five attributes. • There is a special attribute: the attribute class is the class label. • The attributes temp (temperature) and humidity are numerical attributes. • The other attributes are categorical, that is, they cannot be ordered. • Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind determine whether or not to play golf.
  • 11. Decision Tree Example • We have five leaf nodes. • In a decision tree, each leaf node represents a rule. • We have the following rules corresponding to the tree given in the figure. • RULE 1: If it is sunny and the humidity is not above 75%, then play. • RULE 2: If it is sunny and the humidity is above 75%, then do not play. • RULE 3: If it is overcast, then play. • RULE 4: If it is rainy and not windy, then play. • RULE 5: If it is rainy and windy, then don't play.
  • 12. (slide contains a figure only)
  • 13. Iterative Dichotomizer (ID3) • Quinlan (1986) • Each node corresponds to a splitting attribute. • Each arc is a possible value of that attribute. • At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root. • Entropy is used to measure how informative a node is. • The algorithm uses the criterion of information gain to determine the goodness of a split. – The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.
  • 14. Training Dataset. This follows an example from Quinlan’s ID3.

        age     income  student credit_rating  buys_computer
        <=30    high    no      fair           no
        <=30    high    no      excellent      no
        31…40   high    no      fair           yes
        >40     medium  no      fair           yes
        >40     low     yes     fair           yes
        >40     low     yes     excellent      no
        31…40   low     yes     excellent      yes
        <=30    medium  no      fair           no
        <=30    low     yes     fair           yes
        >40     medium  yes     fair           yes
        <=30    medium  yes     excellent      yes
        31…40   medium  no      excellent      yes
        31…40   high    yes     fair           yes
        >40     medium  no      excellent      no
  • 15. Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand. What are the rules?
  • 16. Attribute Selection Measure: Information Gain (ID3/C4.5) • Select the attribute with the highest information gain. • S contains s_i tuples of class C_i for i = 1, …, m. • The information required to classify an arbitrary tuple (information is encoded in bits): I(s1, s2, …, sm) = −Σ_{i=1..m} (s_i/s) log2(s_i/s) • The entropy of attribute A with values {a1, a2, …, av}: E(A) = Σ_{j=1..v} ((s_1j + … + s_mj)/s) · I(s_1j, …, s_mj) • The information gained by branching on attribute A: Gain(A) = I(s1, s2, …, sm) − E(A)
  • 17. • Class P: buys_computer = “yes” • Class N: buys_computer = “no” • I(p, n) = I(9, 5) = 0.940 • Compute the entropy for age:

        age     pi  ni  I(pi, ni)
        <=30    2   3   0.971
        31…40   4   0   0
        >40     3   2   0.971

    (the training data table from slide 14 is repeated on this slide)
  • 18. Attribute Selection by Information Gain Computation • E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 • (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes's and 3 no’s. • Hence Gain(age) = I(p, n) − E(age) = 0.246 • Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
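The entropy and gain figures on this slide can be reproduced with a few lines of Python; this is an illustrative sketch (the `entropy` helper is my own, not from the slides):

```python
from math import log2

def entropy(counts):
    """I(s1, ..., sm): expected information, in bits, for the given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Class distribution for buys_computer over the 14 samples: 9 yes, 5 no
i_all = entropy([9, 5])                      # I(9, 5), about 0.940

# Partition by age: <=30 -> (2 yes, 3 no), 31...40 -> (4, 0), >40 -> (3, 2)
e_age = (5/14) * entropy([2, 3]) + (4/14) * entropy([4, 0]) + (5/14) * entropy([3, 2])

gain_age = i_all - e_age                     # about 0.246, as on the slide
```

Computing `entropy([3, 2])` and `entropy([2, 3])` gives the same value, which is why the two outer age branches share the entry 0.971 in the table on slide 17.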
  • 19. Exercise 1 • The following table consists of training data from an employee database. • Let status be the class attribute. Use the ID3 algorithm to construct a decision tree from the given data.
  • 20. Clustering
  • 21. Clustering: Definition • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that – Data points in one cluster are more similar to one another. – Data points in separate clusters are less similar to one another. • Similarity measures: – Euclidean distance if attributes are continuous. – Other problem-specific measures.
  • 22. The K-Means Clustering Method • Given k, the k-means algorithm is implemented in four steps: – Partition objects into k nonempty subsets – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) – Assign each object to the cluster with the nearest seed point – Go back to Step 2; stop when there are no more new assignments
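The four steps above can be sketched directly in Python. This is a minimal illustration, not the slides' code; the function name and the deterministic seeding (first k points) are my own choices:

```python
def kmeans(points, k, iters=100):
    # Step 1: form an initial partition; here we seed with the first k points
    centroids = points[:k]
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Step 2: recompute seed points as the centroids (means) of the clusters
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        # Step 4: stop when no assignment changes (centroids are stable)
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups on a line; k-means should find centers 0.5 and 10.5
centroids, clusters = kmeans([(0, 0), (1, 0), (10, 0), (11, 0)], k=2)
```

Real implementations usually pick the initial seeds at random and restart several times, since k-means only converges to a local minimum of the square-error.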
  • 23. Visualization of the k-means algorithm
  • 24. Exercise 2 • Apply the k-means algorithm to the following 1-dimensional points (for k=2): 1; 2; 3; 4; 6; 7; 8; 9. • Use 1 and 2 as the starting centroids.
  • 25. K-Means for a 2-dimensional database • Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering: • x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2) • Suppose that the required number of clusters is 2. • Initially, clusters are formed from a random distribution of samples: • C1 = {x1, x2, x4} and C2 = {x3, x5}.
  • 26. Centroid Calculation • Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}. • Each Ck has nk samples and each sample is in exactly one cluster; therefore Σ nk = N, where k = 1, …, K. • The mean vector Mk of cluster Ck is defined as the centroid of the cluster: Mk = (1/nk) Σ_{i=1..nk} x_ik, where x_ik is the ith sample belonging to cluster Ck. • In our example, the centroids for these two clusters are • M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66} • M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}
  • 27. The Square-error of the Cluster • The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid. • This error is called the within-cluster variation: e_k² = Σ_{i=1..nk} (x_ik − Mk)² • The within-cluster variations, after the initial random distribution of samples, are • e1² = [(0 − 1.66)² + (2 − 0.66)²] + [(0 − 1.66)² + (0 − 0.66)²] + [(5 − 1.66)² + (0 − 0.66)²] = 19.36 • e2² = [(1.5 − 3.25)² + (0 − 1)²] + [(5 − 3.25)² + (2 − 1)²] = 8.12
  • 28. Total Square-error • The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations: E² = Σ_{k=1..K} e_k² • The total square-error is E² = e1² + e2² = 19.36 + 8.12 = 27.48
  • 29. • When we reassign all samples, depending on the minimum distance from centroids M1 and M2, the new redistribution of samples inside the clusters will be • d(M1, x1) = (1.66² + 1.34²)^(1/2) = 2.14 and d(M2, x1) = 3.40, so x1 ∈ C1 • d(M1, x2) = 1.79 and d(M2, x2) = 3.40, so x2 ∈ C1 • d(M1, x3) = 0.83 and d(M2, x3) = 2.01, so x3 ∈ C1 • d(M1, x4) = 3.41 and d(M2, x4) = 2.01, so x4 ∈ C2 • d(M1, x5) = 3.60 and d(M2, x5) = 2.01, so x5 ∈ C2 • The above calculation is based on the Euclidean distance formula, d(xi, xj) = (Σ_{k=1..m} (x_ik − x_jk)²)^(1/2)
  • 30. • The new clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids • M1 = {0.5, 0.67} • M2 = {5.0, 1.0} • The corresponding within-cluster variations and the total square-error are • e1² = 4.17 • e2² = 2.00 • E² = 6.17
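One pass of the reassignment described on slides 29 and 30 can be checked in Python. A small sketch using the slides' five points (the helper names are my own):

```python
def centroid(cluster):
    """Mean point of a cluster of 2-D tuples."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def dist2(p, q):
    # Squared Euclidean distance; ordering is the same as for the distance itself
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

x1, x2, x3, x4, x5 = (0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)
points = [x1, x2, x3, x4, x5]

# Initial (random) partition from slide 25: C1 = {x1, x2, x4}, C2 = {x3, x5}
m1, m2 = centroid([x1, x2, x4]), centroid([x3, x5])

# Reassign every sample to the nearer centroid
c1 = [p for p in points if dist2(p, m1) <= dist2(p, m2)]
c2 = [p for p in points if dist2(p, m1) > dist2(p, m2)]

m1, m2 = centroid(c1), centroid(c2)   # new centroids, as on slide 30
```

Comparing squared distances avoids the square root entirely, which is why `dist2` suffices for the assignment step.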
  • 31. Exercise 3 • Let the set X consist of the following sample points in 2-dimensional space: X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5, −1), (0, 1.6), (−1, 1.5)} • Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X. What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)?
  • 32. Association Rule Discovery
  • 33. Associations Discovery • Associations discovery uncovers affinities amongst collections of items • Affinities are represented by association rules • Associations discovery is an unsupervised approach to data mining.
  • 34. Association discovery is one of the most common forms of data mining, and the one people most closely associate with data mining: namely, mining for gold in a vast database. The gold in this case is a rule that tells you something about your database that you did not already know, and were probably unable to explicitly articulate.
  • 35. Association discovery is done using rule induction, which basically tells a user how strong a pattern is and how likely it is to happen again. For instance, a database of items scanned in a consumer market basket helps find interesting patterns such as: If bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets. You tell the database to find the rules; the rules pulled from the database are extracted and ordered for presentation to the user according to the percentage of times they are correct and how often they apply. One often gets a lot of rules, and the user almost needs a second pass to find his/her gold nugget.
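The "90% of the time" and "3% of all baskets" figures above are just the rule's confidence and support. A hypothetical sketch (the function and the tiny basket list are mine, not from the slides):

```python
def rule_stats(baskets, lhs, rhs):
    """Support and confidence of the rule lhs -> rhs over a list of baskets."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for b in baskets if lhs <= set(b))
    n_both = sum(1 for b in baskets if (lhs | rhs) <= set(b))
    support = n_both / len(baskets)                 # fraction of all baskets
    confidence = n_both / n_lhs if n_lhs else 0.0   # how often the rule holds
    return support, confidence

baskets = [
    ["bagels", "cream cheese"],
    ["bagels"],
    ["milk", "bread"],
    ["bagels", "cream cheese", "milk"],
]
support, confidence = rule_stats(baskets, ["bagels"], ["cream cheese"])
```

Ordering discovered rules by confidence (and breaking ties by support) gives exactly the ranked presentation described above.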
  • 36. Associations • The problem of deriving associations from data – market-basket analysis – The popular algorithms are thus concerned with determining the set of frequent itemsets in a given set of operational databases. – The problem is to compute the frequency of occurrence of each itemset in the database.
  • 37. Definition (slide contains a figure only)
  • 38. Association Rules • Algorithms that obtain association rules from data usually divide the task into two parts: – find the frequent itemsets and – form the rules from them.
  • 39. Association Rules • The problem of mining association rules can be divided into two sub-problems:
  • 40. Apriori Algorithm
  • 41. Exercise 4 • Suppose that L3 is the list {{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}. Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step?
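The join and prune steps referred to in this exercise can be sketched as follows. This is a minimal candidate-generation routine of my own devising, shown on a different small example so the exercise itself is left to the reader:

```python
from itertools import combinations

def apriori_gen(Lk):
    """Candidate (k+1)-itemsets from a list Lk of frequent k-itemsets."""
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    k = len(Lk[0])
    frequent = set(Lk)
    # Join step: merge two k-itemsets that agree on their first k-1 items
    joined = {a + (b[-1],) for a in Lk for b in Lk
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune step: drop any candidate that has an infrequent k-subset
    return {c for c in joined
            if all(tuple(sorted(s)) in frequent for s in combinations(c, k))}

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (1, 3, 5)]
C4 = apriori_gen(L3)
# Join produces (1,2,3,4) and (1,3,4,5); prune removes (1,3,4,5)
# because its subset (1,4,5) is not in L3.
```

The prune step is what makes Apriori efficient: any superset of an infrequent itemset cannot itself be frequent, so such candidates are discarded before any support counting.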
  • 42. Exercise 5 • Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side?
  • 43. References • R. Akerkar and P. Lingras. Building an Intelligent Web: Theory & Practice. Jones & Bartlett, 2008 (in India: Narosa Publishing House, 2009) • U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 • U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001 • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001 • D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001