Decision Tree
Dae-Ki Kang
Definition
• Definition #1
 ▫ A hierarchy of if-then’s
 ▫ Node – test
 ▫ Edge – direction of control
• Definition #2
 ▫ A tree that represents compression of data based
   on class
• A manually generated decision tree is not
  interesting at all!
Decision tree for mushroom data
Algorithms
• ID3
 ▫ Information gain
• C4.5 (=J48 in WEKA) (and See5/C5.0)
 ▫ Information gain ratio
• Classification and regression tree (CART)
 ▫ Gini gain
• Chi-squared automatic interaction detection
  (CHAID)
Example from Tom Mitchell’s book
Naïve strategy of choosing attributes
   (i.e. choose the next available attribute)
[Figure: naive tree for the play-tennis data, splitting on whatever attribute comes first]
Outlook (root) – Play=Yes: 3,4,5,7,9,10,11,12,13 (9); Play=No: 1,2,6,8,14 (5)
 ▫ Sunny → Play=Yes: 9,11 (2); Play=No: 1,2,8 (3); split again on Temp (Hot/Mild/Cool)
 ▫ Overcast → Play=Yes: 3,7,12,13 (4); Play=No: (0); pure leaf: Play=Yes
 ▫ Rain → Play=Yes: 4,5,10 (3); Play=No: 6,14 (2); split again on Temp (Hot/Mild/Cool)
How to generate decision trees?
• Optimal one
 ▫ Finding an optimal tree is NP-hard (or harder)
• Greedy one
 ▫ Greedy means asking the big questions first
 ▫ Strategy – divide and conquer
 ▫ Choose an easy-to-understand test such that the
   sub-data sets produced by the chosen test are the
   easiest to deal with
 ▫ Usually choose an attribute as the test
 ▫ Usually adopt an impurity measure to judge how easy
   the sub-data sets are to deal with
• Are there any other approaches? – there are many,
  and the question is open (a minimal greedy sketch follows)
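To make the greedy strategy concrete, here is a minimal ID3-style sketch in Python (my own illustration, not the course's reference code; the dict-based tree format and helper names are assumptions). Rows are dicts mapping attribute names to values; a leaf is a plain class label.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: sum of -p*log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute attr."""
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def build_tree(rows, labels, attrs):
    """Greedy divide and conquer: ask the 'biggest question' first, then recurse."""
    if len(set(labels)) == 1 or not attrs:        # pure node, or no tests left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    subtree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        subtree[value] = build_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx],
                                    [a for a in attrs if a != best])
    return {best: subtree}
```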
Impurity criteria
• Entropy → Information Gain, Information Gain Ratio
  ▫ Most popular
  ▫ Entropy – sum of -p*log(p) over the classes
  ▫ IG – Entropy(S) - Sum of Entropy(sub-data t) * |t|/|S|
  ▫ Plain IG favors high-arity attributes such as Social Security Number or ID
  ▫ Information Gain Ratio corrects for this
• Gini index → Gini Gain (used in CART)
  ▫ Related to the Area Under the Curve
  ▫ Gini index – 1 - sum of class fractions^2; Gini Gain is the reduction in it
• Misclassification rate
  ▫ (misclassified instances)/(all instances)
  ▫ Problematic – leads to many indistinguishable splits (where
    other splits are more desirable; see the sketch below)
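The contrast between Gini and misclassification rate is easy to see numerically. Below is a small sketch (the (4+, 4-) example split is a standard textbook illustration, not from the slides): misclassification rates two splits identically, while Gini prefers the one with a pure child.

```python
def gini(pos, neg):
    """Gini index: 1 - sum of class fractions squared."""
    n = pos + neg
    return 1 - (pos / n) ** 2 - (neg / n) ** 2

def miscls(pos, neg):
    """Misclassification rate: fraction of minority-class instances."""
    return min(pos, neg) / (pos + neg)

def split_gain(measure, parent, children):
    """Impurity reduction of a split; parent and children are (pos, neg) counts."""
    n = sum(parent)
    return measure(*parent) - sum((p + q) / n * measure(p, q) for p, q in children)

parent = (4, 4)
split_a = [(3, 1), (1, 3)]   # two mixed children
split_b = [(4, 2), (0, 2)]   # one pure child
for name, m in [("gini", gini), ("misclassification", miscls)]:
    print(name, split_gain(m, parent, split_a), split_gain(m, parent, split_b))
# misclassification: 0.25 vs 0.25  (indistinguishable)
# gini:              0.125 vs ~0.167 (prefers split B, with the pure child)
```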
Using IG for choosing attributes
[Figure: the split on Outlook]
Outlook (root) – Play=Yes: 3,4,5,7,9,10,11,12,13 (9); Play=No: 1,2,6,8,14 (5)
 ▫ Sunny → Play=Yes: 9,11 (2); Play=No: 1,2,8 (3)
 ▫ Overcast → Play=Yes: 3,7,12,13 (4); Play=No: (0)
 ▫ Rain → Play=Yes: 4,5,10 (3); Play=No: 6,14 (2)
IG(S) = Entropy(S) – Sum(|S_i|/|S|*Entropy(S_i))

IG(Outlook)= Entropy(Outlook)
        -|Sunny|/|Outlook|*Entropy(Sunny)
        -|Overcast|/|Outlook|*Entropy(Overcast)
        -|Rain|/|Outlook|*Entropy(Rain)

Entropy(Outlook) =-(9/14)*log(9/14)-(5/14)*log(5/14)
|Sunny|/|Outlook|*Entropy(Sunny) = 5/14*(-(2/5)*log(2/5)-(3/5)*log(3/5))
|Overcast|/|Outlook|*Entropy(Overcast) = 4/14*(-(4/4)*log(4/4)-(0/4)*log(0/4))
|Rain|/|Outlook|*Entropy(Rain) = 5/14*(-(3/5)*log(3/5)-(2/5)*log(2/5))
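A quick numerical check of the formula above (log base 2 assumed; the slide leaves the base unspecified). Note that the 0/4 term for Overcast is handled by the convention 0*log(0) = 0:

```python
import math

def H(*counts):
    """Entropy of a class-count distribution in bits, with 0*log(0) = 0."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

entropy_S = H(9, 5)                                        # whole data: 9 yes, 5 no
weighted = (5/14)*H(2, 3) + (4/14)*H(4, 0) + (5/14)*H(3, 2)
print(entropy_S, entropy_S - weighted)                     # ~0.940, IG(Outlook) ~0.246
```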
Zero Occurrence
• When a feature value never occurs in the training
  set → zero frequency → PANIC: a zero probability
  zeroes out every term it enters
• Smoothing the distribution
 ▫ Laplacian Smoothing
 ▫ Dirichlet Priors Smoothing
 ▫ and many more (Absolute Discounting, Jelinek-
   Mercer smoothing, Katz smoothing, Good-Turing
   smoothing, etc.)
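A minimal sketch of the simplest fix, add-one (Laplacian) smoothing; the function name is my own:

```python
def laplace(count, total, k):
    """Add-one (Laplacian) smoothed estimate of a probability over k outcomes."""
    return (count + 1) / (total + k)

# E.g. P(Play=Yes) over 14 examples and 2 classes: (9+1)/(14+2) = 10/16,
# which is exactly the first term in the smoothed Entropy(Outlook) below.
print(laplace(9, 14, 2))   # 0.625
```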
Calculating IG with Laplacian smoothing
[Figure: the same split on Outlook]
Outlook (root) – Play=Yes: 3,4,5,7,9,10,11,12,13 (9); Play=No: 1,2,6,8,14 (5)
 ▫ Sunny → Play=Yes: 9,11 (2); Play=No: 1,2,8 (3)
 ▫ Overcast → Play=Yes: 3,7,12,13 (4); Play=No: (0)
 ▫ Rain → Play=Yes: 4,5,10 (3); Play=No: 6,14 (2)
IG(S) = Entropy(S) – Sum(|S_i|/|S|*Entropy(S_i))

IG(Outlook)= Entropy(Outlook)
        -|Sunny|/|Outlook|*Entropy(Sunny)
        -|Overcast|/|Outlook|*Entropy(Overcast)
        -|Rain|/|Outlook|*Entropy(Rain)

Entropy(Outlook) =-(10/16)*log(10/16)-(6/16)*log(6/16)
|Sunny|/|Outlook|*Entropy(Sunny) = 6/17*(-(3/7)*log(3/7)-(4/7)*log(4/7))
|Overcast|/|Outlook|*Entropy(Overcast) = 5/17*(-(5/6)*log(5/6)-(1/6)*log(1/6))
|Rain|/|Outlook|*Entropy(Rain) = 6/17*(-(4/7)*log(4/7)-(3/7)*log(3/7))
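Checking the smoothed numbers the same way (log base 2 assumed; the branch weights 6/17, 5/17, 6/17 come from add-one counts over the three Outlook values, and the Overcast Play=No term is now 1/6 rather than 0/4, so log(0) never appears):

```python
import math

def H(*probs):
    """Entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy_root = H(10/16, 6/16)                  # smoothed class distribution
weighted = (6/17)*H(3/7, 4/7) + (5/17)*H(5/6, 1/6) + (6/17)*H(4/7, 3/7)
print(entropy_root - weighted)                 # smoothed IG(Outlook), ~0.07
```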
Overfitting
• Training set error
  ▫ Error of the classifier on the training data
  ▫ It is a bad idea to use up all the data for training → you will have no data
    left to evaluate the learning algorithm.
• Test set error
  ▫ Error of the classifier on the test data
  ▫ Jackknife (leave-one-out) – use n-1 examples to learn and 1 to test; repeat n times.
  ▫ x-fold stratified cross-validation – divide the data into x folds with the
    same class proportions; train on x-1 folds and test on 1 fold; repeat x
    times (see the sketch after this slide).
• Overfitting
  ▫   The input data is incomplete (Quine)
  ▫   The input data do not reflect all possible cases.
  ▫   The input data can include noise.
  ▫   I.e., fitting the classifier tightly to the input data is a bad idea.
• Occam’s razor
  ▫ Old axiom once used to prove the existence of God.
  ▫ “Plurality should not be posited without necessity”
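The cross-validation sketch referenced above, using scikit-learn for brevity (StratifiedKFold and DecisionTreeClassifier; the iris dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # x-1 folds to train
    scores.append(clf.score(X[test_idx], y[test_idx]))               # 1 fold to test
print(sum(scores) / len(scores))    # mean test-set accuracy over the 5 repeats
```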
Razors and Canon
• Occam's razor (Ockham's razor)
  ▫ "Plurality is not to be posited without necessity"
  ▫ Similar to a principle of parsimony
  ▫ If two hypotheses have almost equal predictive power, we
    prefer the more concise one.
• Hanlon's razor
  ▫ Never attribute to malice that which is adequately
    explained by stupidity.
• Morgan's Canon
  ▫ In no case is an animal activity to be interpreted in terms of
    higher psychological processes if it can be fairly interpreted
    in terms of processes which stand lower in the scale of
    psychological evolution and development.
Example: Playing Tennis
       (taken from Andrew Moore’s slides)
[Figure: two candidate root splits for the (9+, 5-) play-tennis data]
Humidity: High → (3+, 4-); Norm → (6+, 1-)
Wind: Weak → (6+, 2-); Strong → (3+, 3-)

I_Humidity = Sum over v in {High, Norm} and c in {p, not-p} of P(v, c) * log( P(v, c) / (P(v) * P(c)) ) ≈ 0.151
I_Wind = Sum over v in {Weak, Strong} and c in {p, not-p} of P(v, c) * log( P(v, c) / (P(v) * P(c)) ) ≈ 0.048
Prediction for Nodes

What is the prediction for each node?

(From Andrew Moore’s slides)
Recursively Growing Trees

[Figure: the original dataset is partitioned according to the value of the attribute
we split on: cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8]

(From Andrew Moore's slides)
Recursively Growing Trees

[Figure: a subtree is then built from the records in each partition
(cylinders = 4, 5, 6, 8)]

(From Andrew Moore's slides)
A Two Level Tree

[Figure: the tree after recursively growing one more level]
When Should We Stop Growing Trees?

[Figure: should we split this node?]
Base Cases
• Base Case One: If all records in current data subset have
  the same output then don’t recurse
• Base Case Two: If all records have exactly the same set of
  input attributes then don’t recurse
Base Cases: An idea
• Base Case One: If all records in current data subset have
  the same output then don’t recurse
• Base Case Two: If all records have exactly the same set of
  input attributes then don’t recurse

                   Proposed Base Case 3:

            If all attributes have zero information
                     gain then don’t recurse



                                       Is this a good idea?
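As a hint at the answer: no. The standard counterexample (my illustration, not taken from the slides) is XOR, where every attribute has zero information gain at the root even though a two-level tree classifies perfectly, so Base Case 3 would stop before learning anything:

```python
import math

def H(pos, neg):
    """Binary entropy from class counts, with 0*log(0) = 0."""
    n = pos + neg
    return -sum(c / n * math.log2(c / n) for c in (pos, neg) if c > 0)

records = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (a, b, y) with y = a XOR b
root = H(2, 2)
for attr in (0, 1):
    gain = root
    for v in (0, 1):
        ys = [y for *x, y in records if x[attr] == v]
        gain -= len(ys) / len(records) * H(ys.count(1), ys.count(0))
    print(f"IG(attribute {attr}) = {gain}")              # 0.0 for both attributes
```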
Old Topic: Overfitting
Pruning
• Prepruning (=forward pruning)

• Postpruning (=backward pruning)
 ▫ Reduced error pruning
 ▫ Rule post-pruning
Pruning Decision Tree
• Prepruning: stop growing the tree in time
• Postpruning: build the full decision tree as before,
  and when you can grow it no more, start to prune:
 ▫ Reduced error pruning
 ▫ Rule post-pruning
Reduced Error Pruning
• Split the data into a training set and a validation set
• Build a full decision tree over the training set
• Keep removing the node whose removal most increases
  validation set accuracy, until no removal helps (a sketch follows)
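A sketch of that loop, reusing the dict-based trees from the greedy builder earlier (the helper names are mine; for brevity a pruned node becomes a leaf predicting `default`, where a real implementation would use the node's local majority class):

```python
def predict(tree, row, default):
    """Walk a dict-based tree {attr: {value: subtree}} down to a leaf label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], default)
    return tree

def accuracy(tree, rows, labels, default):
    return sum(predict(tree, r, default) == y
               for r, y in zip(rows, labels)) / len(labels)

def paths(tree, path=()):
    """Yield the attribute/value path to every internal node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for v, child in tree[attr].items():
            yield from paths(child, path + ((attr, v),))

def pruned(tree, path, leaf):
    """Copy of tree with the subtree at `path` replaced by `leaf`."""
    if not path:
        return leaf
    (attr, v), rest = path[0], path[1:]
    children = dict(tree[attr])
    children[v] = pruned(children[v], rest, leaf)
    return {attr: children}

def reduced_error_prune(tree, val_rows, val_labels, default):
    """Keep applying the single prune that most increases validation accuracy."""
    best = accuracy(tree, val_rows, val_labels, default)
    while isinstance(tree, dict):
        cands = [pruned(tree, p, default) for p in paths(tree) if p]
        if not cands:
            return tree
        accs = [accuracy(t, val_rows, val_labels, default) for t in cands]
        hi = max(range(len(cands)), key=lambda i: accs[i])
        if accs[hi] < best:          # no prune helps: stop
            return tree
        tree, best = cands[hi], accs[hi]
    return tree
```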
Original Decision Tree
Pruned Decision Tree
Reduced Error Pruning
Rule Post-Pruning
• Convert the tree into rules (one rule per root-to-leaf path)
• Prune each rule by removing preconditions whose removal
  improves its estimated accuracy
• Sort the final rules by their estimated accuracy

 ▫ Most widely used method (e.g., C4.5)
 ▫ Other methods: statistical significance tests (chi-square)
Real Value Inputs
• What should we do to deal with real value inputs?
     mpg    cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
     good   4          97            75          2265    18.2          77         asia
     bad    6          199           90          2648    15            70         america
     bad    4          121           110         2600    12.8          77         europe
     bad    8          350           175         4100    13            73         america
     bad    6          198           95          3102    16.5          74         america
     bad    4          108           94          2379    16.5          73         asia
     bad    4          113           95          2228    14            71         asia
     bad    8          302           139         3570    12.8          78         america
     :      :          :             :           :       :             :          :
     good   4          120           79          2625    18.6          82         america
     bad    8          455           225         4425    10            70         america
     good   4          107           86          2464    15.5          76         europe
     bad    5          131           103         2830    15.9          78         europe
Information Gain
• x: a real value input
• t: split value
• Find the split value t that maximizes the mutual
  information I(x, y : t) between the thresholded test
  on x and the class label y (a sketch of the threshold
  search follows)
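A sketch of the threshold search (candidate split points taken midway between consecutive distinct values, a common convention; the example numbers are the first eight rows of the table above):

```python
import math

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(xs, ys):
    """Return the threshold t maximizing the information gain of the test x < t."""
    base = H(ys)
    best_t, best_gain = None, -1.0
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                       # midpoint between distinct values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = base - len(left)/len(ys)*H(left) - len(right)/len(ys)*H(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# E.g. split mpg quality on weight (first eight rows of the table above):
weights = [2265, 2648, 2600, 4100, 3102, 2379, 2228, 3570]
labels  = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad"]
print(best_split(weights, labels))
```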
Pros and Cons
• Pros
 ▫   Easy to understand
 ▫   Fast learning algorithms (because they are greedy)
 ▫   Robust to noise
 ▫   Good accuracy
• Cons
 ▫   Unstable
 ▫   Hard to represent some functions (Parity, XOR, etc.)
 ▫   Duplication in subtrees
 ▫   Cannot be used to express all first order logic because
     the test cannot refer to two or more different objects
Generation of data from a decision
tree (based on definition #2)
• Decision tree with support for each node → Rule set
 ▫ support = # of training instances assigned to a
   node
• Rule set → Instances (a sketch follows)
• In this way, one can combine multiple decision
  trees by combining their rule sets

• cf. Bayesian classifiers → Fractional instances
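A minimal sketch of the round trip (the rule format and numbers are my own illustration; the counts echo the play-tennis splits seen earlier):

```python
# Each rule: (preconditions, class label, support = #training instances at the leaf)
rules = [
    ({"Outlook": "Overcast"}, "Yes", 4),
    ({"Outlook": "Sunny", "Humidity": "High"}, "No", 3),
    ({"Outlook": "Sunny", "Humidity": "Norm"}, "Yes", 2),
]

def regenerate(rules):
    """Rule set -> instances: emit `support` copies of each rule's preconditions."""
    for pre, label, support in rules:
        for _ in range(support):
            yield dict(pre), label

print(len(list(regenerate(rules))))   # 9 regenerated (partial) instances
# Combining several trees = concatenating their rule sets before regenerating.
```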
Extensions and further considerations
• Extensions
 ▫ Alternating decision tree
 ▫ Naïve Bayes Tree
 ▫ Attribute Value Taxonomy guided Decision Tree
 ▫ Recursive Naïve Bayes
 ▫ and many more
• Further research
 ▫ Decision graph
 ▫ Bottom-up generation of decision trees
 ▫ Evolutionary construction of decision trees
 ▫ Integrating two decision trees
 ▫ and many more
