In this lecture, we talk about two different discriminative machine learning methods: decision trees and k-nearest neighbors. Decision trees are hierarchical structures; k-nearest neighbors is based on two principles: recollection and resemblance.
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors
1. Lecture 2:
Decision Trees – Nearest Neighbors
Machine Learning for Language Technology
Uppsala University, Department of Linguistics and Philology
September 2013 (last updated: 9 September 2013)
2. Most slides are adapted from previous courses (by E. Alpaydin and J. Nivre).
3. Supervised Classification
Divide instances into (two or more) classes
Instance (feature vector): x = (x_1, ..., x_m)
  Features may be categorical or numerical
Class (label): y
Training data: X = \{(x^t, y^t)\}_{t=1}^{N}
Classification in Language Technology:
  Spam filtering (spam vs. non-spam)
  Spelling error detection (error vs. no error)
  Text categorization (news, economy, culture, sport, ...)
  Named entity classification (person, location, organization, ...)
4. Concept = the thing to be learned (here, how to decide the genre of a new webpage)
Instances = examples (here, a webpage)
Features = attributes (here, morphological features, normalized counts)
Classes = labels (here, web genres)
5. Data format for the software you chose (here, ARFF)
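As a minimal sketch, a training file for the web-genre task could look like this in ARFF; the relation, attribute names, and values below are invented for illustration:

% Hypothetical web-genre training data in ARFF format
@relation webgenres

@attribute avg_word_length numeric
@attribute pronoun_ratio   numeric
@attribute genre           {news, blog, shop}

@data
4.7, 0.012, news
3.9, 0.031, blog
4.1, 0.005, shop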
6. Models for Classification
Generative probabilistic models:
  Model of P(x, y)
  Naive Bayes
Conditional probabilistic models:
  Model of P(y | x)
  Logistic regression (next time)
Discriminative models:
  No explicit probability model
  Decision trees, nearest neighbor classification (today)
  Perceptron, support vector machines, MIRA (next time)
7. Repeating…
Noise: data cleaning is expensive and time consuming
Margin
Inductive bias
Types of inductive biases:
  • Minimum cross-validation error
  • Maximum margin
  • Minimum description length
  • [...]
9. Decision Trees
Hierarchical tree structure for classification:
  Each internal node specifies a test of some feature
  Each branch corresponds to a value for the tested feature
  Each leaf node provides a classification for the instance
Represents a disjunction of conjunctions of constraints:
  Each path from root to leaf specifies a conjunction of tests
  The tree itself represents the disjunction of all paths
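As a sketch of this structure in Python (the features and values are hypothetical), each internal node is an if-test, each root-to-leaf path is a conjunction of tests, and the set of all paths is the disjunction:

# Minimal sketch: a decision tree as nested feature tests.
# The features ("outlook", "humidity") and classes are hypothetical.
def classify(instance):
    if instance["outlook"] == "sunny":      # internal node: test a feature
        if instance["humidity"] > 75:       # binary split on a numeric feature
            return "no"                     # leaf: class label
        return "yes"                        # path = conjunction of tests
    return "yes"                            # tree = disjunction of all paths

print(classify({"outlook": "sunny", "humidity": 80}))  # -> no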
11. Divide and Conquer
Internal decision nodes:
  Univariate: uses a single attribute, x_i
    Numeric x_i: binary split: x_i > w_m
    Discrete x_i: n-way split for n possible values
  Multivariate: uses all attributes, x
Leaves:
  Classification: class labels (or proportions)
  Regression: r average (or local fit)
Learning:
  Greedy recursive algorithm
  Find the best split partitioning the data X into subsets (X_1, ..., X_p), then induce a subtree for each X_i
12. Classification Trees (ID3, CART, C4.5)
For node m, N_m instances reach m, and N_m^i of them belong to class C_i:
  \hat{P}(C_i | x, m) = p_m^i = N_m^i / N_m
Node m is pure if p_m^i is 0 or 1
Measure of impurity is entropy:
  I_m = - \sum_{i=1}^{K} p_m^i \log_2 p_m^i
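In Python, this impurity measure can be sketched as follows (the class proportions p_m^i are computed from the labels that reach the node):

from collections import Counter
from math import log2

def entropy(labels):
    """I_m = -sum_i p_m^i * log2(p_m^i), with p_m^i = N_m^i / N_m."""
    n = len(labels)
    # log2(n/c) == -log2(c/n), so the sum is already negated
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

print(entropy(["A", "A", "B", "B"]))  # 1.0: maximally impure for two classes
print(entropy(["A", "A", "A"]))       # 0.0: pure node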
14. Best Split
If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
Find the test that minimizes impurity
Impurity after split with test t: N_{mj} of the N_m instances take branch j, and N_{mj}^i of them belong to C_i:
  \hat{P}(C_i | x, m, j) = p_{mj}^i = N_{mj}^i / N_{mj}
  I'_m(t) = - \sum_{j=1}^{n} (N_{mj} / N_m) \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i
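A sketch of the same computation in Python, reusing entropy() from above; the toy data and the feature names f1/f2 are made up:

def split_impurity(data, feature):
    """I'_m(t): impurity after an n-way split on 'feature'.
    data is a list of (feature-dict, label) pairs reaching node m."""
    branches = {}
    for x, y in data:
        branches.setdefault(x[feature], []).append(y)   # N_mj per branch j
    n = len(data)
    return sum(len(ys) / n * entropy(ys) for ys in branches.values())

data = [({"f1": "a", "f2": "b"}, "A"),
        ({"f1": "a", "f2": "c"}, "B"),
        ({"f1": "b", "f2": "c"}, "B")]
# The test that minimizes impurity: "f2" yields two pure branches (impurity 0)
best = min(["f1", "f2"], key=lambda f: split_impurity(data, f))
print(best)  # -> f2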
16. Information Gain
We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
  • Information gain tells us how important a given attribute of the feature vectors is.
  • We will use it to decide the ordering of attributes in the nodes of a decision tree.
17. Information Gain and Gain Ratio
Choosing the test that minimizes impurity maximizes the information gain (IG):
  IG_m(t) = I_m - I'_m(t)
Information gain prefers features with many values
The normalized version is called gain ratio (GR):
  V_m(t) = - \sum_{j=1}^{n} (N_{mj} / N_m) \log_2 (N_{mj} / N_m)
  GR_m(t) = IG_m(t) / V_m(t)
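Continuing the sketch (reusing entropy(), split_impurity(), Counter, and log2 from above):

def information_gain(data, feature):
    """IG_m(t) = I_m - I'_m(t)."""
    return entropy([y for _, y in data]) - split_impurity(data, feature)

def gain_ratio(data, feature):
    """GR_m(t) = IG_m(t) / V_m(t), where V_m(t) is the entropy of the
    branch proportions N_mj / N_m (penalizes many-valued features)."""
    n = len(data)
    sizes = Counter(x[feature] for x, _ in data).values()
    v = sum((c / n) * log2(n / c) for c in sizes)   # split information V_m(t)
    return information_gain(data, feature) / v if v else 0.0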
18. Pruning Trees
Decision trees are susceptible to overfitting
Remove subtrees for better generalization:
  Prepruning: early stopping (e.g., with an entropy threshold)
  Postpruning: grow the whole tree, then prune subtrees
Prepruning is faster; postpruning is more accurate (but requires a separate validation set)
19. Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
20. Learning Rules
Rule induction is similar to tree induction, but:
  tree induction is breadth-first
  rule induction is depth-first (one rule at a time)
Rule learning:
  A rule is a conjunction of terms (cf. tree path)
  A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests)
Sequential covering: generate rules one at a time until all positive examples are covered
  IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
21. Properties of Decision Trees
Decision trees are appropriate for classification when:
  Features can be both categorical and numeric
  Disjunctive descriptions may be required
  Training data may be noisy (missing values, incorrect labels)
  Interpretation of the learned model is important (rules)
Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of a complete hypothesis space)
23. Nearest Neighbor Classification
An old idea
Key components:
  Storage of old instances
  Similarity-based reasoning to new instances
"This 'rule of nearest neighbor' has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient." (Fix and Hodges, 1952)
24. k-Nearest Neighbors
Learning:
  Store training instances in memory
Classification:
  Given a new test instance x:
    Compare it to all stored instances
    Compute a distance between x and each stored instance x^t
    Keep track of the k closest instances (nearest neighbors)
    Assign to x the majority class of the k nearest neighbors
A geometric view of learning:
  Proximity in (feature) space → same class
  The smoothness assumption
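A minimal Python sketch of this procedure; the distance function is left as a parameter so that the metrics from the following slides can be plugged in:

from collections import Counter

def knn_classify(x, training, k, dist):
    """Assign to x the majority class of its k nearest stored instances.
    training is a list of (instance, label) pairs; ties are broken arbitrarily."""
    nearest = sorted(training, key=lambda pair: dist(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]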
25. Eager and Lazy Learning
Eager learning (e.g., decision trees):
  Learning – induce an abstract model from data
  Classification – apply the model to new data
Lazy learning (a.k.a. memory-based learning):
  Learning – store data in memory
  Classification – compare new data to the data in memory
  Properties:
    Retains all the information in the training set – no abstraction
    Complex hypothesis space – suitable for natural language?
    Main drawback – classification can be very inefficient
26. Dimensions of a k-NN Classifier
Distance metric:
  How do we measure the distance between instances?
  Determines the layout of the instance space
The k parameter:
  How large a neighborhood should we consider?
  Determines the complexity of the hypothesis space
27. Distance Metric 1
Overlap = count of mismatching features:
  \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)
  \delta(x_i, z_i) = |x_i - z_i| / (max_i - min_i)   if x_i, z_i numeric
                   = 0                               if x_i = z_i
                   = 1                               if x_i ≠ z_i
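A Python sketch of this metric; the default lo/hi range for numeric features is a placeholder assumption (in practice the per-feature min and max come from the training data):

def delta(xi, zi, lo=0.0, hi=1.0):
    """Per-feature distance: |x_i - z_i| / (max_i - min_i) if numeric,
    else 0 on a match and 1 on a mismatch."""
    if isinstance(xi, (int, float)):
        return abs(xi - zi) / (hi - lo)   # lo/hi: the feature's min/max
    return 0 if xi == zi else 1

def overlap(x, z):
    """Delta(x, z) = sum of the per-feature distances over the m features."""
    return sum(delta(xi, zi) for xi, zi in zip(x, z))

print(overlap(("a", "b", "a"), ("a", "c", "a")))  # -> 1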
28. Distance Metric 2
MVDM = Modified Value Difference Metric:
  \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)
  \delta(x_i, z_i) = \sum_{j=1}^{K} |P(C_j | x_i) - P(C_j | z_i)|
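A Python sketch of the per-feature MVDM distance; the conditional probabilities P(C_j | value) would be estimated from training counts, and the table below is hypothetical:

def mvdm_delta(xi, zi, p):
    """delta(x_i, z_i) = sum_j |P(C_j | x_i) - P(C_j | z_i)|.
    p maps a feature value to its class distribution {class: probability}."""
    classes = set(p[xi]) | set(p[zi])
    return sum(abs(p[xi].get(c, 0.0) - p[zi].get(c, 0.0)) for c in classes)

# Hypothetical distributions estimated from training counts:
p = {"a": {"A": 0.8, "B": 0.2}, "b": {"A": 0.3, "B": 0.7}}
print(mvdm_delta("a", "b", p))  # |0.8 - 0.3| + |0.2 - 0.7| = 1.0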
29. The k Parameter
Tunes the complexity of the hypothesis space:
  If k = 1, every instance has its own neighborhood
  If k = N, the whole feature space is one neighborhood
[Figure: decision boundaries for k = 1 vs. k = 15]
k can be tuned by the error on a validation set V:
  \hat{E} = E(h | V) = \sum_{t=1}^{M} 1(h(x^t) \neq r^t)
30. A Simple Example
Training set:
  1. (a, b, a, c) → A
  2. (a, b, c, a) → B
  3. (b, a, c, c) → C
  4. (c, a, b, c) → A
New instance:
  5. (a, b, b, a)
Distances (overlap):
  Δ(1, 5) = 2
  Δ(2, 5) = 1
  Δ(3, 5) = 4
  Δ(4, 5) = 3
k-NN classification:
  1-NN(5) = B
  2-NN(5) = A/B (tie)
  3-NN(5) = A
  4-NN(5) = A
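The example can be checked with a few lines of Python (purely categorical overlap):

train = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
         (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")
overlap = lambda x, z: sum(xi != zi for xi, zi in zip(x, z))
ranked = sorted((overlap(x, new), y) for x, y in train)
print(ranked)  # [(1, 'B'), (2, 'A'), (3, 'A'), (4, 'C')] -> 3-NN(5) = A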
31. Further Variations on k-NN
Feature weights:
  The overlap metric gives all features equal weight
  Features can be weighted by IG or GR
Weighted voting:
  The normal decision rule gives all neighbors equal weight
  Instances can be weighted by (inverse) distance
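A sketch of distance-weighted voting in Python; the small constant in the denominator is an assumption to avoid division by zero for exact matches:

from collections import Counter

def weighted_knn(x, training, k, dist):
    """Each of the k nearest neighbors votes with weight 1 / distance."""
    nearest = sorted(training, key=lambda pair: dist(x, pair[0]))[:k]
    votes = Counter()
    for xt, label in nearest:
        votes[label] += 1.0 / (dist(x, xt) + 1e-9)  # small constant: avoid /0
    return votes.most_common(1)[0][0]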
32. Properties of k-NN
Nearest neighbor classification is appropriate when:
  Features can be both categorical and numeric
  Disjunctive descriptions may be required
  Training data may be noisy (missing values, incorrect labels)
  Fast classification is not crucial
Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness)
  2. All features are equally important (without feature weights)
  3. Complexity is tuned by the k parameter
33. Reading
  Alpaydin (2010): Ch. 3, 9
  Daumé (2012): Ch. 1–2
  Witten and Frank (2005): Ch. 2
  Mitchell (1997): Ch. 3, Decision Tree Learning
34. End of Lecture 2
Thanks for your attention