WHERE WE ARE
Informatics
• management, manipulation, integration
• emphasis on scale, some emphasis on tools
Analytics
• statistical estimation and prediction
• machine learning, data mining
Visualization
• communication and presentation
Bill Howe, UW
WHAT IS MACHINE LEARNING?
“Systems that automatically learn programs from data”
Teaching a computer about the world
[Domingos 2012]
[Mark Dredze]
WHAT’S THE DIFFERENCE BETWEEN STATISTICS
AND MACHINE LEARNING?
Leo Breiman,Statistical Modeling:The Two Cultures,
Statistical Science 16(3),2001
One view (statistics): emphasis on stochastic models of nature
The other view (machine learning): find a function that predicts y from x; no model of nature implied or needed
TOY EXAMPLE
[Witten]
hypothesis: we only play when it’s sunny?
No
hypothesis: we don’t play if it’s rainy and windy?
No
Goal: Predict when we play
TERMINOLOGY
classification
• The learned attribute is categorical
(“nominal”)
regression
• The learned attribute is numeric
TERMINOLOGY
Supervised Learning (“Training”)
•  We are given examples of inputs and associated outputs
•  We learn the relationship between them
Unsupervised Learning (sometimes: “Mining”)
•  We are given inputs, but no outputs
•  unlabeled data
•  Learn the “latent” labels
•  Ex: Clustering, dimension reduction
EXAMPLE: DOCUMENT CLASSIFICATION
“The Falcons trounced
the Saints on Sunday”
Sports
“The Mars Rover discovered
organic molecules on
Sunday”
Science
How do we set this up?
What are the rows and columns of our decision
table?
EXAMPLE: CONSTRUCTING THE DOCUMENT MATRIX
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
     dagger  die  new-hampshir  free  happi  live  new-england  motto  romeo  juliet
d1   0       0    0             0     0      0     0            0      1      1
d2   1       0    0             0     1      0     0            0      0      1
d3   1       1    0             0     0      0     0            0      1      0
d4   0       1    1             1     0      1     0            1      0      0
d5   0       0    1             0     0      0     1            0      0      0
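The document matrix can be rebuilt with a few lines of Python. This is a minimal sketch, not the method used to produce the slide: it assumes stemming and stop-word removal have already been done, so the token lists below are hand-written stand-ins for that preprocessing, and the function name `document_matrix` is hypothetical.

```python
# Assumed input: tokens after stemming and stop-word removal (hand-written here,
# chosen to match the vocabulary shown on the slide).
docs = {
    "d1": ["romeo", "juliet"],
    "d2": ["juliet", "happi", "dagger"],
    "d3": ["romeo", "die", "dagger"],
    "d4": ["live", "free", "die", "new-hampshir", "motto"],
    "d5": ["new-hampshir", "new-england"],
}
vocabulary = ["dagger", "die", "new-hampshir", "free", "happi",
              "live", "new-england", "motto", "romeo", "juliet"]


def document_matrix(docs, vocabulary):
    """Return one binary row per document: 1 if the term occurs, else 0."""
    return {name: [1 if term in tokens else 0 for term in vocabulary]
            for name, tokens in docs.items()}


matrix = document_matrix(docs, vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row is a document and each column is a term, which is the row/column layout a classifier would take as input.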
EXAMPLE: DOCUMENT CLASSIFICATION
Supervised Learning Problem
• A human assigns a topic label to each document in a
corpus
• The algorithm learns how to predict the label
Unsupervised Learning Problem
• No labels are given
• Discover groups of similar documents
LEARNING = THREE CORE COMPONENTS
Representation
Evaluation
Optimization
Pedro Domingos, A Few Useful Things to Know about Machine Learning,
CACM 55(10), 2012
LEARNING = THREE CORE COMPONENTS
Representation
•  What exactly is your classifier?
•  A hyperplane that separates the two classes?
•  A decision tree?
•  A neural network?
Evaluation
Optimization
LEARNING = THREE CORE COMPONENTS
Representation
Evaluation
•  How do we know if a given classifier is good or bad?
•  # of errors on some test set?
•  Precision and recall?
•  Squared error?
•  Likelihood?
Optimization
LEARNING = THREE CORE COMPONENTS
Representation
Evaluation
Optimization
•  How do you search among all the alternatives?
•  Greedy search?
•  Gradient descent?
TITANIC DATASET
A VERY NAÏVE CLASSIFIER
pclass sex age sibsp parch fare cabin embarked
Does the new data point x* exactly match a previous point xi?
If so, assign it to the same class as xi
Otherwise, just guess.
This is the “rote” classifier
A MINOR IMPROVEMENT
Does the new data point x* match a set of previous points xi on some specific attribute?
If so, take a vote to determine class.
Example: If most females survived, then assume every female survives
But there are lots of possible rules like this.
And an attribute can have more than two values.
If most people under 4 years old survive, then assume everyone under 4 survives
If most people with 1 sibling survive, then assume everyone with 1 sibling survives
How do we choose?
IF sex = ‘female’ THEN survive = yes
ELSE IF sex = ‘male’ THEN survive = no
confusion matrix
 no  yes   <-- actual class
468  109 | classified as no
 81  233 | classified as yes
Not bad!
$$\frac{468 + 233}{468 + 109 + 81 + 233} = \frac{701}{891} \approx 79\%$$
79% correct (and 21% incorrect)
IF pclass=‘1’ THEN survive=yes
ELSE IF pclass=‘2’ THEN survive=yes
ELSE IF pclass=‘3’ THEN survive=no
confusion matrix
 no  yes   <-- actual class
372  119 | classified as no
177  223 | classified as yes
a bit worse…
$$\frac{372 + 223}{372 + 119 + 223 + 177} = \frac{595}{891} \approx 67\%$$
67% correct (and 33% incorrect)
1-RULE
For each attribute A:
    For each value V of that attribute, create a rule:
        1. count how often each class appears
        2. find the most frequent class, c
        3. make a rule "if A=V then Class=c"
    Calculate the error rate of these rules
Pick the attribute whose rules produce the lowest error rate
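The 1-rule procedure is short enough to sketch directly. The version below is an illustration rather than a reference implementation: it runs on the 14-record weather table from the toy example instead of the Titanic data, the names `ROWS`, `ATTRIBUTES` and `one_rule` are hypothetical, and ties between attributes are broken by keeping the first one found.

```python
from collections import Counter, defaultdict

# Weather table: (outlook, temperature, humidity, windy, play)
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "high", True, "no"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
]


def one_rule(rows, attributes):
    """Return (attribute, rules, errors) for the attribute with the fewest errors."""
    best = None
    for index, attribute in enumerate(attributes):
        # Count how often each class appears for each value of the attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[index]][row[-1]] += 1
        # For each value, predict its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the rows that fall outside the predicted class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attribute, rules, errors)
    return best


attribute, rules, errors = one_rule(ROWS, ATTRIBUTES)
print(attribute, rules, f"error rate = {errors}/{len(ROWS)}")
```

On this table outlook and humidity both make 4 errors out of 14, so the choice between them depends only on the tie-breaking rule.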
HOW FAR CAN WE GO?
IF pclass=‘1’ AND sex=‘female’ THEN survive=yes
IF pclass=‘2’ AND sex=‘female’ THEN survive=yes
IF pclass=‘3’ AND sex=‘female’ AND age < 4 THEN survive=yes
IF pclass=‘3’ AND sex=‘female’ AND age >= 4 THEN survive=no
IF pclass=‘2’ AND sex=‘male’ THEN survive=no
IF pclass=‘3’ AND sex=‘male’ THEN survive=no
IF pclass=‘1’ AND sex=‘male’ AND age < 5 THEN survive=yes
…
SEQUENTIAL COVERING
Initialize R to the empty set
for each class C {
    while D is nonempty {
        Construct one rule r that correctly classifies
        some instances in D that belong to class C and
        does not incorrectly classify any non-C instances
        Add rule r to ruleset R
        Remove from D all instances correctly classified by r
    }
}
return R
src:Alvarez
SEQUENTIAL COVERING: FINDING NEXT RULE FOR CLASS C
src: Alvarez
Initialize A as the set of all attributes over D
while r incorrectly classifies some non-C instances of D {
    write r as ant(r) => C
    for each attribute-value pair (a=v),
    where a belongs to A and v is a value of a,
    compute the accuracy of the rule
        ant(r) and (a=v) => C
    let (a*=v*) be the attribute-value pair of
    maximum accuracy over D; in case of a tie,
    choose the pair that covers the greatest
    number of instances of D
    update r by adding (a*=v*) to its antecedent:
        r = ( ant(r) and (a*=v*) ) => C
    remove the attribute a* from the set A:
        A = A - {a*}
}
STRATEGIES FOR LEARNING EACH RULE
General-to-Specific
•  Start with an empty rule
•  Add constraints to eliminate negative examples
•  Stop when only positives are covered
Specific-to-General (not shown)
•  Start with a rule that identifies a single random instance
•  Remove constraints in order to cover more positives
•  Stop when further generalization results in covering negatives
CONFLICTS
If more than one rule is triggered
•  Choose the “most specific” rule
•  Use domain knowledge to order rules by priority
RECAP
Representation
• A set of rules: IF…THEN conditions
Evaluation
• coverage: # of data points that satisfy conditions
• accuracy = # of correct predictions / coverage
Optimization
• Build rules by finding conditions that maximize
accuracy
One rule is easy to interpret, but a complex set of rules probably isn’t
HOW FAR CAN WE GO?
We might consider grouping redundant conditions
IF pclass=‘1’ THEN
    IF sex=‘female’ THEN survive=yes
    IF sex=‘male’ AND age < 5 THEN survive=yes
IF pclass=‘2’ THEN
    IF sex=‘female’ THEN survive=yes
    IF sex=‘male’ THEN survive=no
IF pclass=‘3’ THEN
    IF sex=‘male’ THEN survive=no
    IF sex=‘female’ THEN
        IF age < 4 THEN survive=yes
        IF age >= 4 THEN survive=no
A decision tree
HOW FAR CAN WE GO?
[Figure: decision tree with nodes sex (female / male), pclass (1 / 2 / 3) and age (<=4 / >4), and leaves “survive” / “not survive”]
Every path from the root is a rule
EXAMPLE: DOCUMENT CLASSIFICATION
[Figure: decision tree with nodes Includes “Sunday”, Includes “Saints” and Includes “Mars”, yes/no branches, and leaves “Sports” / “Science”]
ASIDE ON ENTROPY
Consider two sequences of coin flips:
HHTHTTHHHHTTHTHTHTTTT….
TTHHTTHTHTTTTHHHTHTTT….
How much information do we get after flipping each coin once?
We want some function “Information” that satisfies:
$$\text{Information}_{1\&2}(p_1 p_2) = \text{Information}_1(p_1) + \text{Information}_2(p_2)$$
$$I(x) = -\log_2 p_x$$
Expected Information = “Entropy”
$$H(X) = E(I(X)) = \sum_x p_x I(x) = -\sum_x p_x \log_2 p_x$$
EXAMPLE: FLIPPING A COIN
$$\text{Entropy} = -\sum_i p_i \log_2 p_i = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$$
EXAMPLE: ROLLING A DIE
$$p_1 = \frac{1}{6},\ p_2 = \frac{1}{6},\ p_3 = \frac{1}{6},\ \ldots$$
$$\text{Entropy} = -\sum_i p_i \log_2 p_i = -6 \times \left(\frac{1}{6} \log_2 \frac{1}{6}\right) \approx 2.58$$
EXAMPLE: ROLLING A WEIGHTED DIE
$$p_1 = 0.1,\ p_2 = 0.1,\ p_3 = 0.1,\ \ldots,\ p_6 = 0.5$$
$$\text{Entropy} = -\sum_i p_i \log_2 p_i = -5 \times (0.1 \log_2 0.1) - 0.5 \log_2 0.5 = 2.16$$
The weighted die is less unpredictable than a fair die (2.16 bits versus 2.58 bits)
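The three entropy examples can be checked numerically. The sketch below is a direct transcription of the entropy formula; the function name `entropy` is an arbitrary choice, and outcomes with zero probability are skipped by convention.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


coin = entropy([0.5, 0.5])
fair_die = entropy([1 / 6] * 6)
weighted_die = entropy([0.1] * 5 + [0.5])
print(round(coin, 2), round(fair_die, 2), round(weighted_die, 2))
```

The uniform distribution gives the largest value for a given number of outcomes, which is why the fair die scores higher than the weighted one.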
HOW UNPREDICTABLE IS YOUR DATA?
342/891 survivors in titanic training set
$$-\left(\frac{342}{891} \log_2 \frac{342}{891} + \frac{549}{891} \log_2 \frac{549}{891}\right) = 0.96$$
Say there were only 50 survivors
$$-\left(\frac{50}{891} \log_2 \frac{50}{891} + \frac{841}{891} \log_2 \frac{841}{891}\right) = 0.31$$
BACK TO DECISION TREES
Which attribute do we choose at each level?
The one with the highest information gain
•  The one that reduces the unpredictability the most
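Before: 14 records, 9 are “yes”
$$-\left(\frac{9}{14} \log_2 \frac{9}{14} + \frac{5}{14} \log_2 \frac{5}{14}\right) = 0.94$$
If we choose outlook:
overcast : 4 records, 4 are “yes”
$$-\left(\frac{4}{4} \log_2 \frac{4}{4}\right) = 0$$
rainy : 5 records, 3 are “yes”
$$-\left(\frac{3}{5} \log_2 \frac{3}{5} + \frac{2}{5} \log_2 \frac{2}{5}\right) = 0.97$$
sunny : 5 records, 2 are “yes”
$$-\left(\frac{2}{5} \log_2 \frac{2}{5} + \frac{3}{5} \log_2 \frac{3}{5}\right) = 0.97$$
Expected new entropy:
$$\frac{4}{14} \times 0.0 + \frac{5}{14} \times 0.97 + \frac{5}{14} \times 0.97 = 0.69$$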
outlook temperature	 humidity	 windy	 play	
overcast cool	 normal	 TRUE	 yes	
overcast hot	 high	 FALSE	 yes	
overcast hot	 normal	 FALSE	 yes	
overcast mild	 high	 TRUE	 yes	
rainy cool	 normal	 TRUE	 no	
rainy mild	 high	 TRUE	 no	
rainy cool	 normal	 FALSE	 yes	
rainy mild	 high	 FALSE	 yes	
rainy mild	 normal	 FALSE	 yes	
sunny hot	 high	 FALSE	 no	
sunny hot	 high	 TRUE	 no	
sunny mild	 high	 FALSE	 no	
sunny cool	 normal	 FALSE	 yes	
sunny mild	 normal	 TRUE	 yes
If we choose temperature:
cool : 4 records, 3 are “yes”
= 0.81
hot : 4 records, 2 are “yes”
= 1.0
mild : 6 records, 4 are “yes”
= 0.92
Expected new entropy:
0.81(4/14) + 1.0(4/14) + 0.92(6/14) = 0.91
If we choose humidity:
normal : 7 records, 6 are “yes”
= 0.59
high : 7 records, 3 are “yes”
= 0.99
Expected new entropy:
0.59(7/14) + 0.99(7/14) = 0.79
If we choose windy:
FALSE : 8 records, 6 are “yes”
= 0.81
TRUE : 6 records, 3 are “yes”
= 1.0
Expected new entropy:
0.81(8/14) + 1.0(6/14) = 0.89
Information gain = entropy before − expected new entropy:
outlook: 0.94 − 0.69 = 0.25 (highest gain)
temperature: 0.94 − 0.91 = 0.03
humidity: 0.94 − 0.79 = 0.15
windy: 0.94 − 0.89 = 0.05
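The gain of each attribute can be recomputed from counts alone. The sketch below assumes each partition is summarised as a (records, “yes” count) pair tallied from the 14-record weather table; the helper names `binary_entropy` and `information_gain` are hypothetical.

```python
import math


def binary_entropy(n_records, n_yes):
    """Entropy in bits of a yes/no label with n_yes positives out of n_records."""
    result = 0.0
    for count in (n_yes, n_records - n_yes):
        if count > 0:
            p = count / n_records
            result -= p * math.log2(p)
    return result


def information_gain(before, partitions):
    """before and each partition are (n_records, n_yes) pairs."""
    total = before[0]
    expected = sum(n / total * binary_entropy(n, yes) for n, yes in partitions)
    return binary_entropy(*before) - expected


# (records, "yes" count) per attribute value, tallied from the weather table.
splits = {
    "outlook": [(4, 4), (5, 3), (5, 2)],       # overcast, rainy, sunny
    "temperature": [(4, 3), (4, 2), (6, 4)],   # cool, hot, mild
    "humidity": [(7, 6), (7, 3)],              # normal, high
    "windy": [(8, 6), (6, 3)],                 # FALSE, TRUE
}
gains = {name: information_gain((14, 9), parts) for name, parts in splits.items()}
best = max(gains, key=gains.get)
print(best, {name: round(g, 2) for name, g in gains.items()})
```

Working from counts keeps the arithmetic visible; a full implementation would derive the counts from the records themselves.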
DOCUMENT CLASSIFICATION
Includes
“Falcons”
yes no
Includes “Mars”
yes no
5% “Sports”
95% “Science”
50% “Sports”
50% “Science”
50% “Sports”
50% “Science”
50% “Sports”
50% “Science”
BUILDING A DECISION TREE (ID3 ALGORITHM)
Assume attributes are discrete
•  Discretize continuous attributes
Choose the attribute with the highest Information Gain
Create branches for each value of attribute
Examples partitioned based on selected attributes
Repeat with remaining attributes
Stopping conditions
•  All examples assigned the same label
•  No examples left
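A compact recursive version of these steps might look like the sketch below. It is an illustration under simplifying assumptions, not a full ID3: rows are dictionaries, the five-row data set is hypothetical and chosen only to keep the example short, and because branches are created only for values that actually occur, the "no examples left" case never arises here.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def partition(rows, attribute):
    """Group rows (dicts) by their value for one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def id3(rows, attributes, target):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [row[target] for row in rows]
    # Stop when all examples share a label or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attribute):
        groups = partition(rows, attribute).values()
        expected = sum(len(g) / len(rows) * entropy([r[target] for r in g])
                       for g in groups)
        return entropy(labels) - expected

    best = max(attributes, key=gain)
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(group, remaining, target)
                   for value, group in partition(rows, best).items()}}


# Hypothetical data set, not taken from the slides.
rows = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "yes"},
]
tree = id3(rows, ["outlook", "windy"], "play")
print(tree)
```

The nested dictionary mirrors the tree: each key is the attribute tested at a node and each inner key is a branch value.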
PROBLEMS
Expensive to train
Prone to overfitting
• Drive to perfection on training data, bad on test data
• Pruning can help: remove or aggregate subtrees that
provide little discriminatory power (C4.5)
C4.5 EXTENSIONS
Continuous Attributes
outlook temperature	 humidity	 windy	 play	
overcast cool	 60	 TRUE	 yes	
overcast hot	 80	 FALSE	 yes	
overcast hot	 63	 FALSE	 yes	
overcast mild	 81	 TRUE	 yes	
rainy cool	 58	 TRUE	 no	
rainy mild	 90	 TRUE	 no	
rainy cool	 54	 FALSE	 yes	
rainy mild	 92	 FALSE	 yes	
rainy mild	 59	 FALSE	 yes	
sunny hot	 90	 FALSE	 no	
sunny hot	 89	 TRUE	 no	
sunny mild	 90	 FALSE	 no	
sunny cool	 60	 FALSE	 yes	
sunny mild	 62	 TRUE	 yes
outlook temperature	 humidity	 windy	 play	
rainy mild	 54	 FALSE	 yes	
overcast hot	 58	 FALSE	 yes	
overcast cool	 59	 TRUE	 yes	
rainy cool	 60	 FALSE	 yes	
overcast mild	 60	 TRUE	 yes	
overcast hot	 62	 FALSE	 yes	
rainy mild	 63	 TRUE	 no	
sunny cool	 80	 FALSE	 yes	
rainy mild	 81	 FALSE	 yes	
sunny mild	 89	 TRUE	 yes	
sunny hot	 90	 FALSE	 no	
rainy cool	 90	 TRUE	 no	
sunny hot	 90	 TRUE	 no	
sunny mild	 92	 FALSE	 no	
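Consider every possible binary partition; choose the partition with the highest gain
Split between humidity 62 and 63: 6 records on one side (all “yes”), 8 on the other (3 “yes”, 5 “no”)
$$E\left(\frac{6}{6}\right) = 0.0, \qquad E\left(\frac{3}{8}, \frac{5}{8}\right) = 0.95$$
$$\text{Expect} = \frac{8}{14}(0.95) + \frac{6}{14}(0) = 0.54$$
Split between humidity 89 and 90: 10 records on one side (9 “yes”, 1 “no”), 4 on the other (all “no”)
$$E\left(\frac{4}{4}\right) = 0.0, \qquad E\left(\frac{9}{10}, \frac{1}{10}\right) = 0.47$$
$$\text{Expect} = \frac{10}{14}(0.47) + \frac{4}{14}(0) = 0.33$$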
WHERE WE ARE
Supervised learning and classification problems
•  Predict a class label based on other attributes
Rules
•  To start, we guessed simple rules that might explain the data
•  But the relationships are complex, so we need to automate
•  The 1-rule algorithm
•  A sequential cover algorithm for sets of rules with complex conditions
•  But: Sets of rules are hard to interpret
Decision trees
•  Each path from the root is a rule; easy to interpret
•  Use entropy to choose best attribute at each node
•  Extensions for numeric attributes
•  But: Decision Trees are prone to overfitting
1/16/16
Bill Howe, Data Science, Autumn 2012
OVERFITTING
What if the knowledge and data we have are not sufficient to
completely determine the correct classifier? Then we run the risk of
just hallucinating a classifier (or parts of it) that is not grounded in
reality, and is simply encoding random quirks in the data.
This problem is called overfitting, and is the bugbear of machine
learning. When your learner outputs a classifier that is 100% accurate
on the training data but only 50% accurate on test data, when in fact it
could have output one that is 75% accurate on both, it has overfit.
Pedro Domingos, A Few Useful Things to Know About Machine Learning, CACM
55(10), 2012
OVERFITTING
Low error on training data and high error on
test data
OVERFITTING
Perhaps not the most
useful intuition
OVERFITTING
Overfitting.png by Dake.
training set
test set
A better image to
remember
Is the model able to generalize? Can it deal
with unseen data, or does it overfit the data?
Test on hold-out data:
• split data to be modeled in training and test set
• train the model on training set
• evaluate the model on the training set
• evaluate the model on the test set
• difference between the fit on training data and test
data measures the model’s ability to generalize
slide src: Frank Keller
src: Domingos 2012
Underfitting:
High Bias
Low Variance
Overfitting:
Low Bias
High Variance
EVALUATION
Division into training and test sets
• Fixed
•  Leave out random N% of the data
• k-fold Cross-Validation
•  Split the data into k folds without replacement; hold out each fold in turn
• Leave-One-Out Cross Validation
•  Special case: k equals the number of examples
• Related: Bootstrap
•  Generate new training sets by sampling with replacement
LEAVE-ONE-OUT CROSS-VALIDATION (LOOCV)
For each training example (xi, yi)
Train a classifier with all training data except (xi, yi)
Test the classifier’s accuracy on (xi, yi)
LOOCV accuracy = average of all n accuracies
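The LOOCV loop can be sketched generically. In the example below the learner is a deliberately trivial stand-in (it always predicts the most common training label) and the six examples are made up; both are assumptions for illustration, and any function that maps a training set to a predictor could be passed as `train`.

```python
from collections import Counter


def majority_classifier(training):
    """Hypothetical stand-in learner: always predicts the most common training label."""
    label = Counter(y for _, y in training).most_common(1)[0][0]
    return lambda x: label


def loocv_accuracy(examples, train):
    """Average accuracy over n rounds, each holding out one (x, y) example."""
    correct = 0
    for i, (x, y) in enumerate(examples):
        # Train on everything except example i, then test on example i.
        model = train(examples[:i] + examples[i + 1:])
        correct += model(x) == y
    return correct / len(examples)


# Made-up data: four "yes" and two "no" examples.
examples = [(1, "yes"), (2, "yes"), (3, "yes"), (4, "yes"), (5, "no"), (6, "no")]
accuracy = loocv_accuracy(examples, majority_classifier)
print(accuracy)
```

Because the model is retrained n times, LOOCV is expensive for slow learners; k-fold with a small k is the usual compromise.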
ACCURACY
Confusion Matrix
           Predicted +   Predicted -
True +     a             b
True -     c             d
$$\text{Accuracy} = \frac{a + d}{a + b + c + d}$$
EVALUATION:ACCURACY ISN’T ALWAYS ENOUGH
How do you interpret 90% accuracy?
•  You can’t; it depends on the problem
Need a baseline:
•  Base Rate
•  Accuracy of trivially predicting the most-frequent class
•  Random Rate
•  Accuracy of making a random class assignment
•  Might apply prior knowledge to assign random distribution
•  Naïve Rate
•  Accuracy of some simple default or pre-existing model
•  Ex: “All females survived”
ACCURACY
Confusion Matrix
                 Predicted Positive   Predicted Negative
True Positive    a                    b
True Negative    c                    d
$$\text{Accuracy} = \frac{a + d}{a + b + c + d}$$
$$\text{lift} = \frac{\%\ \text{positives} > \text{threshold}}{\%\ \text{dataset} > \text{threshold}} = \frac{a / (a + b)}{(a + c) / (a + b + c + d)}$$
ACCURACY
Confusion Matrix
                 Predicted Positive   Predicted Negative
True Positive    a                    b
True Negative    c                    d
$$\text{Accuracy} = \frac{a + d}{a + b + c + d}$$
$$\text{precision} = \frac{a}{a + c}$$
$$\text{recall} = \frac{a}{a + b}$$
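As a worked instance, the three measures can be computed for the sex-based Titanic rule. The cell assignment below is an assumption about how to read that rule's counts: 233 survivors predicted to survive, 109 survivors predicted to die, 81 non-survivors predicted to survive and 468 non-survivors predicted to die, which is the reading whose totals match the 342 survivors in the training set. The function name `metrics` is hypothetical.

```python
def metrics(a, b, c, d):
    """a = true positives, b = false negatives, c = false positives, d = true negatives."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + c),
        "recall": a / (a + b),
    }


# Assumed reading of the sex-based rule, with "yes" (survived) as the positive class.
result = metrics(a=233, b=109, c=81, d=468)
print({name: round(value, 3) for name, value in result.items()})
```

Precision and recall separate the two kinds of error that accuracy lumps together, which matters when one class is much rarer than the other.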
ROC PLOT
“Receiver Operating Characteristic”
•  Historical term from WW2
•  Used to measure accuracy of radar operators
Confusion Matrix
                 Predicted Positive   Predicted Negative
True Positive    a                    b
True Negative    c                    d
$$\text{Accuracy} = \frac{a + d}{a + b + c + d}$$
$$\text{sensitivity} = \frac{a}{a + b}$$
$$1 - \text{specificity} = 1 - \frac{d}{c + d}$$
src: Cesar Souza
THE BOOTSTRAP
Given a dataset of size N
Draw N samples with replacement to create a new
dataset
Repeat ~1000 times
You now have ~1000 sample datasets
•  All drawn from the same population
•  You can compute ~1000 sample statistics
•  You can interpret these as repeated experiments, which is exactly
what the frequentist perspective calls for
Very elegant use of computational resources
Efron 1979
BOOTSTRAP EXAMPLE
sample         mean
1 2 3 4 5 6    3.50   (original data)
4 3 4 2 1 6    3.33
2 3 6 1 3 5    3.33
5 1 1 2 3 6    3.00
2 5 2 6 3 4    3.67
4 4 4 2 1 3    3.00
3 4 5 3 2 1    3.00
1 2 3 6 6 1    3.17
5 2 3 1 4 5    3.33
THE BOOTSTRAP
Example:
Generate 1000 samples and 1000 linear regressions
You want a 90% confidence interval for the slope?
Just take the 5th percentile and the 95th percentile!
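The percentile recipe can be sketched with the standard library alone. To stay short, the example below bootstraps the mean of the six die faces rather than a regression slope; that substitution, the fixed seed and the function name `bootstrap_interval` are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_interval(data, statistic, n_resamples=1000, confidence=0.90, seed=0):
    """Percentile bootstrap interval for a statistic of the data."""
    rng = random.Random(seed)
    # Each resample draws len(data) items with replacement.
    estimates = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper


data = [1, 2, 3, 4, 5, 6]
low, high = bootstrap_interval(data, statistics.mean)
print(low, high)
```

Swapping `statistics.mean` for a function that fits a line and returns its slope would give the interval described on the slide.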
ENSEMBLES:COMBINING CLASSIFIERS
Can a set of weak classifiers be combined to derive a
strong classifier? Yes!
Average results from different models
Why?
•  Better classification performance than individual classifiers
•  More resilience to noise
Why not?
•  Time consuming
•  Models become difficult to explain
“Wisdom of the (simulated) crowd”
Freund, Schapire 1995
BAGGING
Draw N bootstrap samples
Retrain the model on each sample
Average the results
• Regression: Averaging
• Classification: Majority vote
Works great for overfit models
• Decreases variance without changing bias
• Doesn’t help much with underfit/high bias models
•  Insensitive to the training data
BOOSTING
Instead of selecting data points randomly with the bootstrap,
favor the misclassified points
Initialize the weights
Repeat:
Resample with respect to weights
Retrain the model
Recompute weights
For each step t:
•  $x_i, y_i$: ith example, ith label
•  $D_t(i)$: weights; probability of selecting example i in the sample
•  $h_t$: trained classifier at step t, using a sample drawn according to $D_t$
Sum of weights for misclassified examples:
$$\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$$
Odds of misclassifying:
$$\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$$
Adjust weights down for correctly classified examples:
$$D_{t+1}(i) = \beta_t D_t(i)$$
…and normalize to make sure weights sum to 1
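One round of the weight update can be traced in a few lines. This sketch covers only the reweighting step, not resampling or retraining; it assumes the error is strictly between 0 and 1, and the function name `update_weights` is hypothetical.

```python
def update_weights(weights, correct):
    """One boosting step: weights is D_t, correct[i] is True if h_t(x_i) == y_i."""
    # Sum of the weights of misclassified examples (assumed to be in (0, 1)).
    epsilon = sum(w for w, ok in zip(weights, correct) if not ok)
    beta = epsilon / (1 - epsilon)
    # Scale down the weights of correctly classified examples only.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    # Normalize so the new weights sum to 1.
    total = sum(scaled)
    return [w / total for w in scaled]


weights = update_weights([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(weights)
```

With four equally weighted examples and one mistake, the misclassified example ends up carrying half of the total weight, so the next sample is likely to contain it several times.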
RANDOM FOREST ALGORITHM
Repeat k times:
•  Draw a bootstrap sample from the dataset
•  Train a decision tree
Until the tree is maximum size
Choose next leaf node
Select m attributes at random from the p available
Pick the best attribute/split as usual
•  Measure out-of-bag error
•  Evaluate against the samples that were not selected in the bootstrap
•  Provides measures of strength (inverse error rate), correlation
between trees (which increases the forest error rate), and variable
importance
Make a prediction by majority vote among the k trees
Breiman 2001
RANDOM FORESTS:VARIABLE IMPORTANCE
Key Idea: If you scramble the values of a variable and
the accuracy of your tree doesn’t change much, then the
variable isn’t very important
Measure the error increase
Random Forests are more difficult to interpret than
single trees; understanding variable importance helps
•  Ex: Medical applications can’t typically rely on black box solutions
Breiman 2001
GINI COEFFICIENT
Entropy captured an intuition for “impurity”
• We want to choose attributes that split records into pure
classes
The gini coefficient measures inequality
RANDOM FORESTS ON BIG DATA
Easy to parallelize
• Trees are built independently
Handles “small n big p” problems naturally
• A subset of attributes are selected by
importance
SUMMARY:DECISION TREES AND FORESTS
Representation
• Decision Trees
• Sets of decision trees with majority vote
Evaluation
• Accuracy
• Random forests: out-of-bag error
Optimization
• Information Gain or Gini Index to measure impurity and
select best attributes
NEAREST NEIGHBOR
NEAREST NEIGHBORS INTUITION
The last document I saw that mentioned “Falcons” and
“Saints” was about Sports, so I’ll classify this document
as about Sports too
NEAREST NEIGHBOR CHOICES
k nearest neighbors – how do we choose k?
•  Benefits of a small k? Benefits of a large k?
Similarity function
•  Euclidean distance? Cosine similarity?
Large k = bias towards popular labels
Large k = ignores outliers
Small k = fast
Cosine = favors dominant components
Euclidean = difficult to interpret with sparse
data, and high-dimensional data is always
sparse
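Both choices, k and the similarity function, show up as parameters in a small nearest-neighbor sketch. The term-count vectors below are hypothetical, the function names are arbitrary, and the cosine version assumes no vector is all zeros.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norms


def knn_predict(training, query, k, distance=euclidean):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(training, key=lambda pair: distance(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical term counts over the terms (falcons, saints, mars).
training = [
    ((2, 1, 0), "Sports"),
    ((1, 2, 0), "Sports"),
    ((0, 0, 3), "Science"),
    ((0, 1, 2), "Science"),
]
by_euclidean = knn_predict(training, (1, 1, 0), k=3)
by_cosine = knn_predict(training, (1, 1, 0), k=3, distance=cosine_distance)
print(by_euclidean, by_cosine)
```

There is no training step: all of the work happens at prediction time, which is why the cost grows with both the size of the training set and k.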
REGRESSION-BASED METHODS
Trees and rules are designed for categorical
data
• Numerical data can be discretized, but this
introduces another decision
Discriminative models
EVALUATION:MEAN SQUARED ERROR
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (H_i - H'_i)^2$$
$n$: number of instances in the data set
$H_i$: quantity observed in the data set
$H'_i$: quantity predicted by model
ASIDE ON ERRORS AND NORMS
$$\sum_i (H_i - H'_i)$$
errors cancel out; usually not what you want
$$\sum_i |H_i - H'_i|$$
L1-norm. Asserts that 1 error of 7 units is as bad as 7 errors of 1 unit each
$$\sum_i (H_i - H'_i)^2$$
L2-norm (squared). Asserts that 1 error of 7 units is as bad as 49 errors of 1 unit each
$$\frac{1}{n} \sum_i^n (H_i - H'_i)^2$$
not a norm. Average squared error per data point; useful when comparing methods that filter the data differently
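The "1 error of 7 units" comparison can be checked directly. The sketch below computes the four aggregates for made-up observed and predicted values; the function name `error_summaries` and the dictionary keys are hypothetical.

```python
def error_summaries(observed, predicted):
    """Four ways of aggregating the per-point errors H_i - H'_i."""
    errors = [h - p for h, p in zip(observed, predicted)]
    squared = sum(e * e for e in errors)
    return {
        "signed_sum": sum(errors),               # errors can cancel out
        "l1": sum(abs(e) for e in errors),       # sum of absolute errors
        "sum_of_squares": squared,               # sum of squared errors
        "mse": squared / len(errors),            # average squared error
    }


one_big = error_summaries([7, 0, 0, 0, 0, 0, 0], [0] * 7)
seven_small = error_summaries([1] * 7, [0] * 7)
cancelling = error_summaries([1, -1], [0, 0])
print(one_big, seven_small, cancelling)
```

The absolute-error sum rates the two cases equally (7 and 7), while the squared version rates the single large error seven times worse (49 versus 7).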
TERMINOLOGY:NORMS
$$L_p\text{-norm} = \left(\sum_{i=1}^{n} |x_i|^p\right)^{\frac{1}{p}}$$
Norm: any function that assigns a strictly positive
number to every non-zero vector
$$\sqrt{\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2}$$
SLIDES CAN BE FOUND AT:
TEACHINGDATASCIENCE.ORG

怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 

Último (20)

The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 

Bill howe 6_machinelearning_1

  • 1. WHERE WE ARE
      Informatics
        • management, manipulation, integration
        • emphasis on scale, some emphasis on tools
      Analytics
        • statistical estimation and prediction
        • machine learning, data mining
      Visualization
        • communication and presentation
      Bill Howe, UW
  • 2. WHAT IS MACHINE LEARNING?
      “Systems that automatically learn programs from data” [Domingos 2012]
      Teaching a computer about the world [Mark Dredze]
  • 3. WHAT’S THE DIFFERENCE BETWEEN STATISTICS AND MACHINE LEARNING?
      One view (Leo Breiman, Statistical Modeling: The Two Cultures, Statistical Science 16(3), 2001):
        • Statistics: emphasis on stochastic models of nature
        • Machine learning: find a function that predicts y from x; no model of nature implied or needed
  • 4. TOY EXAMPLE [Witten]
      Goal: predict when we play
        • Hypothesis: we only play when it’s sunny? No
        • Hypothesis: we don’t play if it’s rainy and windy? No
  • 5. TERMINOLOGY
      Classification
        • The learned attribute is categorical (“nominal”)
      Regression
        • The learned attribute is numeric
  • 6. TERMINOLOGY
      Supervised Learning (“Training”)
        • We are given examples of inputs and associated outputs
        • We learn the relationship between them
      Unsupervised Learning (sometimes: “Mining”)
        • We are given inputs, but no outputs (unlabeled data)
        • Learn the “latent” labels
        • Ex: clustering, dimension reduction
  • 7. EXAMPLE: DOCUMENT CLASSIFICATION
      “The Falcons trounced the Saints on Sunday” → Sports
      “The Mars Rover discovered organic molecules on Sunday” → Science
      How do we set this up? What are the rows and columns of our decision table?
  • 8. EXAMPLE: CONSTRUCTING THE DOCUMENT MATRIX
      d1: Romeo and Juliet.
      d2: Juliet: O happy dagger!
      d3: Romeo died by dagger.
      d4: “Live free or die”, that’s the New-Hampshire’s motto.
      d5: Did you know, New-Hampshire is in New-England.
      Columns (stemmed terms): dagger, die, new-hampshir, free, happi, live, new-england, motto, romeo, juliet
        d1  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        d2  [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
        d3  [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
        d4  [0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
        d5  [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
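A minimal sketch of how the matrix on this slide could be assembled. It assumes the documents have already been tokenized, stemmed, and stripped of stop words; the `DOCS` dictionary and the `document_matrix` helper are hypothetical names, not part of the slides.

```python
# Vocabulary in the column order used on the slide (terms are already stemmed).
VOCAB = ["dagger", "die", "new-hampshir", "free", "happi",
         "live", "new-england", "motto", "romeo", "juliet"]

# Assumed pre-processing output: each document reduced to its stemmed terms.
# A real pipeline would need a tokenizer, a stop-word list, and a stemmer.
DOCS = {
    "d1": ["romeo", "juliet"],
    "d2": ["juliet", "happi", "dagger"],
    "d3": ["romeo", "die", "dagger"],
    "d4": ["live", "free", "die", "new-hampshir", "motto"],
    "d5": ["new-hampshir", "new-england"],
}

def document_matrix(docs, vocab):
    """Return one binary row per document: 1 if the term occurs, else 0."""
    return {name: [1 if term in terms else 0 for term in vocab]
            for name, terms in docs.items()}

matrix = document_matrix(DOCS, VOCAB)
for name, row in matrix.items():
    print(name, row)
```

Each row is one document and each column one term, which is the "decision table" layout the classification slides rely on.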
  • 9. EXAMPLE: DOCUMENT CLASSIFICATION
      Supervised Learning Problem
        • A human assigns a topic label to each document in a corpus
        • The algorithm learns how to predict the label
      Unsupervised Learning Problem
        • No labels are given
        • Discover groups of similar documents
  • 10. LEARNING = THREE CORE COMPONENTS
      Representation, Evaluation, Optimization
      Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
  • 11. LEARNING = THREE CORE COMPONENTS
      Representation
        • What exactly is your classifier?
        • A hyperplane that separates the two classes?
        • A decision tree?
        • A neural network?
  • 12. LEARNING = THREE CORE COMPONENTS
      Evaluation
        • How do we know if a given classifier is good or bad?
        • # of errors on some test set?
        • Precision and recall?
        • Squared error?
        • Likelihood?
  • 13. LEARNING = THREE CORE COMPONENTS
      Optimization
        • How do you search among all the alternatives?
        • Greedy search?
        • Gradient descent?
  • 17. A VERY NAÏVE CLASSIFIER
      Attributes: pclass, sex, age, sibsp, parch, fare, cabin, embarked
      Does the new data point x* exactly match a previous point xi? If so, assign it to the same class as xi. Otherwise, just guess.
      This is the “rote” classifier.
  • 18. A MINOR IMPROVEMENT
      Does the new data point x* match a set of previous points xi on some specific attribute? If so, take a vote to determine the class.
      Example: If most females survived, then assume every female survives.
      But there are lots of possible rules like this, and an attribute can have more than two values.
        • If most people under 4 years old survive, then assume everyone under 4 survives
        • If most people with 1 sibling survive, then assume everyone with 1 sibling survives
      How do we choose?
  • 19. IF sex = ‘female’ THEN survive = yes ELSE IF sex = ‘male’ THEN survive = no
      Confusion matrix:
          no   yes   <-- classified as
         468   109 | no
          81   233 | yes
      Not bad! $\frac{468 + 233}{468 + 109 + 81 + 233} = 79\%$ correct (and 21% incorrect)
  • 21. IF pclass=‘1’ THEN survive=yes ELSE IF pclass=‘2’ THEN survive=yes ELSE IF pclass=‘3’ THEN survive=no
      Confusion matrix:
          no   yes   <-- classified as
         372   119 | no
         177   223 | yes
      A bit worse: $\frac{372 + 223}{372 + 119 + 223 + 177} = 67\%$ correct (and 33% incorrect)
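The two percentages can be checked with a few lines. This is only an illustrative sketch; the `accuracy` helper is a hypothetical name, and the matrices are copied from the two slides.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

sex_rule = [[468, 109], [81, 233]]      # rule on sex
pclass_rule = [[372, 119], [177, 223]]  # rule on pclass

print(f"sex rule:    {accuracy(sex_rule):.0%}")
print(f"pclass rule: {accuracy(pclass_rule):.0%}")
```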
  • 22. 1-RULE
      For each attribute A:
        For each value V of that attribute, create a rule:
          1. count how often each class appears
          2. find the most frequent class, c
          3. make a rule "if A=V then Class=c"
        Calculate the error rate of the rules for A
      Pick the attribute whose rules produce the lowest error rate
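The procedure above can be sketched as follows, using the 14-record weather table that appears later in the deck. The function name `one_rule` and the data layout are assumptions made for illustration; ties between attributes are broken by keeping the first one encountered, which the slide does not specify.

```python
from collections import Counter, defaultdict

# Weather data: outlook, temperature, humidity, windy, play.
DATA = """\
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes
overcast mild high TRUE yes
rainy cool normal TRUE no
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes
rainy mild normal FALSE yes
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes"""
ROWS = [line.split() for line in DATA.splitlines()]
ATTRS = ["outlook", "temperature", "humidity", "windy"]

def one_rule(rows, attrs):
    """Return (attribute, {value: class}, errors) with the fewest errors."""
    best = None
    for col, attr in enumerate(attrs):
        counts = defaultdict(Counter)          # value -> class counts
        for row in rows:
            counts[row[col]][row[-1]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:   # strict '<' keeps the first on ties
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_rule(ROWS, ATTRS)
print(attr, rules, f"{errors}/{len(ROWS)} errors")
```

On this data outlook and humidity both make 4 errors out of 14, so the tie-breaking choice decides which one is reported.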
  • 24. HOW FAR CAN WE GO?
      IF pclass=‘1’ AND sex=‘female’ THEN survive=yes
      IF pclass=‘2’ AND sex=‘female’ THEN survive=yes
      IF pclass=‘3’ AND sex=‘female’ AND age < 4 THEN survive=yes
      IF pclass=‘3’ AND sex=‘female’ AND age >= 4 THEN survive=no
      IF pclass=‘2’ AND sex=‘male’ THEN survive=no
      IF pclass=‘3’ AND sex=‘male’ THEN survive=no
      IF pclass=‘1’ AND sex=‘male’ AND age < 5 THEN survive=yes
      …
  • 25. SEQUENTIAL COVERING (src: Alvarez)
      Initialize R to the empty set
      for each class C {
        while D is nonempty {
          Construct one rule r that correctly classifies some instances in D that belong to class C and does not incorrectly classify any non-C instances
          Add rule r to ruleset R
          Remove from D all instances correctly classified by r
        }
      }
      return R
  • 26. SEQUENTIAL COVERING: FINDING NEXT RULE FOR CLASS C (src: Alvarez)
      Initialize A as the set of all attributes over D
      while r incorrectly classifies some non-C instances of D {
        write r as ant(r) => C
        for each attribute-value pair (a=v), where a belongs to A and v is a value of a,
          compute the accuracy of the rule ant(r) and (a=v) => C
        let (a*=v*) be the attribute-value pair of maximum accuracy over D; in case of a tie, choose the pair that covers the greatest number of instances of D
        update r by adding (a*=v*) to its antecedent: r = ( ant(r) and (a*=v*) ) => C
        remove the attribute a* from the set A: A = A - {a*}
      }
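The two pseudocode slides can be combined into a small runnable sketch. Two points are assumptions rather than something the slides state: D is reset to the full data set for each class, and the inner loop runs while D still contains instances of class C (non-C instances are never removed, so D would otherwise never become empty). All function names are hypothetical, and the weather table from the deck is used as data.

```python
# Weather data: outlook, temperature, humidity, windy, play.
DATA = """\
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes
overcast mild high TRUE yes
rainy cool normal TRUE no
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes
rainy mild normal FALSE yes
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes"""
ROWS = [tuple(line.split()) for line in DATA.splitlines()]
ATTRS = ["outlook", "temperature", "humidity", "windy"]

def matches(antecedent, row):
    """True if the row satisfies every (attribute index -> value) condition."""
    return all(row[a] == v for a, v in antecedent.items())

def learn_rule(data, target):
    """Grow one rule general-to-specific until it covers only `target` rows."""
    antecedent = {}
    covered = list(data)
    available = set(range(len(ATTRS)))
    while available and any(r[-1] != target for r in covered):
        best_key, best_pair = None, None
        for a in sorted(available):
            for v in sorted({r[a] for r in covered}):
                subset = [r for r in covered if r[a] == v]
                acc = sum(r[-1] == target for r in subset) / len(subset)
                key = (acc, len(subset))       # tie-break on coverage
                if best_key is None or key > best_key:
                    best_key, best_pair = key, (a, v)
        a, v = best_pair
        antecedent[a] = v
        covered = [r for r in covered if r[a] == v]
        available.discard(a)
    return antecedent

def sequential_covering(rows):
    rules = []
    for target in sorted({r[-1] for r in rows}):
        data = list(rows)                      # assumption: restart per class
        while any(r[-1] == target for r in data):
            antecedent = learn_rule(data, target)
            rules.append((antecedent, target))
            # Remove the instances the new rule classifies correctly.
            data = [r for r in data
                    if not (matches(antecedent, r) and r[-1] == target)]
    return rules

rules = sequential_covering(ROWS)
for antecedent, target in rules:
    conditions = " AND ".join(f"{ATTRS[a]}={v}" for a, v in antecedent.items())
    print(f"IF {conditions} THEN play={target}")
```

The sketch relies on the training rows being consistent (no two rows with identical attributes and different classes); with inconsistent data a rule could run out of attributes before it becomes pure.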
  • 27. STRATEGIES FOR LEARNING EACH RULE
      General-to-Specific
        • Start with an empty rule
        • Add constraints to eliminate negative examples
        • Stop when only positives are covered
      Specific-to-General (not shown)
        • Start with a rule that identifies a single random instance
        • Remove constraints in order to cover more positives
        • Stop when further generalization results in covering negatives
  • 28. CONFLICTS
      If more than one rule is triggered
        • Choose the “most specific” rule
        • Use domain knowledge to order rules by priority
  • 29. RECAP
      Representation
        • A set of rules: IF…THEN conditions
      Evaluation
        • coverage: # of data points that satisfy conditions
        • accuracy = # of correct predictions / coverage
      Optimization
        • Build rules by finding conditions that maximize accuracy
      One rule is easy to interpret, but a complex set of rules probably isn’t
  • 30. HOW FAR CAN WE GO?
      We might consider grouping redundant conditions:
        IF pclass=‘1’ THEN
          IF sex=‘female’ THEN survive=yes
          IF sex=‘male’ AND age < 5 THEN survive=yes
        IF pclass=‘2’
          IF sex=‘female’ THEN survive=yes
          IF sex=‘male’ THEN survive=no
        IF pclass=‘3’
          IF sex=‘male’ THEN survive=no
          IF sex=‘female’
            IF age < 4 THEN survive=yes
            IF age >= 4 THEN survive=no
      A decision tree
  • 31. HOW FAR CAN WE GO?
      [Figure: decision tree with nodes sex (female / male), pclass (1 / 2 / 3), and age (<=4 / >4); leaves are “survive” and “not survive”]
      Every path from the root is a rule
  • 32. EXAMPLE: DOCUMENT CLASSIFICATION
      [Figure: decision tree with nodes Includes “Sunday”, Includes “Saints”, Includes “Mars”; yes/no branches; leaves Sports and Science]
  • 33. ASIDE ON ENTROPY
      Consider two sequences of coin flips:
        HHTHTTHHHHTTHTHTHTTTT….
        TTHHTTHTHTTTTHHHTHTTT….
      How much information do we get after flipping each coin once?
      We want some function “Information” that satisfies:
        $\text{Information}_{1\&2}(p_1 p_2) = \text{Information}_1(p_1) + \text{Information}_2(p_2)$
        $I(X) = -\log_2 p_x$
      Expected Information = “Entropy”:
        $H(X) = E(I(X)) = \sum_x p_x I(x) = -\sum_x p_x \log_2 p_x$
  • 34. EXAMPLE: FLIPPING A COIN
      $\text{Entropy} = -\sum_x p_x \log_2 p_x = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$
  • 35. EXAMPLE: ROLLING A DIE
      $p_1 = \frac{1}{6},\ p_2 = \frac{1}{6},\ p_3 = \frac{1}{6},\ \ldots$
      $\text{Entropy} = -\sum_i p_i \log_2 p_i = -6 \times \left(\frac{1}{6} \log_2 \frac{1}{6}\right) \approx 2.58$
  • 36. EXAMPLE: ROLLING A WEIGHTED DIE
      $p_1 = 0.1,\ p_2 = 0.1,\ p_3 = 0.1,\ \ldots,\ p_6 = 0.5$
      $\text{Entropy} = -\sum_i p_i \log_2 p_i = -5 \times (0.1 \log_2 0.1) - 0.5 \log_2 0.5 = 2.16$
      The weighted die is less unpredictable than a fair die (2.16 bits versus 2.58 bits)
  • 37. HOW UNPREDICTABLE IS YOUR DATA?
      342/891 survivors in the Titanic training set:
        $-\left(\frac{342}{891} \log_2 \frac{342}{891} + \frac{549}{891} \log_2 \frac{549}{891}\right) = 0.96$
      Say there were only 50 survivors:
        $-\left(\frac{50}{891} \log_2 \frac{50}{891} + \frac{841}{891} \log_2 \frac{841}{891}\right) = 0.31$
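The entropy examples on these slides can be reproduced with a short helper. This is an illustrative sketch; the `entropy` function name and its list-of-probabilities interface are assumptions.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

coin = entropy([0.5, 0.5])
fair_die = entropy([1 / 6] * 6)
weighted_die = entropy([0.1] * 5 + [0.5])
titanic = entropy([342 / 891, 549 / 891])
titanic_50 = entropy([50 / 891, 841 / 891])

for name, value in [("coin", coin), ("fair die", fair_die),
                    ("weighted die", weighted_die),
                    ("titanic, 342 survivors", titanic),
                    ("titanic, 50 survivors", titanic_50)]:
    print(f"{name}: {value:.2f} bits")
```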
  • 38. BACK TO DECISION TREES Which attribute do we choose at each level? The one with the highest information gain •  The one that reduces the unpredictability the most
  • 39. Before: 14 records, 9 are “yes”
        outlook   temperature  humidity  windy  play
        overcast  cool         normal    TRUE   yes
        overcast  hot          high      FALSE  yes
        overcast  hot          normal    FALSE  yes
        overcast  mild         high      TRUE   yes
        rainy     cool         normal    TRUE   no
        rainy     mild         high      TRUE   no
        rainy     cool         normal    FALSE  yes
        rainy     mild         high      FALSE  yes
        rainy     mild         normal    FALSE  yes
        sunny     hot          high      FALSE  no
        sunny     hot          high      TRUE   no
        sunny     mild         high      FALSE  no
        sunny     cool         normal    FALSE  yes
        sunny     mild         normal    TRUE   yes
      $-\left(\frac{9}{14} \log_2 \frac{9}{14} + \frac{5}{14} \log_2 \frac{5}{14}\right) = 0.94$
      If we choose outlook:
        overcast: 4 records, 4 are “yes”: $-\left(\frac{4}{4} \log_2 \frac{4}{4}\right) = 0$
        rainy: 5 records, 3 are “yes”: $-\left(\frac{3}{5} \log_2 \frac{3}{5} + \frac{2}{5} \log_2 \frac{2}{5}\right) = 0.97$
        sunny: 5 records, 2 are “yes”: $-\left(\frac{2}{5} \log_2 \frac{2}{5} + \frac{3}{5} \log_2 \frac{3}{5}\right) = 0.97$
      Expected new entropy: $\frac{4}{14} \times 0.0 + \frac{5}{14} \times 0.97 + \frac{5}{14} \times 0.97 = 0.69$
  • 40. Same table. Before: 14 records, 9 are “yes”, entropy 0.94
      If we choose temperature:
        cool: 4 records, 3 are “yes” = 0.81
        hot: 4 records, 2 are “yes” = 1.0
        mild: 6 records, 4 are “yes” = 0.92
      Expected new entropy: 0.81(4/14) + 1.0(4/14) + 0.92(6/14) = 0.91
  • 41. Same table. Before: 14 records, 9 are “yes”, entropy 0.94
      If we choose humidity:
        normal: 7 records, 6 are “yes” = 0.59
        high: 7 records, 3 are “yes” = 0.99
      Expected new entropy: 0.59(7/14) + 0.99(7/14) = 0.79
  • 42. Same table. Before: 14 records, 9 are “yes”, entropy 0.94
      If we choose windy:
        FALSE: 8 records, 6 are “yes” = 0.81
        TRUE: 6 records, 3 are “yes” = 1.0
      Expected new entropy: 0.81(8/14) + 1.0(6/14) = 0.89
  • 43. Same table. Before: 14 records, 9 are “yes”, entropy 0.94
      Information gain per attribute:
        outlook      0.94 - 0.69 = 0.25  (highest gain)
        temperature  0.94 - 0.91 = 0.03
        humidity     0.94 - 0.79 = 0.15
        windy        0.94 - 0.89 = 0.05
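As a check on the arithmetic, the gain of every attribute can be recomputed directly from the 14-record table. The sketch below is illustrative; `information_gain` and the data layout are assumed names, not part of the slides.

```python
import math
from collections import Counter

# Weather data: outlook, temperature, humidity, windy, play.
DATA = """\
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes
overcast mild high TRUE yes
rainy cool normal TRUE no
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes
rainy mild normal FALSE yes
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes"""
ROWS = [line.split() for line in DATA.splitlines()]
ATTRS = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy before the split minus the expected entropy after it."""
    before = entropy([r[-1] for r in rows])
    expected = 0.0
    for value in sorted({r[col] for r in rows}):
        subset = [r[-1] for r in rows if r[col] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return before - expected

gains = {attr: information_gain(ROWS, i) for i, attr in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for attr, gain in gains.items():
    print(f"{attr:12s} gain = {gain:.3f}")
print("highest gain:", best)
```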
  • 44. DOCUMENT CLASSIFICATION
      [Figure: decision tree with nodes Includes “Falcons” (yes / no) and Includes “Mars” (yes / no); leaf class distributions shown: 5% “Sports” / 95% “Science”, and 50% “Sports” / 50% “Science” (three times)]
  • 45. BUILDING A DECISION TREE (ID3 ALGORITHM)
      Assume attributes are discrete
        • Discretize continuous attributes
      Choose the attribute with the highest Information Gain
      Create branches for each value of the attribute
      Examples are partitioned based on the selected attribute
      Repeat with the remaining attributes
      Stopping conditions
        • All examples assigned the same label
        • No examples left
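The steps above can be sketched as a short recursive function. This is an illustrative version only: the nested-dictionary tree representation and the names `id3` and `classify` are assumptions, branches are created only for attribute values that actually occur (so the "no examples left" case never arises), and a majority label is used if the attributes run out.

```python
import math
from collections import Counter

ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = """\
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes
overcast mild high TRUE yes
rainy cool normal TRUE no
rainy mild high TRUE no
rainy cool normal FALSE yes
rainy mild high FALSE yes
rainy mild normal FALSE yes
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
sunny cool normal FALSE yes
sunny mild normal TRUE yes"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in DATA.splitlines()]

def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr, target):
    expected = 0.0
    for value in sorted({r[attr] for r in rows}):
        subset = [r for r in rows if r[attr] == value]
        expected += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - expected

def id3(rows, attrs, target):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = Counter(r[target] for r in rows)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value],
                              remaining, target)
                   for value in sorted({r[best] for r in rows})}}

def classify(tree, row):
    """Follow one path from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

tree = id3(ROWS, ATTRS, "play")
print(tree)
```

Because the tree is grown until every leaf is pure, it fits the training rows perfectly, which is exactly the behavior the later overfitting slides warn about.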
  • 46. PROBLEMS
      Expensive to train
      Prone to overfitting
        • Drive to perfection on training data, bad on test data
        • Pruning can help: remove or aggregate subtrees that provide little discriminatory power (C4.5)
  • 47. C4.5 EXTENSIONS
      Continuous Attributes
        outlook   temperature  humidity  windy  play
        overcast  cool         60        TRUE   yes
        overcast  hot          80        FALSE  yes
        overcast  hot          63        FALSE  yes
        overcast  mild         81        TRUE   yes
        rainy     cool         58        TRUE   no
        rainy     mild         90        TRUE   no
        rainy     cool         54        FALSE  yes
        rainy     mild         92        FALSE  yes
        rainy     mild         59        FALSE  yes
        sunny     hot          90        FALSE  no
        sunny     hot          89        TRUE   no
        sunny     mild         90        FALSE  no
        sunny     cool         60        FALSE  yes
        sunny     mild         62        TRUE   yes
  • 48. Consider every possible binary partition; choose the partition with the highest gain
        outlook   temperature  humidity  windy  play
        rainy     mild         54        FALSE  yes
        overcast  hot          58        FALSE  yes
        overcast  cool         59        TRUE   yes
        rainy     cool         60        FALSE  yes
        overcast  mild         60        TRUE   yes
        overcast  hot          62        FALSE  yes
        rainy     mild         63        TRUE   no
        sunny     cool         80        FALSE  yes
        rainy     mild         81        FALSE  yes
        sunny     mild         89        TRUE   yes
        sunny     hot          90        FALSE  no
        rainy     cool         90        TRUE   no
        sunny     hot          90        TRUE   no
        sunny     mild         92        FALSE  no
      Partition after the first 6 records (humidity 54 to 62):
        first 6 records, 6 are “yes”: entropy 0.0
        last 8 records, 3 are “yes” and 5 are “no”: entropy 0.95
        Expected entropy = $\frac{8}{14}(0.95) + \frac{6}{14}(0) = 0.54$
      Partition after the first 10 records (humidity 54 to 89):
        first 10 records, 9 are “yes” and 1 is “no”: entropy 0.47
        last 4 records, 4 are “no”: entropy 0.0
        Expected entropy = $\frac{10}{14}(0.47) + \frac{4}{14}(0) = 0.33$
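A minimal sketch of the "consider every binary partition" step for a numeric attribute, using the (humidity, play) pairs in the sorted order shown on the slide. Placing each candidate threshold at the midpoint between adjacent distinct values is an assumption made here for illustration, as is the function name `best_threshold`.

```python
import math

# (humidity, play) pairs in the sorted order shown on the slide.
SAMPLES = [(54, "yes"), (58, "yes"), (59, "yes"), (60, "yes"), (60, "yes"),
           (62, "yes"), (63, "no"), (80, "yes"), (81, "yes"), (89, "yes"),
           (90, "no"), (90, "no"), (90, "no"), (92, "no")]

def entropy(labels):
    """Entropy in bits of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(samples):
    """Return (threshold, expected entropy) of the best split value <= t."""
    values = sorted({v for v, _ in samples})
    n = len(samples)
    best = None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2               # assumed: midpoint thresholds
        left = [y for v, y in samples if v <= t]
        right = [y for v, y in samples if v > t]
        expected = (len(left) / n * entropy(left)
                    + len(right) / n * entropy(right))
        if best is None or expected < best[1]:
            best = (t, expected)
    return best

threshold, expected = best_threshold(SAMPLES)
print(f"split at humidity <= {threshold}: expected entropy {expected:.2f}")
```

Minimizing the expected entropy after the split is equivalent to maximizing the gain, since the entropy before the split is the same for every candidate.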
  • 49. WHERE WE ARE
      Supervised learning and classification problems
        • Predict a class label based on other attributes
      Rules
        • To start, we guessed simple rules that might explain the data
        • But the relationships are complex, so we need to automate
        • The 1-rule algorithm
        • A sequential cover algorithm for sets of rules with complex conditions
        • But: sets of rules are hard to interpret
      Decision trees
        • Each path from the root is a rule; easy to interpret
        • Use entropy to choose the best attribute at each node
        • Extensions for numeric attributes
        • But: decision trees are prone to overfitting
      1/16/16, Bill Howe, Data Science, Autumn 2012
  • 50. OVERFITTING
      “What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.”
      Pedro Domingos, A Few Useful Things to Know About Machine Learning, CACM 55(10), 2012
  • 51. OVERFITTING
      Low error on training data and high error on test data
  • 52. OVERFITTING
      [Figure] Perhaps not the most useful intuition
  • 53. OVERFITTING
      [Figure: Overfitting.png by Dake; curves for the training set and the test set] A better image to remember
  • 54. Is the model able to generalize? Can it deal with unseen data, or does it overfit the data?
      Test on hold-out data:
        • split the data to be modeled into a training set and a test set
        • train the model on the training set
        • evaluate the model on the training set
        • evaluate the model on the test set
        • the difference between the fit on training data and test data measures the model’s ability to generalize
      slide src: Frank Keller
  • 55. [Figure] Underfitting: high bias, low variance. Overfitting: low bias, high variance. src: Domingos 2012
  • 56. EVALUATION
      Division into training and test sets
        • Fixed: leave out a random N% of the data
        • k-fold Cross-Validation: select k folds without replacement
        • Leave-One-Out Cross-Validation: the special case of k-fold where k equals the number of examples
        • Related: Bootstrap: generate new training sets by sampling with replacement
  • 57. LEAVE-ONE-OUT CROSS-VALIDATION (LOOCV)
      For each training example (xi, yi):
        Train a classifier with all training data except (xi, yi)
        Test the classifier’s accuracy on (xi, yi)
      LOOCV accuracy = average of all n accuracies
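The LOOCV loop can be sketched generically as below. The learner used here, `train_majority`, is a deliberately trivial hypothetical stand-in that always predicts the most frequent training label; any function that takes examples and returns a predictor could be passed instead. The examples pair each outlook value from the weather table with its play label.

```python
from collections import Counter

def train_majority(examples):
    """Hypothetical stand-in learner: always predict the most frequent label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label

def loocv_accuracy(examples, train):
    """Average accuracy over n runs, each holding out a single example."""
    hits = 0
    for i, (x, y) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])  # all data except (x, y)
        hits += model(x) == y
    return hits / len(examples)

# (outlook, play) pairs from the weather table: 9 "yes" and 5 "no".
EXAMPLES = ([("overcast", "yes")] * 4
            + [("rainy", "no")] * 2 + [("rainy", "yes")] * 3
            + [("sunny", "no")] * 3 + [("sunny", "yes")] * 2)

acc = loocv_accuracy(EXAMPLES, train_majority)
print(f"LOOCV accuracy of the majority baseline: {acc:.3f}")
```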
  • 58. ACCURACY
      Confusion Matrix
                    Predicted +   Predicted -
        True +          a             b
        True -          c             d
      $\text{Accuracy} = \frac{a + d}{a + b + c + d}$
  • 59. EVALUATION: ACCURACY ISN’T ALWAYS ENOUGH
      How do you interpret 90% accuracy?
        • You can’t; it depends on the problem
      Need a baseline:
        • Base Rate: accuracy of trivially predicting the most-frequent class
        • Random Rate: accuracy of making a random class assignment; might apply prior knowledge to assign the random distribution
        • Naïve Rate: accuracy of some simple default or pre-existing model. Ex: “All females survived”
  • 60. ACCURACY
      Confusion Matrix
                          Predicted Positive   Predicted Negative
        True Positive             a                    b
        True Negative             c                    d
      $\text{Accuracy} = \frac{a + d}{a + b + c + d}$
      $\text{lift} = \frac{\%\ \text{positives} > \text{threshold}}{\%\ \text{dataset} > \text{threshold}} = \frac{a / (a + b)}{(a + c) / (a + b + c + d)}$
  • 61. ACCURACY
      Same confusion matrix.
      $\text{Accuracy} = \frac{a + d}{a + b + c + d}$
      $\text{precision} = \frac{a}{a + c}$
      $\text{recall} = \frac{a}{a + b}$
  • 62. ROC PLOT
      “Receiver Operating Characteristic”
        • Historical term from WW2
        • Used to measure accuracy of radar operators
      Same confusion matrix.
      $\text{Accuracy} = \frac{a + d}{a + b + c + d}$
      $\text{sensitivity} = \frac{a}{a + b}$
      $1 - \text{specificity} = 1 - \frac{d}{c + d}$
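The measures defined on these slides all derive from the four cells a, b, c, d. The sketch below computes them for a made-up matrix; the `metrics` function name and the example counts are hypothetical.

```python
def metrics(a, b, c, d):
    """a: true + predicted +, b: true + predicted -,
    c: true - predicted +, d: true - predicted -."""
    total = a + b + c + d
    recall = a / (a + b)                      # also called sensitivity
    return {
        "accuracy": (a + d) / total,
        "precision": a / (a + c),
        "recall": recall,
        "one_minus_specificity": 1 - d / (c + d),
        "lift": recall / ((a + c) / total),
    }

# Hypothetical counts, chosen only to exercise the formulas.
m = metrics(a=40, b=10, c=20, d=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

One point on a ROC plot is the pair (1 - specificity, sensitivity) at a given decision threshold; sweeping the threshold traces the curve.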
  • 63. [Figure] src: Cesar Souza
  • 64. THE BOOTSTRAP (Efron 1979)
      Given a dataset of size N
      Draw N samples with replacement to create a new dataset
      Repeat ~1000 times
      You now have ~1000 sample datasets
        • All drawn from the same population
        • You can compute ~1000 sample statistics
        • You can interpret these as repeated experiments, which is exactly what the frequentist perspective calls for
      Very elegant use of computational resources
  • 65. BOOTSTRAP EXAMPLE
        sample              mean
        1  2  3  4  5  6    3.5   (original data)
        4  3  4  2  1  6    3.33
        2  3  6  1  3  5    3.33
        5  1  1  2  3  6    3.00
        2  5  2  6  3  4    3.67
        4  4  4  2  1  3    3.00
        3  4  5  3  2  1    3.00
        1  2  3  6  6  1    3.17
        5  2  3  1  4  5    3.33
  • 66. THE BOOTSTRAP
      Example: Generate 1000 samples and 1000 linear regressions
      You want a 90% confidence interval for the slope? Just take the 5th percentile and the 95th percentile!
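A sketch of the slope example using only the standard library. The data set is synthetic (y = 2x plus Gaussian noise) and every name here is hypothetical; the percentile lookup by sorted index is one simple way to read off the interval, not the only one.

```python
import random

def slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

def bootstrap_slope_interval(points, rounds=1000, seed=0):
    """90% percentile interval for the slope from `rounds` bootstrap samples."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < rounds:
        sample = rng.choices(points, k=len(points))   # N draws with replacement
        if len({x for x, _ in sample}) > 1:            # slope undefined otherwise
            slopes.append(slope(sample))
    slopes.sort()
    return slopes[int(0.05 * rounds)], slopes[int(0.95 * rounds)]

# Synthetic data, assumed for illustration: y = 2x + noise.
data_rng = random.Random(42)
points = [(x, 2 * x + data_rng.gauss(0, 1)) for x in range(30)]

low, high = bootstrap_slope_interval(points)
print(f"90% bootstrap interval for the slope: [{low:.3f}, {high:.3f}]")
```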
  • 67. ENSEMBLES: COMBINING CLASSIFIERS
      Can a set of weak classifiers be combined to derive a strong classifier? Yes! (Freund, Schapire 1995)
      Average results from different models: “Wisdom of the (simulated) crowd”
      Why?
        • Better classification performance than individual classifiers
        • More resilience to noise
      Why not?
        • Time consuming
        • Models become difficult to explain
  • 68. BAGGING
      Draw N bootstrap samples
      Retrain the model on each sample
      Average the results
        • Regression: averaging
        • Classification: majority vote
      Works great for overfit models
        • Decreases variance without changing bias
        • Doesn’t help much with underfit/high-bias models, which are insensitive to the training data
  • 69. BOOSTING
      Instead of selecting data points randomly with the bootstrap, favor the misclassified points
      Initialize the weights
      Repeat:
        Resample with respect to weights
        Retrain the model
        Recompute weights
  • 70. For each step t:
      $D_t(i)$: weights; probability of selecting example i in the sample
      $h_t$: trained classifier at step t, using a sample drawn according to $D_t$
      $x_i, y_i$: ith example, ith label
      $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$: sum of weights for misclassified examples
      $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$: odds of misclassifying
      $D_{t+1}(i) = \beta_t D_t(i)$: adjust weights down for correctly classified examples, and normalize to make sure weights sum to 1
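One round of the weight update can be sketched as below. The function name `reweight` and the boolean `correct` list are assumptions; the sketch also assumes the weighted error is strictly between 0 and 1, which the slide does not discuss.

```python
def reweight(weights, correct):
    """One boosting round: shrink weights of correctly classified examples.

    weights: current distribution over examples (sums to 1)
    correct: correct[i] is True if the classifier got example i right
    """
    eps = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    beta = eps / (1 - eps)                                     # odds of misclassifying
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], eps, beta

# Five examples with uniform weights; only the last one is misclassified.
weights = [0.2] * 5
correct = [True, True, True, True, False]
new_weights, eps, beta = reweight(weights, correct)
print(eps, beta, new_weights)
```

After the update the single misclassified example carries half of the total weight, so the next resampling round is much more likely to include it.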
  • 71. RANDOM FOREST ALGORITHM (Breiman 2001)
      Repeat k times:
        • Draw a bootstrap sample from the dataset
        • Train a decision tree. Until the tree is maximum size:
            Choose the next leaf node
            Select m attributes at random from the p available
            Pick the best attribute/split as usual
        • Measure out-of-bag error
            Evaluate against the samples that were not selected in the bootstrap
            Provides measures of strength (inverse error rate), correlation between trees (which increases the forest error rate), and variable importance
      Make a prediction by majority vote among the k trees
  • 72. RANDOM FORESTS: VARIABLE IMPORTANCE (Breiman 2001)
      Key Idea: If you scramble the values of a variable and the accuracy of your tree doesn’t change much, then the variable isn’t very important
      Measure the error increase
      Random Forests are more difficult to interpret than single trees; understanding variable importance helps
        • Ex: Medical applications can’t typically rely on black box solutions
  • 73. GINI COEFFICIENT
      Entropy captured an intuition for “impurity”
        • We want to choose attributes that split records into pure classes
      The Gini coefficient measures inequality
  • 74. RANDOM FORESTS ON BIG DATA
      Easy to parallelize
        • Trees are built independently
      Handles “small n big p” problems naturally
        • A subset of attributes is selected by importance
  • 75. SUMMARY: DECISION TREES AND FORESTS
      Representation
        • Decision Trees
        • Sets of decision trees with majority vote
      Evaluation
        • Accuracy
        • Random forests: out-of-bag error
      Optimization
        • Information Gain or Gini Index to measure impurity and select best attributes
  • 77. NEAREST NEIGHBORS INTUITION
      The last document I saw that mentioned “Falcons” and “Saints” was about Sports, so I’ll classify this document as about Sports too
  • 78. NEAREST NEIGHBOR CHOICES
      k nearest neighbors: how do we choose k?
        • Benefits of a small k? Benefits of a large k?
        • Large k = bias towards popular labels
        • Large k = ignores outliers
        • Small k = fast
      Similarity function
        • Euclidean distance? Cosine similarity?
        • Cosine = favors dominant components
        • Euclidean = difficult to interpret with sparse data, and high-dimensional data is always sparse
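A minimal k-nearest-neighbors sketch that makes both choices (k and the distance function) explicit parameters. The vocabulary, the term-count vectors, and all function names are hypothetical, chosen only to echo the Sports/Science example.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def knn_predict(query, examples, k, distance=euclidean):
    """Majority label among the k examples closest to the query."""
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical term counts over the vocabulary (falcons, saints, mars, rover).
EXAMPLES = [((1, 1, 0, 0), "Sports"), ((2, 1, 0, 0), "Sports"),
            ((0, 0, 1, 1), "Science"), ((0, 0, 2, 1), "Science"),
            ((0, 1, 1, 0), "Science")]

query = (1, 0, 0, 0)   # a document that mentions only "falcons"
by_euclidean = knn_predict(query, EXAMPLES, k=3)
by_cosine = knn_predict(query, EXAMPLES, k=3, distance=cosine_distance)
print(by_euclidean, by_cosine)
```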
  • 80. REGRESSION-BASED METHODS
      Trees and rules are designed for categorical data
        • Numerical data can be discretized, but this introduces another decision
      Discriminative models
  • 82. EVALUATION: MEAN SQUARED ERROR
      $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (H_i - H'_i)^2$
        $n$: number of instances in the data set
        $H_i$, $H'_i$: quantity observed in the data set and quantity predicted by the model
  • 83. ASIDE ON ERRORS AND NORMS
      $\sum_i (H_i - H'_i)$: not a norm; errors cancel out; usually not what you want
      $\sum_i |H_i - H'_i|$: L1-norm; asserts that 1 error of 7 units is as bad as 7 errors of 1 unit each
      $\sum_i (H_i - H'_i)^2$: L2-norm; asserts that 1 error of 7 units is as bad as 49 errors of 1 unit each
      $\frac{1}{n} \sum_i (H_i - H'_i)^2$: average squared error per data point; useful when comparing methods that filter the data differently
  • 84. TERMINOLOGY: NORMS
      $L_p\text{-norm} = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}}$
      Norm: any function that assigns a strictly positive number to every non-zero vector
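The error measures on the last three slides can be sketched in a few lines. The function names are assumptions; the example values mirror the "1 error of 7 units" comparison.

```python
def lp_norm(xs, p):
    """(sum |x_i|^p)^(1/p)."""
    return sum(abs(x) ** p for x in xs) ** (1 / p)

def sum_squared_error(observed, predicted):
    return sum((h - g) ** 2 for h, g in zip(observed, predicted))

def mean_squared_error(observed, predicted):
    return sum_squared_error(observed, predicted) / len(observed)

# One error of 7 units versus many errors of 1 unit each.
one_big = [7]
seven_small = [1] * 7
forty_nine_small = [1] * 49

print(lp_norm(one_big, 1), lp_norm(seven_small, 1))    # equal under the L1-norm
print(sum_squared_error(one_big, [0]),
      sum_squared_error(forty_nine_small, [0] * 49))   # equal under squared error
print(lp_norm([3, 4], 2))                              # Euclidean length
print(mean_squared_error([1, 2, 3], [1, 2, 6]))
```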
  • 86. SLIDES CAN BE FOUND AT: TEACHINGDATASCIENCE.ORG