Valencian Summer School 2015
Day 2
Lecture 15
Machine Learning - Black Art
Charles Parker (Alston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
2. Machine Learning is Hard!
• By now, you know kind of a lot
• Different types of models
• Feature engineering
• Ways to evaluate
• But you’ll still fail!
• Out in the real world, there’s a
whole bunch of things that will kill
your project
• FYI - A lot of these talks are stolen
3. Join Me!
• On a journey into the Machine Learning House of
Horrors!
• Mwa ha ha!
4. The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
5. Choosing A Hypothesis Space
• By “hypothesis space” we
mean the possible classifiers
you could build with an
algorithm given the data
• This is the choice you make
when you pick a learning
algorithm
• You have one job!
• Is there any way to make it
easier?
6. Theory to The Rescue!
• Probably Approximately Correct
• We’d like our model to have error less than epsilon
• We’d like that to happen with probability at least 1 - delta
• If the error bound is epsilon, the failure probability is delta, the number of
training examples is m, and the hypothesis space size is d:
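The formula that followed on this slide did not survive transcription. For a finite hypothesis space in the realizable PAC setting, the standard sample-complexity bound with these symbols would presumably be:

```latex
m \ge \frac{1}{\epsilon}\left(\ln d + \ln\frac{1}{\delta}\right)
```

That is, with probability at least 1 - delta, any hypothesis consistent with m such training examples has true error below epsilon. Note that the bound grows only with the log of the hypothesis space size d.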
7. The Triple Trade-Off
• There is a triple trade-off between the error, the size of the hypothesis
space, and the amount of training data you have
[Diagram: a triangle with vertices Error, Hypothesis Space, and Training Data]
8. What About Huge Data?
• I’m clever, so I’ll use non-
parametric methods (Decision
tree, k-NN, kernelized SVMs)
• As data scales, curious things
tend to happen
• Simpler models become more
desirable as they’re faster to fit.
• You can increase model
complexity by adding features
(maybe word counts)
• Big data often trumps modeling!
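One way to add features like word counts, as the slide suggests, is the hashing trick: project counts into a fixed-size vector so a simple linear model can absorb arbitrarily many raw features. A minimal sketch (the dimension and function name are illustrative):

```python
import hashlib

def hashed_word_counts(text, dim=32):
    """Map word counts into a fixed-size vector via a stable hash (hashing trick)."""
    vec = [0] * dim
    for word in text.lower().split():
        # md5 gives a stable hash across runs (Python's built-in hash() is salted)
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

v = hashed_word_counts("the cat sat on the mat")
print(sum(v))  # 6 words in total
```

A stable hash is used deliberately: Python's built-in `hash()` is salted per process, which would make the features irreproducible across runs.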
9. The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
10. A Dirty Little Secret About ML Algorithms
• They don’t care what you want
• Decision Trees:
• SVM:
• LR:
• LDA:
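The per-algorithm objectives shown on this slide were lost in transcription; what each algorithm actually optimizes is presumably the standard built-in surrogate for each:

```latex
\text{Decision trees (split criterion):}\quad IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)
\text{SVM (hinge loss):}\quad \ell(y, f(x)) = \max\bigl(0,\; 1 - y\, f(x)\bigr)
\text{LR (log loss):}\quad \ell(y, p) = -\,y \log p - (1 - y)\log(1 - p)
\text{LDA (Fisher criterion):}\quad J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}
```

None of these is your business loss; each algorithm optimizes its own fixed surrogate regardless of what a mistake actually costs you.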
11. Real-world Losses
• Real losses are nothing like this
• False positive in disease
diagnosis
• False positive in face
detection
• False positive in thumbprint
identification
• Some aren’t even instance-
based
• Path dependencies
• Game playing
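For instance-based losses like the diagnosis examples above, one standard fix is to keep the classifier's probability output but move the decision threshold to match the real costs. A minimal sketch, with made-up cost numbers:

```python
def best_threshold(fp_cost, fn_cost):
    """Bayes-optimal threshold on p(positive): predict positive when the
    expected cost of a miss exceeds the expected cost of a false alarm."""
    # predict positive iff p * fn_cost > (1 - p) * fp_cost
    return fp_cost / (fp_cost + fn_cost)

# Symmetric costs recover the usual 0.5 cutoff
print(best_threshold(1, 1))    # 0.5
# A missed disease diagnosis costs 50x a false alarm: flag at much lower p
print(best_threshold(1, 50))   # ~0.0196
```

The same classifier becomes dramatically more conservative or aggressive purely by moving the cutoff, without retraining anything.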
12. Specializing Your Loss
• One solution is to let developers apply their own loss
• This is the approach of SVMlight, which has been around for a while:
http://svmlight.joachims.org/
• Losses other than Mutual Information can be plugged into the appropriate
place in splitting code
• Models trained via gradient descent can obviously be customized (Python’s
Theano is interesting for this)
• For multi-example loss functions, there is SEARN in Vowpal Wabbit:
https://github.com/JohnLangford/vowpal_wabbit
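The gradient-descent route mentioned above (Theano then; any autodiff library now) can be sketched without a framework at all. Below is plain-NumPy logistic regression where errors on positive examples are up-weighted, a crude stand-in for a custom loss; the dataset and weighting are invented for illustration:

```python
import numpy as np

def fit_weighted_logreg(X, y, fn_weight=5.0, lr=0.1, steps=500):
    """Logistic regression by gradient descent on a weighted log loss:
    errors on positives (potential false negatives) cost fn_weight times more."""
    w = np.zeros(X.shape[1])
    sample_w = np.where(y == 1, fn_weight, 1.0)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (sample_w * (p - y)) / len(y)  # gradient of weighted log loss
        w -= lr * grad
    return w

# Tiny hypothetical dataset: one feature plus a bias column
X = np.array([[x, 1.0] for x in [-2.0, -1.0, 0.0, 1.0, 2.0]])
y = np.array([0, 0, 1, 0, 1])
w_plain = fit_weighted_logreg(X, y, fn_weight=1.0)
w_fn = fit_weighted_logreg(X, y, fn_weight=5.0)

# Up-weighting positives pushes predicted probabilities upward, trading
# false positives for fewer false negatives
p_plain = 1.0 / (1.0 + np.exp(-X @ w_plain))
p_fn = 1.0 / (1.0 + np.exp(-X @ w_fn))
```

The only thing that changed between the two fits is the loss; the hypothesis space and the data are identical.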
13. Other Hackery
• Sometimes, the solution is just to hack
around the actual prediction
• Have several levels (cascade) of
classifiers in e.g., medical diagnosis, text
recognition
• Apply logic to explicitly avoid high loss
cases (e.g., when buying/selling equities)
• Changing the problem setting
• Will you be doing queries? Use ranking
or metric learning
• If you think “I want to do crazy thing x with
classifiers”, chances are it’s already been
done and you can read about it
14. The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
15. When Validation Attacks!
• Cross validation
• n-Fold - Hold out one fold for
testing, train on n - 1 folds
• Great way to measure
performance, right?
• It’s all about information leakage
• via instances
• via features
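Leakage via instances can be as simple as duplicated rows. In the sketch below the labels are arbitrary noise, yet a memorizing 1-NN looks perfect under a split that puts the two copies of each point on opposite sides (all data invented):

```python
# Ten points, each duplicated; the labels are arbitrary noise, so there is no signal.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
points = [(float(i), labels[i]) for i in range(10)]
data = points + points  # exact duplicates of every instance

def nn_predict(train, x):
    """1-NN: return the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Leaky split: the two copies of each point end up on opposite sides
train, test = data[:10], data[10:]
leaky_acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)

# Duplicate-aware split: both copies of a point stay on the same side
train = [p for p in data if p[0] < 5]
test = [p for p in data if p[0] >= 5]
grouped_acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)

print(leaky_acc)    # 1.0 -- the "model" just looked itself up
print(grouped_acc)  # 0.4 -- about chance, which is all noise labels allow
```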
16. Case Study #1: Law of Averages
• Estimate sporting event
outcomes
• Use previous games to
estimate points scored for
each team (via windowing
transform)
• Choose winner based on
predicted score
• What if you’re off by one on
the window?
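The off-by-one trap above looks like this in code: a "previous games" average whose window slips one step and silently includes the very game being predicted (the scores are made up):

```python
scores = [21, 14, 35, 28, 17, 24]  # hypothetical points scored per game

def past_average(scores, i, window=3):
    """Average of the `window` games strictly before game i -- safe."""
    past = scores[max(0, i - window):i]
    return sum(past) / len(past) if past else None

def leaky_average(scores, i, window=3):
    """Off-by-one: the window includes game i itself, leaking the target."""
    seen = scores[max(0, i - window + 1):i + 1]
    return sum(seen) / len(seen)

i = 4
print(past_average(scores, i))   # (14 + 35 + 28) / 3
print(leaky_average(scores, i))  # includes scores[4] = 17, the value being predicted
```

In cross-validation the leaky version looks like a brilliant feature; in production, where the current game's score does not exist yet, it evaporates.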
17. Case Study #2: Photo Dating
• Take scanned photos from
30 different users (on
average 200 per user) and
create a model to assign a
date taken (plus or minus
five years)
• Perform 10-fold cross-
validation
• Accuracy is 85%. Can
you trust it?
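You cannot trust it: photos from one user share a camera, film stock, and scanner, so folds that scatter a user's photos across train and test leak via features. The fix is to assign folds by user, not by photo; a sketch using the counts from the slide (the function name is invented):

```python
def group_folds(groups, n_folds=10):
    """Assign each example to a fold by its group (user), so no user spans folds."""
    unique = sorted(set(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique)}
    return [fold_of_group[g] for g in groups]

# Hypothetical: 30 users, ~200 photos each -> one user id per photo
users = [u for u in range(30) for _ in range(200)]
folds = group_folds(users, n_folds=10)

# Every photo from a given user lands in the same fold
assert all(len({f for u2, f in zip(users, folds) if u2 == u}) == 1 for u in set(users))
```

Accuracy under user-grouped folds is usually well below the photo-level number, and it is the honest estimate of performance on a new user.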
18. Case Study #3: Moments In Time
• You have a buy/sell
opportunity every five
seconds
• The signals you use to
evaluate the opportunity
are aggregates of market
activity over the last five
minutes
• How careful must you be
with cross-validation?
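Very careful: any test instance whose five-minute feature window overlaps the training period leaks. With one opportunity every five seconds, five minutes is 60 instances, so a forward-chaining split needs a 60-instance embargo gap; a sketch (sizes invented):

```python
def forward_splits(n, n_folds=4, gap=60):
    """Forward-chaining splits with an embargo: train on the past, skip `gap`
    instances whose feature windows overlap the train period, then test."""
    fold = n // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold
        test_start = train_end + gap  # embargo: drop overlapping windows
        yield list(range(train_end)), list(range(test_start, min(test_start + fold, n)))

for train, test in forward_splits(600):
    # the gap guarantees no test feature window reaches back into the train period
    assert min(test) - max(train) > 60
```

Ordinary shuffled n-fold cross-validation would put near-identical five-minute aggregates on both sides of every split and report fantasy numbers.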
19. The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
20. Breaking Machine Learning
• You’ve got this great model!
Congratulations!
• Suddenly it stops working.
Why?
• You might be in a domain
that tends to change over
time (document classification,
sales prediction)
• You might be experiencing
adverse selection (market
data predictions, spam)
21. Concept Drift
• This is called non-stationarity in either the prior or the conditional
distributions
• Could be a couple of different things
• If the prior p(input) is changing, it’s covariate shift
• If the conditional p(output | input) is changing, it’s concept drift
• No rule that it can’t be both
• http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
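The distinction can be made concrete with a toy discrete distribution (all numbers invented). Under covariate shift only p(input) moves, yet the marginal rate of positives your model sees still changes:

```python
# A toy binary world: p(x) is the prior over inputs, p(y=1|x) the conditional.
p_x_before = {0: 0.5, 1: 0.5}
p_y1_given_x = {0: 0.1, 1: 0.9}

# Covariate shift: only p(x) moves; the input-output rule is unchanged.
p_x_after = {0: 0.9, 1: 0.1}

def p_y1(p_x, p_y1_x):
    """Marginal p(y=1) = sum over x of p(x) * p(y=1|x)."""
    return sum(p_x[x] * p_y1_x[x] for x in p_x)

print(p_y1(p_x_before, p_y1_given_x))  # 0.5
print(p_y1(p_x_after, p_y1_given_x))   # 0.18 -- the positive rate drifted even
                                       # though the concept p(y|x) never changed
```

Concept drift would instead change `p_y1_given_x` itself, invalidating the learned rule rather than just the mix of inputs it sees.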
22. Take Action!
• First: Look for symptoms
• Getting a lot of errors
• The distribution of predicted values changes
• Drift detection algorithms (that I know about) have the same basic flavor:
• Buffer some data in memory
• If recent data is “different” from past data, retrain, update or give up
• Some resources - A nice survey paper and an open source package:
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
http://moa.cms.waikato.ac.nz/
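The buffer-and-compare flavor described above fits in a few lines. This toy detector compares the mean of a recent window against a reference window and alarms past a threshold; the real algorithms in the survey and in MOA (DDM, ADWIN, Page-Hinkley) are considerably more principled, and every constant here is invented:

```python
from collections import deque

class MeanDriftDetector:
    """Toy drift check: alarm when the recent window's mean drifts more than
    `threshold` from the reference window's mean."""
    def __init__(self, window=50, threshold=1.0):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x):
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(x)  # still filling the baseline
            return False
        self.recent.append(x)
        if len(self.recent) < self.recent.maxlen:
            return False
        ref_mean = sum(self.reference) / len(self.reference)
        rec_mean = sum(self.recent) / len(self.recent)
        return abs(rec_mean - ref_mean) > self.threshold  # time to retrain?

det = MeanDriftDetector(window=50, threshold=1.0)
stable = [0.0, 1.0] * 100   # mean 0.5, no drift
shifted = [2.0, 3.0] * 50   # mean 2.5 after the shift
alarms = [det.update(x) for x in stable + shifted]
print(any(alarms[:200]))    # False -- stable stream
print(any(alarms[200:]))    # True  -- drift detected after the shift
```

On an alarm you would retrain, update, or give up, exactly as the slide says; the hard part in practice is picking windows and thresholds that catch real drift without firing on noise.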
23. The Benefits of Archeology
• Why might you train on old
data, even if it’s not relevant?
• Verification of your research
process
• You’d have done the same
thing last year. Did it work?
• Gives you a good idea of
how much drift you should
expect
24. The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space
• The Perils of The Poorly Picked Loss Function
• The Creeping Creature Called Cross Validation
• The Dread of the Drifting Domain
• The Repugnance of Reliance on Research Results
25. Publish or Perish
• Academic papers are a certain type of
result
• Show incremental improvement in
accuracy or generality
• Prove something about your
algorithm
• The latter is hard to come by as results
get more realistic
• Machine learning proofs assume data
is “i.i.d.”, but this is obviously false
• Real world data sucks, and dealing
with that significantly changes the
dataset
26. Usefulness of Results
• Theoretical Results
• Most of the time bounds do not apply (error, sample
complexity, convergence)
• Sometimes they don’t even make any sense
• Beware of putting too much faith in a single person or single
person’s work
• Usefulness generally occurs only in the aggregate
• And sometimes not even then (researchers are people, too)
27. Machine Learning Isn’t About Machine Learning
• Why doesn’t it work like in the
paper?
• Remember, the paper is carefully
controlled in a way your application
is not.
• Performance is rarely driven by
machine learning
• It’s driven by cameras and
microphones
• It’s driven by Mario Draghi
28. So, Don’t Bother With It?
• Of course not!
• What’s the alternative?
• “All our science, measured
against reality, is primitive
and childlike — and yet it is
the most precious thing we
have” - Albert Einstein
• Use academia as your
starting point, but don’t
think it will get you out of
the work
29. Some Themes
• The major points of this talk:
• Machine learning is hard to get right
• The algorithms won’t do what you want
• Good results are probably spurious
• Even if they aren’t, it won’t last
• Reading the research won’t help
• Wait, no!
• Have an attitude of skeptical optimism (or optimal skepticism?)