1) Bayesian parameter optimization uses machine learning to predict the performance of as-yet-untrained models from the parameters and results of previously trained models, so that the parameter space can be searched efficiently.
2) However, there are still important issues like choosing the right evaluation metric, ensuring no information leakage between training and test data, and selecting the appropriate model for the problem and available data.
3) Automated model selection requires sufficient data to make accurate predictions; with insufficient data, the process can fail.
Parameter Optimization
• There are lots of algorithms and lots of parameters
• We don’t have time to try even close to everything
• If only we had a way to make a prediction . . .
Did I hear someone say Machine Learning?
The Allure of ML
“Why don’t we just use machine learning to predict the quality of a set of modeling parameters before we train a model on them?”
— Every first-year ML grad student ever
Bayesian Parameter Optimization
• The performance of an ML algorithm (with its associated parameters) is data-dependent
• So: learn from your previous attempts
• Train a model, then evaluate it
• After you’ve done a number of evaluations, learn a regression model to predict the performance of future, as-yet-untrained models
• Use this regression model to choose a promising set of “next models” to evaluate, as sketched below
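As a concrete illustration, here is a minimal Python sketch of that loop with scikit-learn. It follows the recipe above literally (a surrogate regression model predicts the score of untrained configurations); a full Bayesian optimizer would also model uncertainty and use an acquisition function. The dataset, the two tuned parameters, and the search budget are all made-up assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)

def evaluate(params):
    # Train a model with these parameters and return its evaluation score.
    model = RandomForestClassifier(n_estimators=int(params[0]),
                                   max_depth=int(params[1]), random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

def random_params(n):
    return [(rng.integers(10, 200), rng.integers(2, 20)) for _ in range(n)]

# Step 1: evaluate a handful of random configurations to seed the search.
tried = random_params(8)
scores = [evaluate(p) for p in tried]

for _ in range(5):
    # Step 2: learn a regression model: parameters -> observed performance.
    surrogate = RandomForestRegressor(random_state=0).fit(tried, scores)
    # Step 3: predict the performance of many as-yet-untrained candidates...
    candidates = random_params(200)
    predicted = surrogate.predict(candidates)
    # Step 4: ...and spend real training time only on the most promising one.
    best = candidates[int(np.argmax(predicted))]
    tried.append(best)
    scores.append(evaluate(best))

print("best parameters:", tried[int(np.argmax(scores))])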
Wow, Magic!
• So all of my problems are solved, right?
NO NO NO
• First, you’re selecting a model based on held-out data, so you have to have enough data to do an accurate model selection
• Second, there are still important remaining issues and possible ways to screw up
Driving The Search
• So how do we measure the performance of each model, to figure out what to do next?
• If we choose the wrong metric, we’ll get models that are the best at something that we don’t really care about
• But there are so many metrics! How do we choose the right one?
• Hmmmm, all of this sounds awfully familiar . . .
Flashback #1
Accuracy = (TP + TN) / Total
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Remember, only 1 in 1000 have the disease
• A silly model which always predicts “well” is 99.9% accurate
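A tiny Python sketch of that trap, using scikit-learn’s accuracy_score and the slide’s 1-in-1000 prevalence:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.zeros(1000, dtype=int)  # 0 = "well"; only patient 0 is sick
y_true[0] = 1
y_pred = np.zeros(1000, dtype=int)  # the silly model: always predict "well"

print(accuracy_score(y_true, y_pred))  # 0.999, yet it never finds the disease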
A Metric Selection Flowchart
[Flowchart: pick a metric by answering, in turn: Will you bother about threshold setting? Is your dataset imbalanced? Is yours a “ranking” problem? Do you care more about the top-ranked instances? The yes/no answers lead to one of: Accuracy, F-measure, Phi coefficient, Max. Phi, KS-statistic, area under the ROC or PR curve, Kendall’s Tau, or Spearman’s Rho.]
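For reference, here is a sketch of how the flowchart’s metrics map to common library calls (scikit-learn and SciPy). The labels and scores are synthetic stand-ins, and Max. Phi (the Phi coefficient maximized over a threshold sweep) is omitted for brevity:

import numpy as np
from scipy.stats import kendalltau, ks_2samp, spearmanr
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)            # binary labels
y_score = 0.3 * y_true + 0.7 * rng.random(200)   # noisy ranking scores
y_pred = (y_score > 0.5).astype(int)             # scores after thresholding

# Threshold-dependent metrics:
print("Accuracy:       ", accuracy_score(y_true, y_pred))
print("F-measure:      ", f1_score(y_true, y_pred))
print("Phi coefficient:", matthews_corrcoef(y_true, y_pred))

# Threshold-free / ranking metrics:
print("AUC ROC:        ", roc_auc_score(y_true, y_score))
print("AUC PR:         ", average_precision_score(y_true, y_score))
print("KS-statistic:   ", ks_2samp(y_score[y_true == 1], y_score[y_true == 0])[0])
print("Kendall's Tau:  ", kendalltau(y_true, y_score)[0])
print("Spearman's Rho: ", spearmanr(y_true, y_score)[0])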
Is Cross-Validation Right for You?
• Cross-validation is a good tool some of the time
• Other times, it is disastrously bad (overly optimistic)
• This is why BigML offers the option of a specific holdout set
• Should you use it?
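To make the two options concrete, a minimal scikit-learn sketch; the dataset and model are placeholders, and the fixed split is only analogous in spirit to BigML’s holdout option:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

# Option 1: 5-fold cross-validation -- every row gets tested exactly once.
print("CV scores:", cross_val_score(model, X, y, cv=5))

# Option 2: a single, explicitly held-out test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("Holdout score:", model.fit(X_tr, y_tr).score(X_te, y_te))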
Flashback #2
• Okay, so I’m not testing on the training data, so I’m good, right? NO NO NO
• You also have to worry about information leakage between training and test data
• What is this? Let’s try to predict the daily closing price of the stock market
• What happens if you hold out 10 random days from your dataset?
• What if you hold out the last 10 days?
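Here is a sketch of why the answer differs, using a synthetic random-walk “price” series and a nearest-neighbor model: with 10 random days held out, each test day’s neighbors are in the training set and leak the answer; with the last 10 days held out, the model must actually extrapolate. All data and model choices here are illustrative assumptions.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
days = np.arange(365).reshape(-1, 1)
prices = 100 + np.cumsum(rng.normal(size=365))  # synthetic random-walk closes

def holdout_error(test_idx):
    # Mean absolute error on the held-out days, training on the rest.
    train_idx = np.setdiff1d(np.arange(365), test_idx)
    model = KNeighborsRegressor(n_neighbors=3)
    model.fit(days[train_idx], prices[train_idx])
    return np.abs(model.predict(days[test_idx]) - prices[test_idx]).mean()

random_10 = rng.choice(365, size=10, replace=False)  # neighbors leak answers
last_10 = np.arange(355, 365)                        # honest extrapolation

print("10 random days held out:", holdout_error(random_10))  # looks great
print("last 10 days held out:  ", holdout_error(last_10))    # usually far worse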
Flashback #3
• This is common when you have time-distributed data, but it can also happen in other instances:
• Let’s say we have a dataset of 10,000 pictures from 20 people, each labeled with the year in which it was taken
• We want to predict the year from the image
• What happens if we hold out random data?
• Solution: Hold out users instead
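A minimal sketch of that fix with scikit-learn’s GroupShuffleSplit; the “image features” are synthetic stand-ins, and only the splitting logic matters:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
person = rng.integers(0, 20, size=10_000)  # which of the 20 people took it
X = rng.random((10_000, 5))                # stand-in for image features
y = 2000 + person % 10                     # year -- correlated with person!

# Hold out entire people, so no person appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=person))
assert set(person[train_idx]).isdisjoint(person[test_idx])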
Again, Take Care!
• These situations are very common in all cases where data comes in groups (days, users, etc.)
• The solution is to hold out whole groups of data
• It’s possible that it isn’t a problem in your dataset, but when in doubt, try both!
Which Model is Best?
• Performance isn’t the only issue!
• Retraining: Will the amount of data you have be different in the future?
• Fit stability: How confident must you be that the model’s behavior is invariant to small data changes?
• Prediction speed: The difference can be orders of magnitude
Flashback #4
Amount of data required      Linear models < trees, ensembles < deep learning
Potential to overfit         Linear models < ensembles < trees, deep learning
Time to train and predict    Linear models, trees < ensembles < deep learning
Representational power       Linear models < trees < ensembles < deep learning
• How much data do you have?
• How fast do you need things to go? (see the timing sketch below)
• How much performance do you really need?
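A rough Python sketch of the time row (scikit-learn; the dataset is synthetic and the absolute numbers will vary by machine, but the ordering is the point):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    t0 = time.perf_counter()
    model.fit(X, y)
    t1 = time.perf_counter()
    model.predict(X)
    t2 = time.perf_counter()
    print(f"{type(model).__name__}: fit {t1 - t0:.2f}s, predict {t2 - t1:.2f}s")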
Summary
• We can do some simple tricks and use machine learning to help us search through the space of possible models
• Even with this, however, there is still lots of work left for the domain expert
• Automated model selection relies on data. If you don’t have enough, it will go poorly!