In machine learning, model selection is more nuanced than simply picking the 'right' or 'wrong' algorithm. The workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can home in on quality models more effectively than exhaustive search.
This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
5. Try them all!
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms

# Candidate models to compare
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# 12-fold cross-validation (X and y are assumed to be loaded already)
kfold = ms.KFold(n_splits=12)

# Best mean cross-validated score across all candidates
max([
    ms.cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
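A minimal sketch of what a k-fold split does, using only the standard library. The helper name `kfold_indices` is hypothetical and is not part of Scikit-Learn; it only illustrates that every sample lands in exactly one test fold.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous k-fold splits."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each model is fit once per fold, so comparing five classifiers with 12 folds already means 60 fits.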
7. ● Search is difficult, particularly in high-dimensional space.
● Even with clever optimization techniques, there is no guarantee of a solution.
● As the search space gets larger, the amount of time increases exponentially.
Except ...
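To make the growth of the search space concrete, here is a small sketch that counts the combinations an exhaustive grid search would have to evaluate. The grid values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical hyperparameter grid for a support vector classifier
grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.001, 0.01, 0.1, 1],
}


def grid_size(param_grid):
    """Number of combinations an exhaustive search must evaluate."""
    size = 1
    for values in param_grid.values():
        size *= len(values)
    return size


combos = list(product(*grid.values()))
n_folds = 12
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")

# Adding one more hyperparameter with 5 candidate values multiplies the work by 5
bigger = dict(grid, degree=[1, 2, 3, 4, 5])
print(grid_size(bigger), "combinations after adding one parameter")
```

The count is the product of the list lengths, so each additional hyperparameter multiplies the number of fits rather than adding to it.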
10. Enter Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual diagnostics, and visual steering.
● Not a replacement for other visualization libraries.
11. Scikit-Learn Estimator Interface
# Import the estimator
from sklearn.linear_model import Lasso
# Instantiate the estimator
model = Lasso()
# Fit the data to the estimator
model.fit(X_train, y_train)
# Generate a prediction
model.predict(X_test)
12. Yellowbrick Visualizer Interface
# Import the model and visualizer
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError
# Instantiate the visualizer
visualizer = PredictionError(Lasso())
# Fit
visualizer.fit(X_train, y_train)
# Score and visualize
visualizer.score(X_test, y_test)
visualizer.poof()
14. Is this room occupied?
● Given labelled data with amount of light, heat, humidity, etc.
● Which features are most predictive?
● How hard is it going to be to distinguish the empty rooms from the occupied ones?
19. Why isn’t my model predictive?
● What to do with a low-accuracy classifier?
● Check for class imbalance.
● Visual cue that we might try stratified sampling, oversampling, or getting more data.
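A quick way to check for class imbalance before reaching for a plot is simply to count the labels. The sketch below uses hypothetical labels for the room occupancy problem and prints a rough text bar chart.

```python
from collections import Counter

# Hypothetical labels for the room occupancy problem
y = ["empty"] * 90 + ["occupied"] * 10

counts = Counter(y)
total = sum(counts.values())
for label, count in counts.most_common():
    bar = "#" * (50 * count // total)
    print(f"{label:>10} {count:4d} {bar}")

# Ratio of the largest class to the smallest
imbalance_ratio = max(counts.values()) / min(counts.values())

# Accuracy of a "model" that always predicts the majority class
majority_baseline = max(counts.values()) / total
print("imbalance ratio:", imbalance_ratio, "majority baseline:", majority_baseline)
```

With these labels, a classifier that always answers "empty" is right 90% of the time, which is why accuracy alone can hide an imbalance problem.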
23. What’s the right k?
● How many clusters do you see?
● How do you pick an initial value for k in k-means clustering?
● How do you know whether to increase or decrease k?
● Is partitive clustering the right choice?
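One common heuristic for these questions is the elbow method: fit k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below is a tiny one-dimensional Lloyd's algorithm with a deterministic start, written only for illustration; the function name `kmeans_1d` and the data are hypothetical, and this is not how Yellowbrick or Scikit-Learn implement it.

```python
def kmeans_1d(points, k, n_iter=20):
    """Tiny 1-D Lloyd's algorithm; returns (centers, inertia)."""
    data = sorted(points)
    # Deterministic start: k centers spread evenly over the sorted data
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each center to its cluster mean; keep the old center if the cluster is empty
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    # Within-cluster sum of squared distances to the nearest center
    inertia = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, inertia


# Hypothetical data with three visible groups
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
inertias = {k: kmeans_1d(points, k)[1] for k in range(1, 5)}
for k, value in inertias.items():
    print(f"k={k}  inertia={value:.2f}")
```

On this data the inertia falls steeply up to k=3 and barely changes afterwards, which is the "elbow" a visual diagnostic would make obvious.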
33. Estimators
The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data.
class Estimator(object):

    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred
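As a concrete instance of the interface above, here is a minimal sketch of a toy estimator in plain Python. The class `MeanRegressor` is hypothetical and does not exist in Scikit-Learn; it only shows the pattern of learning state in `fit` and using it in `predict`.

```python
class MeanRegressor:
    """Hypothetical toy estimator: always predicts the training mean."""

    def fit(self, X, y=None):
        # Learned state gets a trailing underscore, following Scikit-Learn's convention
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # One prediction per input row
        return [self.mean_ for _ in X]


X_train, y_train = [[1], [2], [3]], [10.0, 20.0, 30.0]
model = MeanRegressor().fit(X_train, y_train)
predictions = model.predict([[4], [5]])
print(predictions)
```

Returning `self` from `fit` is what allows the chained `MeanRegressor().fit(...)` call.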
34. Transformers
Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X’.
class Transformer(Estimator):

    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime
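A minimal sketch of the same idea in plain Python: a toy transformer that learns the range of one column in `fit` and rescales it in `transform`. The class `MinMaxTransformer` is hypothetical and written only for illustration.

```python
class MinMaxTransformer:
    """Hypothetical toy transformer: rescales one column of values to [0, 1]."""

    def fit(self, X, y=None):
        # Learn the range of the data
        self.min_ = min(X)
        self.max_ = max(X)
        return self

    def transform(self, X):
        # Fall back to a span of 1.0 so constant input does not divide by zero
        span = (self.max_ - self.min_) or 1.0
        return [(x - self.min_) / span for x in X]


X = [2.0, 4.0, 6.0, 10.0]
X_prime = MinMaxTransformer().fit(X).transform(X)
print(X_prime)
```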
35. Visualizers
A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.
import matplotlib.pyplot as plt

class Visualizer(Estimator):

    def draw(self):
        """
        Draw the data
        """
        self.ax.plot()

    def finalize(self):
        """
        Complete the figure
        """
        self.ax.set_title()

    def poof(self):
        """
        Show the figure
        """
        plt.show()
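To show the draw/finalize/poof lifecycle without requiring matplotlib, here is a text-only sketch. The class `TextClassBalance` is hypothetical, and the wiring (fit triggers draw, poof calls finalize before showing) is an assumption made for this illustration rather than a statement of Yellowbrick's exact internals.

```python
from collections import Counter


class TextClassBalance:
    """Hypothetical text-only visualizer: draws class counts as bars of '#'."""

    def fit(self, X, y=None):
        # Learn from data like any estimator, then draw
        self.counts_ = Counter(y)
        self.draw()
        return self

    def draw(self):
        # Build the body of the "figure" from the fitted state
        self.lines_ = [f"{label}: {'#' * n}" for label, n in sorted(self.counts_.items())]

    def finalize(self):
        # Add the title as a finishing touch
        self.lines_.insert(0, "Class balance")

    def poof(self):
        # Finalize, then show the figure
        self.finalize()
        print("\n".join(self.lines_))


viz = TextClassBalance().fit(X=None, y=["a", "b", "a", "a"])
viz.poof()
```

Keeping the drawing work out of the constructor means a visualizer can sit in a workflow and cost little until it is actually fitted and shown.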
Good morning and thank you for coming!
Stickers!
Today I’d like to tell you about an open source Python project I’ve been working on for the last two years.
It’s called Yellowbrick, and it’s a tool you can use to steer the machine learning process using visual transformers.
================================================== TALK DESC/REFERENCE==================================================
Ben and Rebecca are active contributors to the open source community, and this talk is based on Yellowbrick, a project they've been building together with the team at District Data Labs, an open source collaborative in Washington, DC. They are also co-authors of the forthcoming O'Reilly book, Applied Text Analysis with Python, and organizers for Data Community DC, a not-for-profit organization of 9 meetups that organizes free monthly events and lectures for the local data community in Washington, DC.
Visualizing the Model Selection Process
1. The Model Selection Process
a. YB is part of the model selection workflow
b. Feature Selection
c. Algorithm Selection (Model Family/Form ⟶ Fitted Model)
d. Hyperparameter Tuning
e. Selection with Cross-Validation
f. Try all the models!
g. Hyperparameter Tuning
h. Model selection is Search
3. Main Point: Visual Steering Improves Model Selection:
a. Leads to better models (better F1/R2 Scores)
b. Gets to good models more quickly
c. Produces more insight about models
d. Research People: prove this.
4. The Scikit-Learn API
a. YB _extends_ the Scikit-Learn API
b. Tricky because: functional/procedural matplotlib + OO
b. Estimators
c. Transformers
d. Visualizers
e. Pipelines
f. Visual Workflows and Pipelines
5. Primary YB Requirements
a. Fits into the sklearn API and workflow
b. Implements matplotlib calls efficiently
c. Low overhead if `poof()` is not called
d. Just flexible enough for users to adapt to their data
e. Easy to add new visualizers
f. Looks as good as Seaborn
g. Minimal dependencies: sklearn, numpy, matplotlib -- c'est tout!
h. Primary Requirement: Implements Visual Steering
6. The Visualizer
a. Current class hierarchy
b. The Visualizer Interface
c. Axes management
d. `draw()`, `poof()`, and `finalize()`
e. A simple example of a visualizer
f. Feature Visualizers
g. Score Visualizers
h. Multi-Estimator Visualizers
9. Visual Pipelines
a. Multiple Visualizations
b. Interactivity
10. Optimizing Visualization!
11. Utilities
a. Style management: Sequences, Palettes, Color Codes
b. Best Fit Lines
c. Type Detection (is_classifier, etc.)
d. Exceptions
12. Documentation and Sphinx
13. Contributing
a. Git/Branch Management
b. Issues, milestones, and labels
c. Waffle Board
13. User Testing and Research
https://flic.kr/p/5EsRYP
It all started in an ivory tower somewhere.
Once upon a time, you had to go to school to learn machine learning.
And you’d spend years and ultimately specialize in a particular model family
Bayesian methods
Gaussian processes
support vector machines
...and get very very good at tuning those models.
Then everything got really easy really fast.
In 2010, Scikit-Learn was publicly released (by INRIA).
And it started to grow very very quickly.
Suddenly there were dozens of models at the fingertips of any Python programmer.
Scikit-Learn has so much going for it
Models, models, models
Also transformers, vectorization tools, sample datasets
Pipelines
But arguably the best part is the consistent API
You can plug the same data into nearly any of the models and it will work!
So it becomes just an optimization problem
You can loop through all the models and just pick the one with the best score.
It almost doesn’t matter which model you use, or why or how it’s working.
http://derekerdman.com/ilovemilkshakes/january2009/DO_IT/haircuts_try_them_all.jpg
Slide showing MLaaS
Except that the hyperparameter space is large, and grid search is slow if you don’t already know what you’re looking for:
Alpha/penalty for regularization
Kernel function in support vector machine
Leaves or depth of a decision tree
Neighbors used in a nearest neighbor classifier
Clusters in a k-means clustering
https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Blindfold_%28PSF%29.png/1200px-Blindfold_%28PSF%29.png
The answer is visual steering.
Leverage human pattern recognition and engagement.
Get to better models, faster, with more insight.
Overview first; zoom and filter; details on demand.
Enhance model selection and evaluation
Goal is not to replace other libraries.
Model visualization not data visualization
YB extends the Scikit-Learn API
This is a high dimensional visualization problem.
For classification, we potentially want to see whether the classes are well separated
Are some features more predictive than others?
Unit circle requires normalization
Points are dropped in the middle and pulled out toward the outer edges by their feature values
Ordering of features matters
Automatic ordering with optimization to minimize the amount of overlap
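The radial projection described in these notes can be sketched in a few lines: each feature gets an anchor on the unit circle, and each point is the average of the anchors weighted by its normalized feature values. The function name `radviz_point` is hypothetical, and this is an illustration of the general technique rather than Yellowbrick's implementation.

```python
import math


def radviz_point(row):
    """Project one row of features, already scaled to [0, 1], into the unit circle."""
    d = len(row)
    # One anchor per feature, evenly spaced around the unit circle
    anchors = [
        (math.cos(2 * math.pi * j / d), math.sin(2 * math.pi * j / d))
        for j in range(d)
    ]
    total = sum(row)
    if total == 0:
        return (0.0, 0.0)  # no pull from any feature: stay in the middle
    x = sum(v * ax for v, (ax, _) in zip(row, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(row, anchors)) / total
    return (x, y)


center = radviz_point([1.0, 1.0, 1.0, 1.0])  # equal pull from every feature
edge = radviz_point([1.0, 0.0, 0.0, 0.0])    # one dominant feature
print(center, edge)
```

Because anchors are placed in feature order, swapping two features changes where points land, which is why the ordering matters.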
Feature engineering requires understanding of the relationships between features
Visualize pairwise relationships
Heatmap
Pearson shows us strong correlations => potential collinearity
Covariance helps us understand the sequence of relationships
Other correlation metrics - we just used the ones that were implemented in numpy, but looking to expand
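For reference, a minimal sketch of the Pearson coefficient behind such a heatmap, computed pairwise in plain Python. The feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature columns
features = {
    "light": [1.0, 2.0, 3.0, 4.0],
    "heat": [2.0, 4.0, 6.0, 8.0],      # perfectly correlated with light
    "humidity": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with light
}
names = list(features)
for a in names:
    row = [round(pearson(features[a], features[b]), 2) for b in names]
    print(a.ljust(10), row)
```

Values near +1 or -1 off the diagonal are the cells that flag potential collinearity.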
Frequency distribution - top 50 tokens
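The counts behind a token frequency plot can be sketched with the standard library. The corpus here is a hypothetical toy example, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Hypothetical toy corpus
corpus = [
    "the yellow brick road",
    "follow the yellow brick road",
    "the road goes ever on",
]

# Count lowercase, whitespace-separated tokens across all documents
freq = Counter(token for doc in corpus for token in doc.lower().split())
top = freq.most_common(3)
print(top)
```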
Stochastic Neighbor Embedding, decomposition then projection into 2D scatterplot
Visual part-of-speech tagging
YB extends the Scikit-Learn API
Can we quickly detect class imbalance issues
Stratified sampling, oversampling, getting more data -- tricks will help us balance
But supervised methods can mask training data; simple graphs like these give us an at-a-glance reference
As this gets into multiclass problems, domination could be harder to see and really affect modeling
Receiver operating characteristics/area under curve
Class imbalance
Classification report heatmap - Quickly identify strengths & weaknesses of model - F1 vs Type I & Type II error
Visual confusion matrix - misclassification on a per-class basis
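The numbers shown in a classification report or confusion matrix can be derived from pairs of true and predicted labels. The sketch below uses hypothetical labels and a hypothetical helper `report`.

```python
from collections import Counter

# Hypothetical true and predicted labels
y_true = ["occupied", "empty", "empty", "occupied", "empty", "occupied"]
y_pred = ["occupied", "empty", "occupied", "occupied", "empty", "empty"]

# Confusion matrix as counts of (true, predicted) pairs
confusion = Counter(zip(y_true, y_pred))


def report(label):
    """Return (precision, recall, f1) for one class."""
    tp = confusion[(label, label)]
    # False positives: predicted as label but actually something else (Type I)
    fp = sum(n for (t, p), n in confusion.items() if p == label and t != label)
    # False negatives: actually label but predicted as something else (Type II)
    fn = sum(n for (t, p), n in confusion.items() if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


for label in sorted(set(y_true)):
    print(label, [round(v, 2) for v in report(label)])
```

Type I errors pull precision down and Type II errors pull recall down; F1 summarizes both.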
Where/why/how is the model performing well/badly
Prediction error plot - 45 degree line is theoretical perfect
Residuals plot - 0 line is no error
See change in amount of variance between x and y, or along x axis => heteroscedasticity
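The quantities behind both plots are simple to compute by hand. This sketch uses hypothetical values and takes a residual to be observed minus predicted; some tools use the opposite sign.

```python
# Hypothetical observed and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]

# Residual = observed - predicted; zero means no error for that point
residuals = [t - p for t, p in zip(y_true, y_pred)]

# Mean squared error
mse = sum(r ** 2 for r in residuals) / len(residuals)

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("residuals:", residuals)
print("MSE:", mse, "R2:", r2)
```

In this toy data the residuals grow in magnitude as the target grows, which is the kind of pattern a residuals plot exposes.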
YB extends the Scikit-Learn API
Which regularization technique to use? Lasso/L1, Ridge/L2, or ElasticNet L1+L2
Regularization uses a Norm to penalize complexity at a rate, alpha
The higher the alpha, the more the regularization.
Complexity minimization reduces variance in the model, but increases bias
Goal: select the smallest alpha such that error is minimized
Visualize the tradeoff
Surprising to see: higher alpha increasing error, alpha jumping around, etc.
Embed R2, MSE, etc. into the graph - quick reference
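A minimal sketch of how alpha controls an L2 (Ridge) penalty, using the closed-form solution for a single feature with no intercept. The data and the helper names `ridge_slope` and `mse` are hypothetical, and the error shown is training error only; selecting alpha in practice would use held-out or cross-validated error.

```python
# Hypothetical data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]


def ridge_slope(x, y, alpha):
    """Closed-form L2-penalized slope for one feature, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def mse(x, y, w):
    """Mean squared error of the line y = w * x."""
    return sum((w * a - b) ** 2 for a, b in zip(x, y)) / len(x)


alphas = [0.0, 1.0, 10.0, 100.0]
slopes = [ridge_slope(x, y, a) for a in alphas]
errors = [mse(x, y, w) for w in slopes]

for a, w, e in zip(alphas, slopes, errors):
    print(f"alpha={a:6.1f}  slope={w:.3f}  train MSE={e:.3f}")
```

The slope shrinks toward zero as alpha grows and the training error rises with it, which is the tradeoff the alpha selection plot is meant to visualize.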
Want to contribute?
Here’s some information about how the Yellowbrick API works
YB extends the Scikit-Learn API
Where to hook in?
Estimators learn from data
Have a fit and predict method
Transformers transform data
Have a transform method
Visualizers can be estimators or transformers
Generally have a draw, finalize, and poof method
Contribute!
Needs: new features, testers, blog posts