Visualizing the model selection process

Visualizing the Model Selection Process
An Introduction to Yellowbrick

Benjamin Bengfort
Twitter: twitter.com/bbengfort
LinkedIn: linkedin.com/in/bbengfort
Github: github.com/bbengfort
Email: bbengfort@districtdatalabs.com

Rebecca Bilbro
Twitter: twitter.com/rebeccabilbro
LinkedIn: linkedin.com/in/rebeccabilbro
Github: github.com/rebeccabilbro
Email: rbilbro@districtdatalabs.com

The Model Selection Triple
Arun Kumar http://bit.ly/2abVNrI
- Feature Analysis
- Algorithm Selection
- Hyperparameter Tuning

Feature Analysis
- Define a bounded, high-dimensional feature space that can be effectively modeled.
- Transform and manipulate the space to make modeling easier.
- Extract a feature representation of each instance in the space.

Algorithm Selection
- Select a model family that best/correctly defines the relationship between the variables of interest.
- Define a model form that specifies exactly how features interact to make a prediction.
- Train a fitted model by optimizing internal parameters to the data.

Hyperparameter Tuning
- Evaluate how the model form is interacting with the feature space.
- Identify hyperparameters (i.e. parameters that affect training or the prior, not prediction).
- Tune the fitting and prediction process by modifying these parameters.
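To make the distinction between hyperparameters and internal parameters concrete, here is a minimal sketch using scikit-learn's Lasso. The synthetic data and the two alpha values are assumptions chosen for illustration, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data, invented for illustration: y depends on the first feature only.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: it is chosen before training and controls
# how strongly the coefficients are penalized.
weak = Lasso(alpha=0.01).fit(X, y)
strong = Lasso(alpha=5.0).fit(X, y)

# coef_ holds the internal parameters that fit() learns from the data.
print("alpha=0.01 coefficients:", weak.coef_)
print("alpha=5.0 coefficients:", strong.coef_)
```

With the weak penalty the first coefficient stays close to the true value of 2, while the strong penalty shrinks every coefficient to zero. The same model form behaves very differently depending on a value that is never learned from the data.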

Automatic Model Selection Criteria
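from sklearn.model_selection import KFold

# sklearn.cross_validation was removed; KFold now lives in model_selection
kfolds = KFold(n_splits=12)

# Fit on each training split and score on the held-out split
scores = [
    model.fit(X[train], y[train]).score(X[test], y[test])
    for train, test in kfolds.split(X)
]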
Scores: F1 (classification), R2 (regression)
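As a self-contained illustration of these two criteria, the sketch below cross-validates one classifier with F1 and one regressor with R2. The synthetic datasets and the choice of LogisticRegression and Ridge are assumptions made for the example only.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# 12 folds, as on the slide; shuffling is an added assumption
kfold = KFold(n_splits=12, shuffle=True, random_state=0)

# Classification: score each fold with F1
X_clf, y_clf = make_classification(n_samples=240, n_features=8, random_state=0)
f1_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_clf, y_clf, cv=kfold, scoring="f1"
)

# Regression: score each fold with R2
X_reg, y_reg = make_regression(
    n_samples=240, n_features=8, noise=10.0, random_state=0
)
r2_scores = cross_val_score(Ridge(), X_reg, y_reg, cv=kfold, scoring="r2")

print("mean F1:", f1_scores.mean())
print("mean R2:", r2_scores.mean())
```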

Try Them All!
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold, cross_val_score

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

kfold = KFold(n_splits=12)

# Highest mean cross-validated score across all candidate models
max([
    cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])

Gridsearch
Search is difficult, particularly in a high-dimensional space.
Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution.
As more dimensions are added to the search space, the time needed to search it increases exponentially.
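The growth of the search space can be seen directly by counting grid candidates. The following is a minimal sketch with scikit-learn's ParameterGrid and GridSearchCV; the parameter grid, the SVC model, and the synthetic data are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Hypothetical grid: every added hyperparameter multiplies the candidates
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

# 4 * 2 * 2 combinations, each fitted once per cross-validation fold
n_candidates = len(ParameterGrid(param_grid))

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print("candidates:", n_candidates)
print("best params:", search.best_params_)
```

Here three hyperparameters already produce 16 candidates and 48 model fits with 3 folds. Adding a fourth hyperparameter with four values would bring this to 64 candidates.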

Visual Steering
- Interventions or guidance by human pattern recognition.
- Humans engage the modeling process through visualization.
- Overview first, zoom and filter, details on demand.

Visualizing Model Selection with Yellowbrick

What is Yellowbrick?
- Model Visualization
- Data Visualization for Machine Learning
- Visual Diagnostics
- Visual Steering
Not a replacement for visualization libraries.

Enhance the Model Selection Process

Yellowbrick Extends the Scikit-Learn API

Estimators
The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data.
class Estimator(object):

    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred
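A concrete instance of this interface might look like the sketch below. MeanRegressor is a hypothetical class written for illustration in plain Python; it is not part of Scikit-Learn.

```python
class MeanRegressor(object):
    """Hypothetical estimator: always predicts the mean of the training targets."""

    def fit(self, X, y=None):
        if y is None:
            raise ValueError("MeanRegressor requires target values y")
        # Learned state is stored on self, with a trailing underscore
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        # One prediction per input row
        return [self.mean_ for _ in X]


X_train = [[1.0], [2.0], [3.0]]
y_train = [10.0, 20.0, 30.0]

# fit() returns self, so calls can be chained
model = MeanRegressor().fit(X_train, y_train)
predictions = model.predict([[4.0], [5.0]])
print(predictions)
```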

Transformers
Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X’.
class Transformer(Estimator):

    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime
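As an illustration of the fit/transform pattern, here is a minimal sketch of a transformer that centers each column on its mean. MeanCenterer is a hypothetical name and the class is written in plain Python for clarity; it is not a Scikit-Learn class.

```python
class MeanCenterer(object):
    """Hypothetical transformer: subtracts each column's training mean."""

    def fit(self, X, y=None):
        n_rows = float(len(X))
        # zip(*X) iterates over columns rather than rows
        self.means_ = [sum(column) / n_rows for column in zip(*X)]
        return self

    def transform(self, X):
        # Build a new dataset X' and leave the input untouched
        return [
            [value - mean for value, mean in zip(row, self.means_)]
            for row in X
        ]


X = [[1.0, 10.0], [3.0, 30.0]]
X_prime = MeanCenterer().fit(X).transform(X)
print(X_prime)
```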

Visualizers
A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.
import matplotlib.pyplot as plt

class Visualizer(Estimator):

    def draw(self):
        """
        Draw the data
        """
        self.ax.plot()

    def finalize(self):
        """
        Complete the figure
        """
        self.ax.set_title()

    def poof(self):
        """
        Show the figure
        """
        plt.show()
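The draw/finalize/poof split can be sketched with a small stand-alone class. ScatterVisualizer below is hypothetical and is not part of Yellowbrick; it assumes matplotlib is installed and uses the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is required
import matplotlib.pyplot as plt


class ScatterVisualizer(object):
    """Hypothetical visualizer: scatter plot of the first two features."""

    def __init__(self, ax=None):
        # Draw on the given axes, or create a new figure and axes
        self.ax = ax if ax is not None else plt.subplots()[1]

    def fit(self, X, y=None):
        # "Learning" here only means remembering the first two columns
        self.xs_ = [row[0] for row in X]
        self.ys_ = [row[1] for row in X]
        self.draw()
        return self

    def draw(self):
        self.ax.scatter(self.xs_, self.ys_)

    def finalize(self):
        self.ax.set_title("Feature 0 vs. Feature 1")
        self.ax.set_xlabel("feature 0")
        self.ax.set_ylabel("feature 1")

    def poof(self):
        self.finalize()
        plt.show()


viz = ScatterVisualizer().fit([[1, 2], [2, 4], [3, 6]])
viz.finalize()  # in an interactive session, viz.poof() would also display it
```

Keeping draw() separate from finalize() means several visualizers can draw onto the same axes before titles and labels are applied once at the end.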

Scikit-Learn Estimator Interface
# Import the estimator
from sklearn.linear_model import Lasso
# Instantiate the estimator
model = Lasso()
# Fit the data to the estimator
model.fit(X_train, y_train)
# Generate a prediction
model.predict(X_test)

Yellowbrick Visualizer Interface
# Import the model and visualizer
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError
# Instantiate the visualizer
visualizer = PredictionError(Lasso())
# Fit
visualizer.fit(X_train, y_train)
# Score and visualize
visualizer.score(X_test, y_test)
visualizer.poof()

Parallel Coordinates for 5 Features

Pearson Ranking of 23 Features

Covariance Ranking of 23 Features

Frequency Distribution of Top 50 Tokens

TSNE Projection of the Baleen Corpus

Class Balance for RandomForestClassifier

GaussianNB Classification Report

Logistic Regression ConfusionMatrix

Silhouette Plot of K-Means Clusterer
