Escaping the Black Box
Yellowbrick: A Visual API for
Machine Learning
Once upon a time ...
And then things got ...
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]
# 12-fold cross-validation (model_selection.KFold takes n_splits)
kfold = ms.KFold(n_splits=12)
# Best mean cross-validated score across all candidate models
max([
    ms.cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Try them all!
Data
Automated
Build
Insight
Machine Learning as a Service
● Search is difficult, particularly in
high dimensional space.
● Even with clever optimization
techniques, there is no guarantee of
a solution.
● As the search space gets larger, the
amount of time increases
exponentially.
Except ...
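To make the growth of the search space concrete, here is a minimal sketch that counts how many model fits an exhaustive grid search would need. The hyperparameter names and values are illustrative assumptions, not taken from the talk, and only the standard library is used.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only.
grid = {
    "C": [0.01, 0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.001, 0.01, 0.1],
}


def grid_size(space):
    """Number of distinct hyperparameter combinations in a grid."""
    return len(list(product(*space.values())))


def total_fits(space, n_folds):
    """Model fits needed to cross-validate every combination."""
    return grid_size(space) * n_folds


print(grid_size(grid))       # 4 * 3 * 3 combinations
print(total_fits(grid, 12))  # each combination is fit once per fold

# Adding one more 5-valued hyperparameter multiplies the work by 5.
grid["degree"] = [1, 2, 3, 4, 5]
print(total_fits(grid, 12))
```

Every additional hyperparameter multiplies the number of combinations, which is why exhaustive search gets slow so quickly.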
Hyperparameter
Tuning
Algorithm
Selection
Feature
Analysis
The “Model Selection Triple”
Arun Kumar, et al.
Solution: Visual Steering
Enter Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection
process.
● Tools for feature visualization,
visual diagnostics, and visual
steering.
● Not a replacement for other
visualization libraries.
# Import the estimator
from sklearn.linear_model import Lasso
# Instantiate the estimator
model = Lasso()
# Fit the data to the estimator
model.fit(X_train, y_train)
# Generate a prediction
model.predict(X_test)
Scikit-Learn Estimator Interface
# Import the model and visualizer
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError
# Instantiate the visualizer
visualizer = PredictionError(Lasso())
# Fit
visualizer.fit(X_train, y_train)
# Score and visualize
visualizer.score(X_test, y_test)
visualizer.poof()  # renamed to show() in Yellowbrick 1.0 and later
Yellowbrick Visualizer Interface
How do I select the right
features?
Is this room occupied?
● Given labelled data
with amount of light,
heat, humidity, etc.
● Which features are
most predictive?
● How hard is it going to
be to distinguish the
empty rooms from the
occupied ones?
Yellowbrick Feature Visualizers
Use radviz or parallel
coordinates to look for
class separability
Yellowbrick Feature Visualizers
Use Rank2D for
pairwise feature
analysis
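As a rough illustration of what pairwise feature analysis computes, the sketch below ranks every pair of features by Pearson correlation in plain Python. The function names and the sensor readings are made up for the example; this is not Yellowbrick's Rank2D implementation, which draws the result as a heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither column is constant


def rank_pairs(columns):
    """Return {(name_a, name_b): r} for every unordered feature pair."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Made-up readings, loosely modelled on the room occupancy example.
features = {
    "light": [10.0, 20.0, 30.0, 40.0],
    "temperature": [20.0, 21.0, 22.0, 23.0],
    "humidity": [40.0, 30.0, 20.0, 10.0],
}

for pair, r in rank_pairs(features).items():
    print(pair, round(r, 3))
```

Pairs with a coefficient near +1 or -1 are candidates for collinearity and may not both be needed in the feature set.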
…for text, too!
Visualize top tokens,
document distribution
& part-of-speech
tagging
What’s the best model?
Why isn’t my model predictive?
● What to do with a
low-accuracy classifier?
● Check for class
imbalance.
● Visual cue that we
might try stratified
sampling,
oversampling, or
getting more data.
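A class balance check can be sketched in a few lines of plain Python. The helper names and the 90/10 label split below are assumptions made for illustration; Yellowbrick presents the same information as a bar chart.

```python
from collections import Counter


def class_balance(labels):
    """Share of each class among the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up labels: far more empty rooms than occupied ones.
y = ["empty"] * 90 + ["occupied"] * 10

for cls, share in class_balance(y).items():
    bar = "#" * round(share * 40)  # crude text bar chart
    print(f"{cls:>8} {bar} {share:.0%}")

print("imbalance ratio:", imbalance_ratio(y))
```

A large ratio is the cue to consider stratified sampling, oversampling, or collecting more data before trusting an accuracy score.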
Yellowbrick Score Visualizers
Visualize
accuracy
and begin to
diagnose
problems
Visualize the
distribution of error to
diagnose
heteroscedasticity
Yellowbrick Score Visualizers
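What a residuals plot lets you see by eye can be approximated numerically. The sketch below is a crude check, under the assumption that comparing residual variance between the lower and upper half of the predictions is informative enough; the function names and data are hypothetical, and it is not a formal statistical test.

```python
from statistics import pvariance


def residuals(y_true, y_pred):
    """Observed value minus predicted value, per sample."""
    return [t - p for t, p in zip(y_true, y_pred)]


def spread_ratio(y_true, y_pred):
    """Residual variance in the upper half of predictions divided by
    residual variance in the lower half. Values far from 1 hint at
    heteroscedasticity. Assumes the lower-half variance is non-zero."""
    pairs = sorted(zip(y_pred, residuals(y_true, y_pred)))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pvariance(high) / pvariance(low)


# Made-up data: errors grow as the predicted value grows.
y_pred = [1, 2, 3, 4, 5, 6, 7, 8]
y_true = [1.1, 1.9, 3.1, 3.9, 6.0, 5.0, 8.0, 7.0]

print(spread_ratio(y_true, y_pred))
```

A ratio near 1 suggests the error spread is roughly constant; the made-up data above gives a much larger value because the errors fan out.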
How do I tune this model?
What’s the right k?
● How many clusters do
you see?
● How do you pick an
initial value for k in
k-means clustering?
● How do you know
whether to increase or
decrease k?
● Is partitive clustering
the right choice?
Hyperparameter Tuning
higher silhouette scores
mean denser, more
separate clusters
The elbow
shows the
best value
of k…
Or
suggests a
different
algorithm
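To show what a silhouette score measures, here is a minimal plain-Python sketch for one-dimensional points with known cluster labels. The function name and data are illustrative; it assumes at least two clusters and no duplicate points, and it is not the implementation Yellowbrick or Scikit-Learn uses.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points and their cluster labels."""

    def mean_dist(p, others):
        return sum(abs(p - q) for q in others) / len(others)

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [
            q for j, (q, l) in enumerate(zip(points, labels))
            if l == lab and j != i
        ]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = mean_dist(p, same)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean_dist(p, [q for q, l in zip(points, labels) if l == other])
            for other in set(labels)
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
good = [0, 0, 0, 1, 1, 1]  # matches the two visible groups
bad = [0, 1, 0, 1, 0, 1]   # mixes the groups

print(round(silhouette(points, good), 3))
print(round(silhouette(points, bad), 3))
```

The labelling that matches the visible groups scores close to 1, while the mixed labelling scores much lower, which is the signal used when comparing candidate values of k.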
Hyperparameter Tuning
Should I use
Lasso, Ridge, or
ElasticNet?
Is regularization
even working?
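For reference, the three regularized regression objectives can be written as follows. This is an addition in the common notation, not something shown on the slides: the scaling constants follow the conventions in Scikit-Learn's documentation and differ between textbooks, $w$ is the coefficient vector, $n$ the number of samples, $\alpha$ the regularization strength, and $\rho$ the L1 mixing ratio.

```latex
\begin{aligned}
\text{Ridge (L2):} \quad
  & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso (L1):} \quad
  & \min_{w} \; \tfrac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet (L1+L2):} \quad
  & \min_{w} \; \tfrac{1}{2n} \lVert y - Xw \rVert_2^2
    + \alpha \rho \lVert w \rVert_1
    + \tfrac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

In all three, a larger $\alpha$ puts more weight on the penalty term, so plotting error against $\alpha$ shows whether regularization is helping.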
What’s next?
Some ideas...
How does token frequency
change over time, in
relation to other tokens?
Some ideas...
View the corpus
hierarchically, after
clustering
Some ideas...
Perform token
network
analysis
Some ideas...
Plot token
co-occurrence
Some ideas...
Get an x-ray
of the text
Do you have an idea?
The main API implemented by
Scikit-Learn is that of the
estimator. An estimator is any
object that learns from data;
it may be a classification,
regression or clustering
algorithm, or a transformer that
extracts/filters useful features
from raw data.
class Estimator(object):
    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self
    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred
Estimators
Transformers are special
cases of Estimators --
instead of making
predictions, they transform
the input dataset X to a new
dataset X’.
class Transformer(Estimator):
    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime
Transformers
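The two interfaces above are skeletons, so here is a small concrete sketch of each in plain Python. The class names `MeanRegressor` and `UnitScaler` are hypothetical and exist only for this example; they follow the fit/predict and fit/transform conventions but do not inherit from any Scikit-Learn base class.

```python
class MeanRegressor:
    """Estimator: learns the mean of y and predicts it for every row."""

    def fit(self, X, y=None):
        # Learned state is stored on self, with a trailing underscore.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class UnitScaler:
    """Transformer: rescales a list of numbers to the range [0, 1]."""

    def fit(self, X, y=None):
        self.min_, self.max_ = min(X), max(X)
        return self

    def transform(self, X):
        span = self.max_ - self.min_  # assumes the fitted data is not constant
        return [(x - self.min_) / span for x in X]


model = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(model.predict([[5], [6]]))

scaler = UnitScaler().fit([2, 4, 6])
print(scaler.transform([2, 4, 6]))
```

Because `fit` returns `self`, construction, fitting, and prediction can be chained, which is the same property that lets these objects slot into pipelines.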
A visualizer is an estimator that
produces visualizations based
on data rather than new
datasets or predictions.
Visualizers are intended to
work in concert with
Transformers and Estimators to
shed light onto the modeling
process.
import matplotlib.pyplot as plt

class Visualizer(Estimator):
    def draw(self):
        """
        Draw the data
        """
        self.ax.plot()
    def finalize(self):
        """
        Complete the figure
        """
        self.ax.set_title()
    def poof(self):
        """
        Show the figure
        """
        plt.show()
Visualizers
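To show how the draw, finalize, and poof steps fit around a wrapped model, here is a toy sketch that renders residuals as text instead of a matplotlib figure, so it runs without dependencies. `ConstantModel` and `TextResiduals` are hypothetical names, and returning mean squared error from `score` is an assumption of this example, not a description of Yellowbrick's class hierarchy.

```python
class ConstantModel:
    """Stand-in estimator: always predicts the mean of the training targets."""

    def fit(self, X, y=None):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class TextResiduals:
    """Toy visualizer: wraps an estimator and 'draws' residuals as text."""

    def __init__(self, model):
        self.model = model
        self.lines = []
        self.title = None

    def fit(self, X, y=None):
        # Delegate learning to the wrapped estimator.
        self.model.fit(X, y)
        return self

    def score(self, X, y):
        # Compute residuals, draw them, and return mean squared error.
        preds = self.model.predict(X)
        self.residuals_ = [t - p for t, p in zip(y, preds)]
        self.draw()
        return sum(r * r for r in self.residuals_) / len(self.residuals_)

    def draw(self):
        # One text row per residual; bar length is proportional to its size.
        self.lines = [
            f"{r:+.2f} " + "*" * round(abs(r) * 10) for r in self.residuals_
        ]

    def finalize(self):
        self.title = f"Residuals for {type(self.model).__name__}"

    def poof(self):
        self.finalize()
        print(self.title)
        print("\n".join(self.lines))


X, y = [[0], [1]], [1.0, 3.0]
viz = TextResiduals(ConstantModel()).fit(X, y)
print(viz.score(X, y))
viz.poof()
```

The calling sequence (fit, score, poof) mirrors the visualizer interface shown earlier; only the rendering backend differs.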
Thank you!
Twitter: twitter.com/rebeccabilbro
Github: github.com/rebeccabilbro
Email:
rebecca.bilbro@bytecubed.com

Editor's Notes

  1. Good morning and thank you for coming! Stickers! Today I’d like to tell you about an open source Python project I’ve been working on for the last two years. It’s called Yellowbrick, and it’s a tool you can use to steer the machine learning process using visual transformers.

     TALK DESC/REFERENCE

     In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively home in on quality models than exhaustive search.

     This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflow. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.

     Ben and Rebecca are active contributors to the open source community, and this talk is based on Yellowbrick, a project they've been building together with the team at District Data Labs, an open source collaborative in Washington, DC. They are also co-authors of the forthcoming O'Reilly book, Applied Text Analysis with Python and organizers for Data Community DC - a not-for-profit organization of 9 meetups that organizes free monthly events and lectures for the local data community in Washington, DC.

     Visualizing the Model Selection Process
     1. The Model Selection Process
        b. YB is part of the model selection workflow
        b. Feature Selection
        c. Algorithm Selection (Model Family/Form ⟶ Fitted Model)
        d. Hyperparameter Tuning
        e. Selection with Cross-Validation
        f. Try all the models!
        g. Hyperparameter Tuning
        h. Model selection is Search
     3. Main Point: Visual Steering Improves Model Selection:
        a. Leads to better models (better F1/R2 Scores)
        b. Gets to good models more quickly
        c. Produces more insight about models
        d. Research People: prove this.
     4. The Scikit-Learn API
        a. YB _extends_ the Scikit-Learn API
        b. Tricky because: functional/procedural matplotlib + OO
        b. Estimators
        c. Transformers
        d. Visualizers
        e. Pipelines
        f. Visual Workflows and Pipelines
     5. Primary YB Requirements
        a. Fits into the sklearn API and workflow
        b. Implements matplotlib calls efficiently
        c. Low overhead if `poof()` is not called
        d. Just flexible enough for users to adapt to their data
        e. Easy to add new visualizers
        f. Looks as good as Seaborn
        g. Minimal dependencies: sklearn, numpy, matplotlib -- that's all!
        h. Primary Requirement: Implements Visual Steering
     6. The Visualizer
        a. Current class hierarchy
        b. The Visualizer Interface
        c. Axes management
        d. `draw()`, `poof()`, and `finalize()`
        e. A simple example of a visualizer
        f. Feature Visualizers
        g. Score Visualizers
        h. Multi-Estimator Visualizers
     9. Visual Pipelines
        a. Multiple Visualizations
        b. Interactivity
     10. Optimizing Visualization!
     11. Utilities
        a. Style management: Sequences, Palettes, Color Codes
        b. Best Fit Lines
        c. Type Detection (is_classifier, etc.)
        d. Exceptions
     12. Documentation and Sphinx
     13. Contributing
        a. Git/Branch Management
        b. Issues, milestones, and labels
        c. Waffle Board
     13. User Testing and Research

     https://flic.kr/p/5EsRYP
  2. It all started in an ivory tower somewhere. Once upon a time, you had to go to school to learn machine learning. You’d spend years and ultimately specialize in a particular model family (Bayesian methods, Gaussian processes, support vector machines, ...) and get very, very good at tuning those models.
  3. Then everything got really easy really fast. In 2010, Scikit-Learn was publicly released (by INRIA), and it started to grow very, very quickly. Suddenly there were dozens of models at the fingertips of any Python programmer.
  4. Scikit-Learn has so much going for it: models, models, models; also transformers, vectorization tools, sample datasets, and pipelines. But arguably the best part is the consistent API. You can plug the same data into nearly any of the models and it will work! So it becomes just an optimization problem: you can loop through all the models and just pick the one with the best score. It almost doesn’t matter which model you use, or why or how it’s working. http://derekerdman.com/ilovemilkshakes/january2009/DO_IT/haircuts_try_them_all.jpg
  5. Slide showing MLaaS
  6. Except that hyperparameter space is large, and grid search is slow if you don’t already know what you’re looking for: alpha/penalty for regularization; kernel function in a support vector machine; leaves or depth of a decision tree; neighbors used in a nearest neighbor classifier; clusters in a k-means clustering. https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Blindfold_%28PSF%29.png/1200px-Blindfold_%28PSF%29.png
  7. http://pages.cs.wisc.edu/~arun/vision/SIGMODRecord15.pdf
  8. The answer is visual steering. Leverage human pattern recognition and engagement. Get to better models, faster, with more insight. Overview first; zoom and filter; details on demand.
  9. Enhance model selection and evaluation. The goal is not to replace other libraries: model visualization, not data visualization.
  10. YB extends the Scikit-Learn API. This is a high dimensional visualization problem.
  11. For classification, we potentially want to see whether there is good separability. Are some features more predictive than others? The unit circle requires normalization. Points are dropped in the middle and pulled out toward the outer edges. Ordering of features matters. Automatic ordering with optimization to minimize the amount of overlap.
  12. Feature engineering requires an understanding of the relationships between features. Visualize pairwise relationships as a heatmap. Pearson shows us strong correlations => potential collinearity. Covariance helps us understand the sequence of relationships. Other correlation metrics: we just used the ones that were implemented in numpy, but we are looking to expand.
  13. Frequency distribution: top 50 tokens. Stochastic Neighbor Embedding: decomposition, then projection into a 2D scatterplot. Visual part-of-speech tagging.
  14. YB extends the Scikit-Learn API
  15. Can we quickly detect class imbalance issues? Stratified sampling, oversampling, getting more data: these tricks will help us balance. But supervised methods can mask training data; simple graphs like these give us an at-a-glance reference. As this gets into multiclass problems, domination could be harder to see and really affect modeling.
  16. Receiver operating characteristic/area under curve. Class imbalance. Classification report heatmap: quickly identify strengths & weaknesses of the model; F1 vs Type I & Type II error. Visual confusion matrix: misclassification on a per-class basis.
  17. Where/why/how is the model performing well or badly? Prediction error plot: the 45 degree line is the theoretical perfect. Residuals plot: the 0 line is no error. See a change in the amount of variance between x and y, or along the x axis => heteroscedasticity.
  18. YB extends the Scikit-Learn API
  19. Which regularization technique to use? Lasso/L1, Ridge/L2, or ElasticNet L1+L2. Regularization uses a norm to penalize complexity at a rate, alpha. The higher the alpha, the more the regularization. Complexity minimization reduces variance in the model, but increases bias. Goal: select the smallest alpha such that error is minimized. Visualize the tradeoff. Surprising to see: higher alpha increasing error, alpha jumping around, etc. Embed R2, MSE, etc. into the graph for quick reference.
  20. Want to contribute? Here’s some information about how the Yellowbrick API works. YB extends the Scikit-Learn API. Where to hook in?
  21. Estimators learn from data and have fit and predict methods.
  22. Transformers transform data and have a transform method.
  23. Visualizers can be estimators or transformers. They generally have draw, finalize, and poof methods.
  24. Contribute! Needs: new features, testers, blog posts
  25. Stickers!