
EuroSciPy 2019: Visual diagnostics at scale


The hunt for the most effective machine learning model is hard enough with a modest dataset, and much more so as our data grow! As we search for the optimal combination of features, algorithm, and hyperparameters, we often use tools like histograms, heatmaps, embeddings, and other plots to make our processes more informed and effective. However, large, high-dimensional datasets can prove particularly challenging. In this talk, we'll explore a suite of visual diagnostics, investigate their strengths and weaknesses in the face of increasingly big data, and consider how we can steer the machine learning process, not only purposefully but at scale!


EuroSciPy 2019: Visual diagnostics at scale

  1. Visual Diagnostics at Scale (EuroSciPy 2019)
  2. Dr. Rebecca Bilbro, Chief Data Scientist, ICX Media; Co-creator, Scikit-Yellowbrick; Author, Applied Text Analysis with Python; @rebeccabilbro
  3. A tale of three datasets
  4. Census Dataset: 500K instances, 50 features (age, occupation, education, sex, ethnicity, marital status). Sarcasm Dataset: 50K instances, 5K features (“love”, 🙄, “totally”, “best”, “surprise”, “Sherlock”, capitalization, timestamp). Sensor Dataset: 5M instances, 15 features (ammonia, acetaldehyde, acetone, ethylene, ethanol, toluene, in ppmv).
  5. Scaling pain points are dataset-specific: ● Many features ● Many instances ● Feature variance ● Heteroskedasticity ● Covariance ● Noise
  6. Logistic Regression fit times (seconds), 500-5M instances / 5-50 features: 10 seconds
  7. Multilayer Perceptron fit times (seconds), 500-5M instances / 5-50 features: 5 minutes, 48 seconds
  8. Support Vector Machine fit times (seconds), 500-500K instances / 5-50 features: 5 hours, 24 seconds
  9. Support Vector Machine fit times (seconds), 500-500K instances / 5-50 features: 5 hours, 24 seconds 😵
  10. How to optimize? ● Be patient ● Be wrong ● Be rich ● Steer
  11. Adventures in Model Visualization
  12. import matplotlib.pyplot as plt
      from sklearn.datasets import load_iris
      from yellowbrick.features import ParallelCoordinates

      data = load_iris()

      # fast=True groups the points by class and draws each class as a
      # single aggregated segment rather than one line per point
      oz = ParallelCoordinates(fast=True)
      oz.fit_transform(data.data, data.target)
      oz.poof()

      [Figure panels: each point drawn individually as a connected line segment; with standardization; points grouped by class, each class drawn as a single segment]
  13. Bumps
  14. The scikit-learn API:

      class Estimator(object):
          def fit(self, X, y=None):
              """Fit the estimator to data."""
              # set state of self
              return self

          def predict(self, X):
              """Predict the response for X."""
              # compute predictions pred
              return pred

      class Transformer(Estimator):
          def transform(self, X):
              """Transform the input data."""
              # transform X to X_prime
              return X_prime

      class Pipeline(Transformer):
          @property
          def named_steps(self):
              """Return the sequence of estimators."""
              return self.steps

          @property
          def _final_estimator(self):
              """Terminating estimator of the pipeline."""
              return self.steps[-1]
  15. The Yellowbrick API:

      # Inside the Visualizer base class (sketch):
      class Visualizer(Estimator):
          def draw(self):
              """Draw is called from the scikit-learn methods."""
              return self.ax

          def finalize(self):
              self.set_title()
              self.legend()

          def poof(self):
              self.finalize()
              plt.show()

      # Writing a custom visualizer:
      import matplotlib.pyplot as plt
      from yellowbrick.base import Visualizer

      class MyVisualizer(Visualizer):
          def __init__(self, ax=None, **kwargs):
              super(MyVisualizer, self).__init__(ax=ax, **kwargs)

          def fit(self, X, y=None):
              self.draw(X)
              return self

          def draw(self, X):
              if self.ax is None:
                  self.ax = plt.gca()
              self.ax.plot(X)

          def finalize(self):
              self.set_title("My Visualizer")
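      A quick usage sketch for the custom visualizer above (the data here is made up):

      import numpy as np

      X = np.random.rand(100)   # hypothetical 1D data
      viz = MyVisualizer()
      viz.fit(X)                # fit() calls draw() under the hood
      viz.poof()                # finalize() the figure, then plt.show()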
  16. Oneliners:

      # X_train, y_train, X_test, y_test are defined elsewhere

      # Option 1: scikit-learn style
      from sklearn.linear_model import Lasso
      from yellowbrick.regressor import ResidualsPlot

      viz = ResidualsPlot(Lasso())
      viz.fit(X_train, y_train)
      viz.score(X_test, y_test)
      viz.poof()

      # Option 2: quick method
      from sklearn.linear_model import Lasso
      from yellowbrick.regressor import residuals_plot

      viz = residuals_plot(Lasso(), X_train, y_train, X_test, y_test)
  17. Visual Comparison Tests
  18. ============================ test session starts ============================
      platform darwin -- Python 3.7.1, pytest-5.0.0, py-1.8.0, pluggy-0.12.0
      rootdir: /Users/rbilbro/pyjects/yb, inifile: setup.cfg
      plugins: flakes-4.0.0, cov-2.7.1
      collected 932 items
      tests/__init__.py s...                                               [  0%]
      tests/base.py s                                                      [  0%]
      tests/conftest.py s                                                  [  0%]
      tests/fixtures.py s                                                  [  0%]
      tests/images.py s                                                    [  0%]
      tests/rand.py s                                                      [  0%]
      tests/test_base.py s............                                     [  2%]
      ...
      tests/test_utils/test_target.py s............                        [ 68%]
      tests/test_utils/test_timer.py s.....                                [ 68%]
      tests/test_utils/test_types.py s...                                  [ 70%]
      tests/test_utils/test_wrapper.py s....
      ===== 854 passed, 72 skipped, 6 xfailed, 33 warnings in 225.96 seconds =====
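      Those counts come from Yellowbrick's image-comparison test suite. As a rough sketch of what such a test looks like, assuming Yellowbrick's in-repo VisualTestCase harness and its assert_images_similar helper (the train/test fixtures here are hypothetical):

      from sklearn.linear_model import Lasso
      from yellowbrick.regressor import ResidualsPlot
      from tests.base import VisualTestCase   # Yellowbrick's test harness

      class TestResidualsPlot(VisualTestCase):
          def test_residuals_plot(self):
              # hypothetical fixtures: a train/test split provided elsewhere
              viz = ResidualsPlot(Lasso())
              viz.fit(self.X_train, self.y_train)
              viz.score(self.X_test, self.y_test)
              viz.finalize()
              # fails if the rendered figure drifts from the stored
              # baseline image by more than the tolerance
              self.assert_images_similar(viz, tol=1.0)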
  19. Roadmap
  20. Brushing and filtering: workable with only 5 features, but not good with 23. A filtering sketch follows.
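      Until interactive brushing lands, one workaround is to filter the columns by hand before drawing. A minimal sketch with made-up sensor data (all names and values here are hypothetical):

      import numpy as np
      import pandas as pd
      from yellowbrick.features import ParallelCoordinates

      # hypothetical wide dataset: 23 noisy sensor channels, two classes
      rng = np.random.RandomState(42)
      cols = ["sensor_{}".format(i) for i in range(23)]
      X = pd.DataFrame(rng.rand(200, 23), columns=cols)
      y = rng.randint(0, 2, 200)

      # manually "filter" down to a handful of columns before drawing
      keep = ["sensor_0", "sensor_3", "sensor_7", "sensor_12"]
      oz = ParallelCoordinates(features=keep, fast=True)
      oz.fit_transform(X[keep].values, y)
      oz.poof()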
  21. Machine learning-oriented aggregation: YB (current) vs. Seaborn
  22. Parallelization with joblib: Elbow Curve and Validation Curve (see the sketch below)
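      Both of these visualizers fit one model per parameter value (and per CV fold), so the fits can be fanned out with joblib. A sketch using ValidationCurve's n_jobs parameter, on made-up regression data:

      import numpy as np
      from sklearn.datasets import make_regression
      from sklearn.tree import DecisionTreeRegressor
      from yellowbrick.model_selection import ValidationCurve

      # hypothetical data
      X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

      # one model is fit per parameter value per CV fold; n_jobs=-1 fans
      # the fits out across all available cores via joblib
      viz = ValidationCurve(
          DecisionTreeRegressor(),
          param_name="max_depth",
          param_range=np.arange(1, 11),
          cv=5,
          scoring="r2",
          n_jobs=-1,
      )
      viz.fit(X, y)
      viz.poof()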
  23. Visual Pipelines:

      from yellowbrick.features import Rank2D
      from yellowbrick.pipeline import VisualPipeline
      from yellowbrick.model_selection import CVScores
      from yellowbrick.regressor import PredictionError

      # model, features, and cv are assumed to be defined elsewhere
      viz_pipe = VisualPipeline([
          ('rank2d', Rank2D(features=features, algorithm='covariance')),
          ('prederr', PredictionError(model)),
          ('cvscores', CVScores(model, cv=cv, scoring='r2')),
      ])
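      Presumably the pipeline is then fit once and each diagnostic rendered in turn; a sketch only, since VisualPipeline was still a roadmap item at the time of the talk (X_train and y_train are hypothetical):

      viz_pipe.fit(X_train, y_train)
      viz_pipe.poof()   # render each visualizer's figure in turn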
  24. Models are aggregations; so are visualizations...
  25. …so use visualizations to steer model selection!
  26. scikit-yb.org
