Making Data Science Scalable - 5 Lessons Learned
Making Data Science and Machine Learning scalable is not easy:
#1 Data Science in silos is bad
#2 ML-Feature stores should be at the heart of every ML-Platform
#3 Auto ML works great if you have a Feature store
#4 Treat Data Science Projects more like Software Development
#5 Cloud-based Infrastructure makes it easy to get started
Data Science Meetup Cologne, Germany, 16 May 2019
datasolut GmbH - https://datasolut.com
1. Making Data Science Scalable
Lessons Learned from building
ML Platforms
16 May 2019
Laurenz Wuttke, Till Döhmen
“Orbital ATK Antares Launch (101410280027HQ)” by NASA HQ PHOTO is licensed under CC BY-NC-ND 2.0
2. About us
Till Döhmen
• Data Scientist / Software Engineer
• Working on RecSys & AutoML Platform
Laurenz Wuttke
• Data Scientist & Founder datasolut
• Working on RecSys & Feature Stores
• Blog: www.mlguide.de
3. Why do we need Scalability?
Rising…
• Number of Contributors
• Number of Use Cases
• Volume and Velocity of Data
• Complexity of Models
• Number of End-Users
• Frequency of Updates
4. What is a ML Platform?
• A company-wide environment that
supports Data Scientists in their daily work
• Data Preparation
• Modelling
• Evaluation
• Deployment
• Model Monitoring
• etc..
• It is built to scale in multiple dimensions with growing demands
5. Source: “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al., 2015
ML is extremely technical
9. Data Science Silos
• Notebook instances on various (local) machines
• No proper processes defined
• ML Pipeline Jungle
→ makes Machine Learning very inefficient and hard to maintain, track, and scale
→ and hard to meet business expectations!
11. Feature Stores
• Central data layer for Machine Learning Features
• Quality tested & curated
• Highly automated processes
• Efficiency for Data Science Teams (data preparation can be ~80% of the workload) → Focus on building models
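To make the idea concrete, here is a minimal in-memory sketch of what a feature store provides: one curated lookup that serves the same feature vectors to both training and scoring. The class and method names are hypothetical; real feature stores (e.g. Feast or Hopsworks) add versioning, point-in-time correctness, and offline/online serving.

```python
# Minimal feature store sketch (hypothetical API, for illustration only).
class FeatureStore:
    def __init__(self):
        self._tables = {}  # feature group name -> {entity id: feature row}

    def register(self, group, rows):
        """Store curated, quality-tested features keyed by entity id."""
        self._tables[group] = {row["id"]: row for row in rows}

    def get_features(self, group, entity_ids, columns):
        """Fetch a consistent feature vector per entity -- the same lookup
        serves training and online scoring."""
        table = self._tables[group]
        return [[table[eid][c] for c in columns] for eid in entity_ids]

store = FeatureStore()
store.register("customer", [
    {"id": 1, "recency_days": 12, "orders_90d": 3},
    {"id": 2, "recency_days": 40, "orders_90d": 1},
])
X = store.get_features("customer", [1, 2], ["recency_days", "orders_90d"])
print(X)  # [[12, 3], [40, 1]]
```

The key design point is the single curated access path: every model reads identically prepared features instead of rebuilding them per project.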
18. AutoML
• AutoML is advancing at a rapid pace
• Algorithm selection
• Hyperparameter tuning
• Model stacking
• (feature generation & selection)
• (neural architecture search)
• Usually works only on "flat" tables
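The core loop behind the first two points, algorithm selection plus hyperparameter tuning, can be sketched with plain scikit-learn; dedicated AutoML tools (e.g. auto-sklearn, H2O AutoML) automate this far more thoroughly. The search space below is an illustrative assumption.

```python
# Sketch of the core AutoML loop: search over algorithms and their
# hyperparameters, keep the best cross-validated model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search_space = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

best_score, best_model = -1.0, None
for estimator, grid in search_space:
    search = GridSearchCV(estimator, grid, cv=3).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(type(best_model).__name__, round(best_score, 3))
```

Note this operates on a single "flat" feature table, which is exactly why a feature store feeds AutoML so well.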
20. AutoML
• Add Feature Generation to your AutoML Pipeline
• Don’t be too afraid of crazy black-box models; packages like SHAP can help with interpretability
• But AutoML models are typically not optimized for runtime
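The slide recommends SHAP (which needs the `shap` package); as a lighter, dependency-free illustration of the same idea, attributing a black-box model's predictions to its input features, here is permutation importance from scikit-learn. It is a coarser technique than SHAP, shown only to make the concept tangible.

```python
# Permutation importance: shuffle one feature at a time and measure how much
# the model's score drops -- a simple black-box attribution method.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```

SHAP additionally gives signed, per-prediction attributions, which is what makes it useful for explaining individual model decisions to business stakeholders.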
[Diagram: Feature Generation → Feature Selection → AutoML]
21. #4: Treat Data Science (ML) Projects more like Software Development Projects
24. Is ML really like Software Dev.?
• ML feels more like debugging
• Experimentation-heavy
• Notebooks are the preferred mode of development
• Not easy to version-control
• Not easy to deploy
25. Model Tracking
• We need a way to keep track of experiments
• Models
• Parameters
• Evaluation results
• Other artifacts (data)
• Tools like MLflow or DVC facilitate that
• DVC is more git-like; MLflow works explicitly in code
→ Build up a (central) Model Repository
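To show what such tracking records, here is a minimal hand-rolled experiment tracker. The class and method names are hypothetical, not MLflow's or DVC's API; those tools add UIs, storage backends, and artifact versioning on top of the same core idea.

```python
# Minimal experiment tracker: one record per run with params, metrics,
# and artifact references -- the essence of a central Model Repository.
import json
import time
import uuid

class ModelRepository:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, artifacts=None):
        """Record one training run so results stay reproducible and comparable."""
        run = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
            "artifacts": artifacts or [],
        }
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric):
        """Find the run with the highest value of the given metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])

repo = ModelRepository()
repo.log_run({"model": "logreg", "C": 1.0}, {"auc": 0.81})
repo.log_run({"model": "rf", "n_estimators": 100}, {"auc": 0.86})
print(json.dumps(repo.best_run("auc")["params"]))
```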
27. CI/CD
• A long-established practice in Software Development
• We can use CI/CD software to
• Schedule training/evaluation jobs
• Run automatic tests
• Integrate our models into e.g. a Docker container
• Ship our deployments to the production environment
• Provide mechanisms for failover etc.
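The bullets above could map onto CI stages roughly as follows; this is a hedged sketch in GitLab-CI-style YAML, and all stage names, scripts, and the Docker image are illustrative assumptions, not a real project's configuration.

```yaml
# Illustrative CI/CD pipeline for an ML project (hypothetical names --
# adapt to your CI system and repository layout).
stages:
  - test
  - train
  - deploy

unit_tests:
  stage: test
  image: python:3.9
  script:
    - pip install -r requirements.txt
    - pytest tests/                       # run automatic tests

train_model:
  stage: train
  script:
    - python train.py                     # scheduled training/evaluation job

deploy_model:
  stage: deploy
  script:
    - docker build -t my-model:latest .   # integrate model into a container
    - ./deploy.sh production              # ship to the production environment
  when: manual
```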
28. Unit Testing
• (Automated) Testing & QA should be in place for production systems
• Example test cases:
• Modelling/infrastructure code for bugs
• Training process with predefined data
• Significant changes of data in the Feature Store
• Significant changes in model output
• Testing of data is challenging and an open problem → start simple
29. Monitoring
Score distributions (may) change over time
[Chart: score distribution histograms, Week 1 vs. Week 4, binned from 0 to 1]
• Validate & track your model performance constantly
• Retrain (automatically) on new data if needed
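One common way to quantify the score-distribution drift shown on the slide is the Population Stability Index (PSI). The Beta-distributed sample scores and the usual "PSI > 0.2 means significant drift" rule of thumb are illustrative assumptions to tune per use case.

```python
# PSI between two score samples, binned on the baseline distribution.
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover the full range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e = np.clip(e, 1e-6, None)                     # avoid log(0)
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
week1 = rng.beta(2, 5, 5000)   # scores at deployment time
week4 = rng.beta(3, 4, 5000)   # shifted score distribution weeks later
print(round(psi(week1, week4), 3))
```

Computing this metric on every scoring run gives an automatable trigger for the "retrain on new data if needed" step.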
33. Summary
• Don’t work in silos
• Create a feature store
• Keep track of your models
• Make use of AutoML where applicable
• Use Cloud Infrastructure if you want to start quickly
• Build your own ML Platform