Making Data Science Scalable - 5 Lessons Learned
Making Data Science and Machine Learning scalable is not easy:
#1 Data Science in silos is bad
#2 ML-Feature stores should be at the heart of every ML-Platform
#3 Auto ML works great if you have a Feature store
#4 Treat Data Science Projects more like Software Development
#5 Cloud-based Infrastructure makes it easy to get started
Data Science Meetup Cologne, Germany, 16 May 2019
datasolut GmbH - https://datasolut.com
1. Making Data Science Scalable
Lessons Learned from building
ML Platforms
16 May 2019
Laurenz Wuttke, Till Döhmen
“Orbital ATK Antares Launch (101410280027HQ)” by NASA HQ PHOTO is licensed under CC BY-NC-ND 2.0
2. About us
Till Döhmen
• Data Scientist / Software Engineer
• Working on RecSys & AutoML Platform
Laurenz Wuttke
• Data Scientist & Founder datasolut
• Working on RecSys & Feature Stores
• Blog: www.mlguide.de
3. Why do we need Scalability?
Rising…
• Number of Contributors
• Number of Use Cases
• Volume and Velocity of Data
• Complexity of Models
• Number of End-Users
• Frequency of Updates
4. What is a ML Platform?
• A company-wide environment that
supports Data Scientists in their daily work
• Data Preparation
• Modelling
• Evaluation
• Deployment
• Model Monitoring
• etc..
• It is built to scale in multiple dimensions with growing demands
5. Source: “Hidden Technical Debt in Machine Learning Systems” by D. Sculley et al., 2015
ML is extremely technical
9. Data Science Silos
• Notebook instances on various (local) machines
• No proper processes defined
• ML Pipeline Jungle
→ makes Machine Learning very inefficient and hard to maintain, track, and scale
→ and hard to meet business expectations!
11. Feature Stores
• Central data layer for Machine Learning Features
• Quality tested & curated
• Highly automated processes
• Efficiency for Data Science Teams (data preparation can be ~80% of the workload) → Focus on building models
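To make the idea concrete, here is a minimal in-memory sketch of what a feature store provides: one curated lookup that serves the same feature vectors to both training and scoring. The class and method names are hypothetical; real feature stores (e.g. Feast or Hopsworks) add versioning, point-in-time correctness, and offline/online serving.

```python
# Minimal feature store sketch (hypothetical API, for illustration only).
class FeatureStore:
    def __init__(self):
        self._tables = {}  # feature group name -> {entity id: feature row}

    def register(self, group, rows):
        """Store curated, quality-tested features keyed by entity id."""
        self._tables[group] = {row["id"]: row for row in rows}

    def get_features(self, group, entity_ids, columns):
        """Fetch a consistent feature vector per entity -- the same lookup
        serves training and online scoring."""
        table = self._tables[group]
        return [[table[eid][c] for c in columns] for eid in entity_ids]

store = FeatureStore()
store.register("customer", [
    {"id": 1, "recency_days": 12, "orders_90d": 3},
    {"id": 2, "recency_days": 40, "orders_90d": 1},
])
X = store.get_features("customer", [1, 2], ["recency_days", "orders_90d"])
print(X)  # [[12, 3], [40, 1]]
```

The key design point is the single curated access path: every model reads identically prepared features instead of rebuilding them per project.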
18. AutoML
• AutoML is advancing at a rapid pace
• Algorithm selection
• Hyperparameter tuning
• Model stacking
• (feature generation & selection)
• (neural architecture search)
• Usually works only on "flat" tables
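The core loop behind the first two points, algorithm selection plus hyperparameter tuning, can be sketched with plain scikit-learn; dedicated AutoML tools (e.g. auto-sklearn, H2O AutoML) automate this far more thoroughly. The search space below is an illustrative assumption.

```python
# Sketch of the core AutoML loop: search over algorithms and their
# hyperparameters, keep the best cross-validated model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search_space = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

best_score, best_model = -1.0, None
for estimator, grid in search_space:
    search = GridSearchCV(estimator, grid, cv=3).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(type(best_model).__name__, round(best_score, 3))
```

Note this operates on a single "flat" feature table, which is exactly why a feature store feeds AutoML so well.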
20. AutoML
• Add Feature Generation to your AutoML Pipeline
• Don’t be too afraid of crazy black-box models; packages like SHAP can help with interpretability
• But AutoML models are typically not optimized for runtime
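The slide recommends SHAP (which needs the `shap` package); as a lighter, dependency-free illustration of the same idea, attributing a black-box model's predictions to its input features, here is permutation importance from scikit-learn. It is a coarser technique than SHAP, shown only to make the concept tangible.

```python
# Permutation importance: shuffle one feature at a time and measure how much
# the model's score drops -- a simple black-box attribution method.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```

SHAP additionally gives signed, per-prediction attributions, which is what makes it useful for explaining individual model decisions to business stakeholders.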
[Diagram: Feature Generation → Feature Selection → AutoML]
21. #4: Treat Data Science (ML) Projects more like Software Development Projects
24. Is ML really like Software Dev.?
• ML feels more like debugging
• Experimentation-heavy
• Notebooks are the preferred mode of development
• Not easy to version-control
• Not easy to deploy
25. Model Tracking
• We need a way to keep track of experiments
• Models
• Parameters
• Evaluation results
• Other artifacts (data)
• Tools like MLflow or DVC facilitate that
• DVC is more git-like; MLflow works explicitly in code
→ Build up a (central) Model Repository
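To show what such tracking records, here is a minimal hand-rolled experiment tracker. The class and method names are hypothetical, not MLflow's or DVC's API; those tools add UIs, storage backends, and artifact versioning on top of the same core idea.

```python
# Minimal experiment tracker: one record per run with params, metrics,
# and artifact references -- the essence of a central Model Repository.
import json
import time
import uuid

class ModelRepository:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, artifacts=None):
        """Record one training run so results stay reproducible and comparable."""
        run = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
            "artifacts": artifacts or [],
        }
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric):
        """Find the run with the highest value of the given metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])

repo = ModelRepository()
repo.log_run({"model": "logreg", "C": 1.0}, {"auc": 0.81})
repo.log_run({"model": "rf", "n_estimators": 100}, {"auc": 0.86})
print(json.dumps(repo.best_run("auc")["params"]))
```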
27. CI/CD
• A long-established practice in Software Development
• We can use CI/CD software to
• Schedule training/evaluation jobs
• Run automatic tests
• Integrate our models into e.g. a Docker container
• Ship our deployments to the production environment
• Provide mechanisms for failover etc.
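The bullets above could map onto CI stages roughly as follows; this is a hedged sketch in GitLab-CI-style YAML, and all stage names, scripts, and the Docker image are illustrative assumptions, not a real project's configuration.

```yaml
# Illustrative CI/CD pipeline for an ML project (hypothetical names --
# adapt to your CI system and repository layout).
stages:
  - test
  - train
  - deploy

unit_tests:
  stage: test
  image: python:3.9
  script:
    - pip install -r requirements.txt
    - pytest tests/                       # run automatic tests

train_model:
  stage: train
  script:
    - python train.py                     # scheduled training/evaluation job

deploy_model:
  stage: deploy
  script:
    - docker build -t my-model:latest .   # integrate model into a container
    - ./deploy.sh production              # ship to the production environment
  when: manual
```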
28. Unit Testing
• (Automated) Testing & QA should be in place for production systems
• Example test cases:
• Modelling/infrastructure code for bugs
• Training process with predefined data
• Significant changes of data in the Feature Store
• Significant changes in model output
• Testing of data is challenging and an open problem → start simple
29. Monitoring
Score distributions (may) change over time
[Chart: score distribution histograms, Week 1 vs. Week 4, binned from 0 to 1]
• Validate & track your model performance constantly
• Retrain (automatically) on new data if needed
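One common way to quantify the score-distribution drift shown on the slide is the Population Stability Index (PSI). The Beta-distributed sample scores and the usual "PSI > 0.2 means significant drift" rule of thumb are illustrative assumptions to tune per use case.

```python
# PSI between two score samples, binned on the baseline distribution.
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover the full range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e = np.clip(e, 1e-6, None)                     # avoid log(0)
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
week1 = rng.beta(2, 5, 5000)   # scores at deployment time
week4 = rng.beta(3, 4, 5000)   # shifted score distribution weeks later
print(round(psi(week1, week4), 3))
```

Computing this metric on every scoring run gives an automatable trigger for the "retrain on new data if needed" step.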
33. Summary
• Don’t work in silos
• Create a feature store
• Keep track of your models
• Make use of AutoML where applicable
• Use Cloud Infrastructure if you want to start quickly
• Build your own ML Platform