The DevOps landscape is well-understood and tools can be categorised by how they support the dev-build-deploy-monitor workflow. By comparison, the MLOps landscape is complex and hard to understand. This presentation looks at the ML workflow that MLOps supports so that we can better understand the MLOps landscape.
2. Outline
1. MLOps Landscape
2. Data Science vs Programming
3. Traditional Programming E2E Workflow
4. Intro to ML E2E Workflow
5. MLOps Topics
a. Training
b. Serving
c. Monitoring
6. Advanced MLOps Challenges
7. Review
6. Running software performs actions in response to inputs.
Traditional programming codifies actions as explicit rules.
ML does not codify explicitly. Instead, rules are set indirectly by capturing patterns from data.
Different problem domains - ML more applicable to focused numerical problems.
Why So Different?
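To make the contrast concrete, here is a minimal sketch (the loan-approval rule, the numbers and the names are all invented for illustration): the first function is an explicit rule written by a programmer, the second is a rule captured from example data.

# Explicit rule: the programmer writes the decision logic directly.
def approve_loan(income_k, debt_k):
    return income_k > 30 and debt_k / income_k < 0.4

# Learned rule: the decision boundary is fitted from example data instead.
from sklearn.linear_model import LogisticRegression

X = [[25, 5], [60, 10], [40, 30], [80, 8]]   # income and debt, in thousands
y = [0, 1, 0, 1]                             # past outcomes
model = LogisticRegression().fit(X, y)
print(model.predict([[45, 9]]))              # the rule is implicit in the fitted weights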
7. Traditional programming
Think of old terminal systems
Start with hello world and add control structures
Examples
Data Science
Classification problems (e.g. cat or not cat)
Regression problems (e.g. sales from ad spend)
Start with MNIST or kaggle
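For the regression example above, a toy version could look like the sketch below (all numbers are made up; in practice you would start from a real dataset such as a kaggle one).

# Toy "sales from ad spend" regression (invented numbers, purely illustrative).
from sklearn.linear_model import LinearRegression

ad_spend = [[1.0], [2.0], [3.0], [4.0]]   # thousands spent on ads
sales    = [10.2, 19.8, 30.5, 39.9]       # thousands in resulting sales
model = LinearRegression().fit(ad_spend, sales)
print(model.predict([[5.0]]))             # estimated sales for a new spend level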
12. Dev Build Journey
Compilation
Calculator user story
As a lazy person, I want to put numerical operations into a screen so that I don’t have to work out the answers.
13. ML Build Journey
[Diagram: Training → Prediction flow, with boxes for Data, Training, Tracking, Serving, Batch, E2E and Frameworks]
Data Science Question: Can we estimate/set/benchmark employee pay from this data?
14. ML is Different - Key Points
Training data and code together drive fitting
Closest thing to an executable is a trained/weighted model (can vary with toolkit)
Retraining can be necessary (e.g. online shop and fashion trends)
Lots of data, long-running jobs
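As a minimal sketch of the "trained model as build artifact" point (the dataset and file name are invented; real projects would track this artifact against the run that produced it):

# The build artifact here is the fitted model, serialised for later use.
import pickle
from sklearn.tree import DecisionTreeClassifier

X, y = [[0, 0], [1, 1], [1, 0], [0, 1]], [0, 1, 1, 0]
model = DecisionTreeClassifier().fit(X, y)

with open("model.pkl", "wb") as f:    # the closest thing to an executable
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:    # what gets loaded again at serving time
    restored = pickle.load(f)
print(restored.predict([[1, 1]]))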
15. 1. User Story
2. Write code
3. Submit PR
4. Tests run automatically (pass/fail - see the sketch below)
5. Review and merge
6. New version builds
7. Built executable deployed to environment
8. Further tests
9. Promote to next environment
10. More tests etc.
11. PROD
12. Monitor - stacktraces or error codes
Docker as packaging. Driver is a code change (git)
Traditional Dev Workflow
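For step 4 above, a traditional CI test is deterministic pass/fail. A minimal sketch, using the calculator user story from earlier (the add function is hypothetical):

# Traditional CI test: deterministic pass/fail against a fixed expectation.
def add(a, b):
    return a + b

def test_add():
    assert add(2, 2) == 4   # either passes or fails - there is no "how well" to measure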
16. Driver might be a code change. Or new data.
Data not in git.
More experimental - data driven and you’ve only a sample of data.
Testing for quantifiable performance, not pass/fail.
Let’s focus on offline learning to simplify.
ML Workflows - Primer
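By contrast, "testing for quantifiable performance" usually means measuring a metric on held-out data and gating on a threshold. A minimal sketch (the 0.9 threshold is an arbitrary assumption for illustration):

# ML-style check: measure a metric on held-out data and compare to a threshold.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
assert accuracy > 0.9, f"model quality below threshold: accuracy={accuracy:.2f}"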
17. ML E2E Workflow Intro
1. Data inputs and outputs. Preprocessed. Large.
2. Try stuff locally with a slice.
3. Try with more data as long-running experiments.
4. Collaboration - often in jupyter & git
5. Model may be pickled/serialized
6. Integrate into a running app e.g. add REST API (serving) - see the sketch after this list
7. Integration test with app.
8. Rollout & monitor performance metrics
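For step 6, the hand-rolled version of "add a REST API" might look like the sketch below (Flask is just one option, and the file and route names are made up; dedicated serving tools, covered later, replace this kind of glue code):

# Minimal hand-rolled REST wrapper around a pickled model.
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["instances"]   # e.g. [[5.1, 3.5, 1.4, 0.2]]
    return jsonify(predictions=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(port=5000)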
18. Metrics Example
Online store example
A/B test
A leads to more conversions
But…
More negative reviews? Bounce-rate?
Interaction-level? Latency?
Krishen Siew - quora
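A toy illustration of why a single metric is not enough (every number below is invented): variant A wins on conversions but might still lose on the other signals.

# Compare A/B variants on more than one metric (all numbers invented).
variants = {
    "A": {"visits": 10000, "purchases": 520, "bounces": 4100, "bad_reviews": 12},
    "B": {"visits": 10000, "purchases": 480, "bounces": 3600, "bad_reviews": 5},
}

for name, v in variants.items():
    print(name,
          f"conversion={v['purchases'] / v['visits']:.1%}",
          f"bounce={v['bounces'] / v['visits']:.1%}",
          f"bad_reviews={v['bad_reviews']}")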
20. Role of MLOps
Empower teams and break down silos
Provide ways to collaborate/self-serve
21. New Territory
Special challenges for ML.
No clear standards yet. We’ll drill into:
1. Training - slice of data, train a weighted model to make predictions on unseen data.
2. Serving - call with HTTP.
3. Rollout and Monitoring - making sure it performs.
22. For long-running, intensive training jobs there are Kubeflow Pipelines, Polyaxon, MLflow…
Broken into steps incl. cleaning and transformation (pre-processing).
1 Training/Experimentation
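A minimal local stand-in for that step structure (pipeline platforms run comparable steps as separate, long-running jobs; the step names below are made up):

# Preprocessing and training expressed as explicit, named steps.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ("clean_and_scale", StandardScaler()),          # stand-in for cleaning/transformation
    ("train", LogisticRegression(max_iter=1000)),   # the training step itself
])
pipeline.fit(X, y)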
23. Model Training
Each step can be long-running
Continuous Delivery for Machine Learning - martinfowler.com
27. Training and CI
Some training platforms have CI integration.
Result of a run could be a model, so it is analogous to a CI build of an executable.
But how do we say that the new version is ‘good’?
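One common answer is to gate promotion on a metric comparison against the currently deployed model. A sketch (the scoring rule and the 1% tolerance are illustrative assumptions, not a standard):

# Possible "is the new model good?" gate for CI: compare against the current model.
def promote_if_better(new_model, current_model, X_test, y_test, margin=0.01):
    new_score = new_model.score(X_test, y_test)
    current_score = current_model.score(X_test, y_test)
    if new_score >= current_score - margin:
        return True   # publish the new model artifact
    raise ValueError(f"new model underperforms: {new_score:.3f} vs {current_score:.3f}")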
28. 2 Serving
Serving = use model via HTTP. Offline/batch is different.
Some platforms have serving built in, or there are dedicated solutions.
Seldon, Tensorflow Serving, AzureML, SageMaker
Often you package the model and host it (e.g. in a bucket) so the serving solution can run it.
Serving can support rollout & monitoring.
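For the "package and host" step with a pre-packaged sklearn server, the export might look like this (as I understand it, Seldon's SKLEARN_SERVER expects a model.joblib at the modelUri; the bucket path below is hypothetical):

# Export the fitted model in the format the pre-packaged server loads.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")   # filename the server looks for (assumption)
# then upload it to wherever modelUri points, e.g.
#   gsutil cp model.joblib gs://my-models/sklearn/iris/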
29. Seldon ML Serving
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
Open Source
K8s custom resource
Pods created to serve http
Docker option too
Data scientists like pickles
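Once the SeldonDeployment above is running, it is called over plain HTTP. A sketch (the host is a placeholder and the path follows Seldon's standard predictions endpoint as I recall it, so treat the URL as an assumption):

# Call the deployed model over HTTP.
import requests

payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}   # one iris-style row
resp = requests.post(
    "http://<ingress-host>/seldon/default/sklearn/api/v1.0/predictions",
    json=payload,
)
print(resp.json())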
30. 3 Rollout and Monitoring
ML model trained on sample - need to keep checking with new data coming in
Rollout strategies:
Canary = % of traffic to new version as check
A/B Test = % split between versions for longer to monitor performance
Shadowing = All traffic to old and new model. Only the live model’s responses are used
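A sketch of the shadowing idea in request-handling terms (all names here are illustrative; serving platforms implement this for you rather than in application code):

# Shadowing: both models see the request, only the live answer is returned.
import logging

def handle_request(features, live_model, shadow_model):
    live_pred = live_model.predict([features])[0]
    shadow_pred = shadow_model.predict([features])[0]
    if live_pred != shadow_pred:
        logging.info("live/shadow disagreement: %s vs %s", live_pred, shadow_pred)
    return live_pred   # only the live model's response reaches the user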
31. Canary with Seldon
kind: SeldonDeployment
apiVersion: machinelearning.seldon.io/v1alpha2
metadata:
  name: skiris
  namespace: default
spec:
  name: skiris
  predictors:
  - name: default
    graph:
      name: skiris-default
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
    replicas: 1
  - name: canary
    graph:
      name: skiris-canary
      implementation: XGBOOST_SERVER
      modelUri: gs://seldon-models/xgboost/iris
    replicas: 1
Traffic-splitting is more typically defined in gateway config.
Very common in ML.
Here it is in the serving layer, not the gateway, so the data scientist can define the rollout.
36. Advanced Topics - Governance
● Explainability - why did it predict that?
○ Some orgs stick to whitebox techniques - not neural nets
○ Explaining blackbox models is possible
● Provenance & Reproducibility (associating models to training runs to data to triggers)
○ Data versioning adds complexity
○ Competing tools for metadata
○ No agreed standards yet
● Bias & ethics
● Adversarial attacks
37. Summary
MLOps is new terrain.
ML workflows exploratory & data-driven.
MLOps enables ML workflows with:
● Data and compute-intensive experiments and training
● Artifact tracking
● Rollout strategies to work with monitoring
● Monitoring tools
Expand on metrics. Perhaps you’re recommending really controversial products. Or maybe you’re using annoying pop-ups for suggestions.
So we’re seeing that this MLOps stuff is complicated and different from traditional DevOps. One challenge is that Data Science and DevOps can be separate silos in many organisations, sometimes with a filter in between. So you get situations where a python pickle file ends up being passed to the DevOps team without any context, and the team that is meant to run the model in production naturally asks ‘what is this?’ For that situation this cartoon depicts a pretty reasonable reaction.
Other companies have a more mature setup. Here we see more particular specialisms in play. In the bottom left we’ve got the data engineers