6. 6
Temperature check: who has...
● trained an ML model before?
● deployed an ML model for fun?
● deployed an ML model at work?
● an automated deployment pipeline for ML models?
8. 8
What today’s talk is about
Share principles and practices that can make it easier for teams to iteratively deploy better ML products
Share what to strive towards, and how to strive towards it
9. 9
Standing on the shoulders of giants
● @jezhumble
● @davefarley77
● @mat_kelcey
● @codingnirvana
● @kief
14. 15
Why deploy frequently?
● Iteratively improve our model (training with new {data, hyperparameters, features})
● Correct any biases
● Counter model decay
● If it’s hard, do it more often
16. 17
Why deploy safely?
● ML models affect decisions that impact lives… in real time
● Hippocratic oath for us: do no harm
● Safety enables us to iteratively improve ML products that better serve people
17. 18
Machine learning is only one part of the problem/solution
Source: Hidden Technical Debt in Machine Learning Systems (Google, 2015)
[Diagram: finding the right business problem to solve → collecting data / data engineering → training ML models → deploying and monitoring ML models (focus of this talk)]
18. 19
Goal of today’s talk
[Diagram: the :-( path is notebook / playground → PROD (maybe); the :-) path is Continuous Delivery: commit and push → Experiment / Develop → Test → Deploy → Monitor]
19. 4. So, how do we get there?
Challenges (and solutions from Continuous Delivery practices)
20. 21
Our story’s main characters
Mario the data scientist
Luigi the engineer
[Diagram: local → PROD]
21. Key concept: CI/CD Pipeline
[Pipeline diagram: Local env, push to Version control, triggers: Run unit tests → Train and evaluate model → Deploy candidate model to STAGING, then a manual trigger to Deploy model to PROD; stages read from the data / feature repository and write to the model repository, with feedback flowing back at every step]
Source: Continuous Delivery (Jez Humble, Dave Farley)
22. #1: Automated configuration management
Challenge
● Snowflake (dev) environments
● “Works on my machine!”
Solution
● Single-command setup
● Version control all dependencies and configuration
Benefits
● Enable experimentation by all teammates
● Production-like environment == discover potential deployment issues early on
[Pipeline so far: dev]
24. #2: Test pyramid
Challenge
● How can I ensure my changes haven’t broken anything?
● How can I enforce the “goodness” of our models?
Solution
● Testing strategy (a sample ML metrics test is sketched below)
● Test every method
Benefits
● Fast feedback
● Safety harness allows the team to boldly try new things / refactor
[Test pyramid, bottom to top: unit tests → narrow/broad integration tests → ML metrics tests → manual tests; all layers below manual tests are automated]
[Pipeline so far: dev]
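Where an ML metrics test can live in practice: a minimal pytest sketch, assuming hypothetical train_model and load_eval_data helpers and an illustrative accuracy floor.

# test_model_metrics.py - illustrative ML metrics test (hypothetical helpers)
from sklearn.metrics import accuracy_score
from model import train_model, load_eval_data  # hypothetical project modules

def test_model_beats_agreed_floor():
    X_eval, y_eval = load_eval_data()
    model = train_model()
    accuracy = accuracy_score(y_eval, model.predict(X_eval))
    # Fail the build if the candidate model falls below the agreed floor.
    assert accuracy >= 0.80, f"accuracy {accuracy:.3f} is below the 0.80 floor"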
26. #3: Continuous integration (CI) pipeline for automated testing
Challenge
● Everyone may not run tests; “goodness” checks are done manually
● We could deploy {bugs, errors, bad models} to production
Solution
● CI/CD pipeline: automates unit tests → train → test → deploy (to staging), e.g. via a single entry point as sketched below
● Every code change is tested (assuming tests exist)
● Source code as the only source of software/models
Benefits
● Fast feedback
[Pipeline so far: dev → VCS → unit tests → train & test]
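One way the CI stages can be wired together: a sketch of a single script that CI runs on every push; the file names and the train.py contract are assumptions.

# ci_checks.py - illustrative CI entry point: unit tests, then train & evaluate
import subprocess
import sys

def main():
    # Stage 1: unit tests (pytest assumed as the test runner).
    if subprocess.run(["pytest", "tests/"]).returncode != 0:
        sys.exit("unit tests failed")
    # Stage 2: train and evaluate; train.py is assumed to exit non-zero
    # when the ML metrics tests fail.
    if subprocess.run([sys.executable, "train.py", "--evaluate"]).returncode != 0:
        sys.exit("train & evaluate stage failed")

if __name__ == "__main__":
    main()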
28. #4: Artifact versioning
Challenge
● How can we revert to previous models?
● Retraining == time-consuming
● Manual renaming/redeployment of old models (if we still have them)
Solution
● Build your binaries once
● Tag each artifact with metadata (training data, hyperparameters, datetime); see the sketch below
Benefits
● Save on build times
● Confidence in the artifact increases down the pipeline
● Metadata enables reproducibility
[Pipeline so far: dev → VCS → unit tests → train & test → version artifact]
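What tagging an artifact with metadata can look like: a sketch using joblib plus a JSON sidecar; the fields, paths, and function name are illustrative.

# version_artifact.py - illustrative: persist model + metadata side by side
import json
from datetime import datetime, timezone
import joblib

def save_versioned_model(model, hyperparameters, training_data_version, out_dir="artifacts"):
    # Timestamp doubles as the artifact version (out_dir assumed to exist).
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    joblib.dump(model, f"{out_dir}/model-{version}.joblib")
    metadata = {
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data_version": training_data_version,
    }
    with open(f"{out_dir}/model-{version}.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return version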
29. #5: Continuous delivery (CD) pipeline for automated deployment
Challenge
● Deployments are scary
● Manual deployments == potential for mistakes
Solution
● Automated deployments triggered by the pipeline
● Single-command deployment to staging/production
● Eliminate manual deployments
Benefits
● More rehearsal == more confidence
● Disaster recovery: (single-command) deployment of the last good model in production
[Pipeline so far: dev → VCS → unit tests → train & test → version artifact → deploy-staging]
30. 33
#5: CD pipeline for automated deployment (Demo)
# Deploy model (the actual model)
gcloud beta ml-engine versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --origin $DEPLOYMENT_SOURCE \
  --runtime-version=1.5 \
  --framework $FRAMEWORK \
  --python-version=3.5
31. 34
#5: CD pipeline for automated deployment (Demo)
# Deploy to prod
gcloud ml-engine versions set-default $version_to_deploy_to_prod \
  --model=$MODEL_NAME
32. #6: Canary releases + monitoring
Challenge
● How can I know if I’m deploying a better / worse model?
● Deployment to production may not work as expected
Solution
● Request shadowing pattern (credit: @codingnirvana); see the sketch below
Benefits
● Confidence increases along the pipeline, backed by metrics
● Monitoring in production == important source of feedback
[Pipeline so far: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod]
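A minimal sketch of the request shadowing pattern: the production model serves the response while the candidate model sees a copy of the same request; prod_model, candidate_model, and the logging sink are all assumptions.

# shadow.py - illustrative request shadowing: prod answers, candidate is compared later
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(prod_model, candidate_model, features):
    prod_prediction = prod_model.predict(features)  # this is what the caller gets
    try:
        candidate_prediction = candidate_model.predict(features)
        # Log both predictions so they can be compared against eventual labels.
        logger.info("prod=%s candidate=%s features=%s",
                    prod_prediction, candidate_prediction, features)
    except Exception:
        # Never let the shadow path break production serving.
        logger.exception("candidate model failed on shadowed request")
    return prod_prediction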
34. #7: Start simple (tracer bullet)
Challenge
● Complex models == longer time to develop / debug
● Getting all the “right” features == weeks / months
Solution
● Start with a simple model + simple features (see the baseline sketch below)
● Create a solid pipeline first
● But not simpler than what is required (and don’t take expensive shortcuts)
Benefits
● Discover integration issues/requirements sooner
● Demonstrate working software to stakeholders in less time
[Pipeline: dev]
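What “start simple” can look like in code: a scikit-learn baseline that is trivial to train and deploy, useful for proving out the pipeline end to end before investing in a complex model. A sketch; the data loading is assumed to happen elsewhere.

# baseline.py - illustrative tracer-bullet model: the simplest thing that exercises the pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

def train_baseline(X_train, y_train, use_dummy=False):
    # A majority-class dummy gives the floor; logistic regression with a
    # handful of features is often a good first "real" tracer bullet.
    if use_dummy:
        model = DummyClassifier(strategy="most_frequent")
    else:
        model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model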
36. #8: Collect more and better data with every release
Challenge
● Data collection is hard
● Garbage in, garbage out
Solution
● Think about how you can collect labels (immediately or eventually) after serving predictions (credit: @mat_kelcey); a logging sketch follows below
● Create bug reports for clients
● Complete the data pipeline cycle
● Caution: watch for attempts to game your ML system
Benefits
● More and better data. Nuff said.
[Pipeline so far: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod → deploy-prod]
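Closing the data loop starts with making every served prediction joinable with the label that arrives later: a sketch, with the store and record schema as assumptions.

# prediction_log.py - illustrative: log predictions with an ID so labels can be joined later
import json
import uuid
from datetime import datetime, timezone

def log_prediction(features, prediction, model_version, log_file="predictions.jsonl"):
    record = {
        "prediction_id": str(uuid.uuid4()),  # join key for the eventual label
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["prediction_id"]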
37. #9: Build cross-functional teams
Challenge
● How can we do all of the above?
Solution
● Build cross-functional teams (data scientist, data engineer, software engineer, UX, BA)
Benefits
● Fewer nails (because not everyone is a hammer)
● Improved empathy + fewer silos == productivity
[Pipeline so far: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod → deploy-prod]
38. #10: Kaizen mindset
Challenge
● How can we do all of the above?
Solution
● Kaizen == 改善 == change for better
● Go through deployment health checklists as a team
Benefits
● Iteratively get to good
[Pipeline so far: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod → deploy-prod]
39. 43
#10: Kaizen - Health checklists
❏ General software engineering practices
❏ Source control (e.g. git)
❏ Unit tests
❏ CI pipeline to run automated tests
❏ Automated deployments
❏ Data / feature-related tests
❏ Test all code that creates input features, both in training and serving
❏ ...
❏ Model-related tests
❏ Test against a simpler model as a baseline
❏ ...
Source: A rubric for ML production systems (Google, 2016)
40. 44
#10: Kaizen - Health checks
● How much calendar time to deploy a model from staging to production?
● How much calendar time to add a new feature to the production model?
● How comfortable does your team feel about iteratively deploying
models?
43. A generalizable approach for deploying ML models frequently and safely
[Pipeline diagram: Local env, push to Version control, triggers: Run unit tests → Train and evaluate model → Deploy candidate model to STAGING, then a manual trigger to Deploy model to PROD; stages read from the data / feature repository and write to the model repository, with feedback flowing back at every step]
Credit: Continuous Delivery (Jez Humble, Dave Farley)
44. 48
Solve the right problem
We don’t have a machine learning problem.
We have a {business, data, software delivery, ML, UX} problem.
45. 49
Solve the right problem
[Diagram: 01 Data collection → 02 Machine learning → 03 Deployment and monitoring (focus of today’s talk)]
46. 50
How to deploy models to prod {frequently, safely, repeatably, reliably}?
1. Automate configuration management
2. Think about your test pyramid
3. Set up a continuous integration (CI) pipeline
4. Version your artifacts (i.e. models)
5. Automate deployments
6. Try canary releases
7. Start simple (tracer bullet)
8. Collect more and better data with every release
9. Build cross-functional teams
10. Kaizen / continuous improvement
49. 53
Resources for further reading
● Visibility and monitoring for machine learning (12-min video)
● Using continuous delivery with machine learning models to tackle fraud
● What’s your ML Test Score? A rubric for ML production systems (Google)
● Rules of Machine Learning (Google)
● Continuous Delivery (Jez Humble, Dave Farley)
● Why you need to improve your training data and how to do it
50. Backup materials /
miscellaneous stuff
This section is for placing any slides / ideas that may eventually make it into the actual presentation
51. 55
Detailed outline
In the talk, we will show how we constructed our CI/CD/data pipelines, which consist of the following tasks:
- Data pipeline
- Get data
- Transform/preprocess data
- Write to feature “repository”
- Local/dev
- Flesh out dev workflow. How can devs experiment / train / debug models?
- ML
- Get a slice of data
- Train model
- Evaluate model
- Web service
- CI - build and test stage
- Train and evaluate model on more data
- CI - deploy stage
- If tests pass, automatically deploy/promote artifact to staging
- Artifact should contain metadata that can help devs decide whether this new model is better than the older model that’s in production (e.g. precision, accuracy, RMSE, training data-related metadata)
- Manual (one-click) deploy to production
- CI - Monitoring
- Monitor model's predicted values against real bitcoin values (and against existing model in production)
- Canary deployments / dark launches / request shadowing
- Kill switch / rollback: rollback to last known “good” model
52. 58
TODOs
● Goal of presentation
○ How to continuously, quickly and safely deploy machine learning models in production
○ Patterns for deploying ML models
■ Data to read → Model to train → Artifacts to promote → Deploy → Monitoring
● Format
○ Run each step manually
○ Talk about what each step is doing / trying to achieve
○ Demo the “CI” version (Just Push)
● Build demo app
● Collect learnings
53. 59
Sketch out deployment pipeline
● Simple version (train → validate → deploy)
● More complicated version (+ AB testing)
Principles first
Supported with tools and tech
54. 60
Target audience of the talk:
● ML enthusiasts; people who’ve been training/evaluating ML models in Jupyter notebooks but who cannot go beyond that because (i) they can’t get data or (ii) they’re not familiar with deploying web services
56. 62
Deployment checklist (link)
This checklist should result in scripts/procedures needed to reliably and repeatedly deploy the
application into the production environment
• The steps required to deploy the application for the first time
• How to smoke-test the application and any services it uses as part of the deployment process
• The steps required to back out the deployment should it go wrong
• The steps required to back up and restore the application’s state
• The steps required to upgrade the application without destroying the application’s state
• The steps to restart or redeploy the application should it fail
• The location of the logs and a description of the information they contain
• The methods of monitoring the application
• The steps to perform any data migrations that are necessary as part of the release
• An issue log of problems from previous deployments, and their solutions
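As an illustration of the smoke-test item above: a minimal sketch that checks a deployed prediction endpoint responds sensibly; the URL, payload, and response shape are assumptions.

# smoke_test.py - illustrative post-deployment smoke test for a prediction service
import sys
import requests

ENDPOINT = "https://example.com/predict"  # hypothetical prediction endpoint

def main():
    payload = {"features": [1.0, 2.0, 3.0]}  # a known-valid example input
    response = requests.post(ENDPOINT, json=payload, timeout=10)
    if response.status_code != 200 or "prediction" not in response.json():
        sys.exit(f"smoke test failed: {response.status_code} {response.text}")
    print("smoke test passed:", response.json())

if __name__ == "__main__":
    main()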
57. 63
Deployment checklist: II
• An asset and configuration management strategy.
• A description of the technology used for deployment. This should be agreed upon by both the operations and
development teams.
• A plan for implementing the deployment pipeline.
• An enumeration of the environments available for acceptance, capacity, integration, and user acceptance testing,
and the process by which builds will be moved through these environments.
• Requirements for monitoring the application, including any APIs or services the application should use to notify
the operations team of its state.
• Description of the integration with any external systems. At what stage and how are they tested as part of a
release? How do the operations personnel communicate with the provider in the event of a problem?
• Details of logging so that operations personnel can determine the application’s state and identify any error
conditions.
• The service-level agreements for the software, which will determine whether the application will require
techniques like failover and other high-availability strategies.
• How the initial deployment to production works.
58. ❏ General software engineering practices
❏ Source control (e.g. git)
❏ Unit tests
❏ CI/CD pipeline
❏ Run automated tests
❏ Automated, or one-step manual deployments
❏ An ability to conduct experiments comparing different system versions
❏ Data / feature-related tests
❏ Test that the distributions of each feature match your expectations
❏ Test that a model does not contain any features that have been manually
determined as unsuitable for use
❏ Test that your system maintains privacy controls across its entire data
pipeline
❏ Test all code that creates input features, both in training and serving
❏ Model-related tests
Model deployment readiness checklist (source)
60. ❏ ML Infrastructure tests
❏ Test the reproducibility of training
❏ Unit test model specification code
❏ Integration test the full ML pipeline
❏ Test model quality before attempting to serve it
❏ Test that a single example or training batch can be sent to the model, and changes to internal state can
be observed from training through to prediction
❏ Test models via a canary process before they enter production serving environments
❏ Test how quickly and safely a model can be rolled back to a previous serving version
❏ Monitoring tests
❏ Test for upstream instability in features, both in training and serving.
❏ Test that data invariants hold in training and serving inputs.
❏ Test that your training and serving features compute the same values (i.e. training-serving skew)
❏ Test for model staleness
❏ Test for NaNs or infinities appearing in your model during training or serving
❏ Test for dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage
❏ Test for regressions in prediction quality on served data
Model deployment readiness checklist (source)
I’m David and here’s Ramsey, and we’re going to share about how you can deploy ML models to production frequently and safely.
Note to self:
“A talk is more about telling a story around a topic, changing people’s perspective, inspiring them to try something else, and giving them the tools for that.”
Empathize with the audience. Don’t preach.
Note: use “we”, rather than “you”.
Got an idea (e.g. NLP sentiment analysis). Followed an ML tutorial.
Built a model.
Asked to deploy. (click) “You want me to… what?” Bombarded with questions. How do I deploy? How do I load new data? How do I call .predict() without hitting shift+enter? How do I vectorize user input strings before passing them to the model?
We’re stumped. We don’t know where to start. We give up.
Before we go on, we want to take a quick temperature check
Bear this question in mind throughout the talk
Most of these are not ideas that Ramsey and I thought of. They are practices that these smart folks have thought of, and that have been tried and tested at our clients.
We built a sample app
What it does
Why we chose this stack / data source
How you can use it
To make this tangible, we’ve had to pick a stack. But focus on the patterns, and not our implementation
We built a demo so that we can have code to illustrate some points, but we ran out of time.
So for the last few points, we’ll talk about concepts and how we would implement them.
Just read the title. Don’t talk too much here.
Use fraud detection as an example.
Share about tracer bullet idea here
In other programming languages / frameworks, when we build something, we can share a link on Twitter and the rest of the world can use it.
In ML, in my experience, people just share screenshots of the loss curve (insert picture) or some object detection bounding boxes (insert pictures).
This is the problem facing many of us today.
We have tons of ML tutorials for local environments / Jupyter notebooks, but very few / none about serving those models or the continuous delivery/evolution of these models.
Until something is in production, it creates value for no one except ourselves
Model decay (our model can get stale / dangerous)
Deploying frequently allows us to make iterative improvements to our model (training with new {data, hyperparameters, features})
Cars, phones, IKEA chairs go through multiple rounds of testing. Why should ML models be any different?
The irony is that ML has already started to impact all of our lives, but testing and safety are things that we rarely talk about in ML.
ML models affect decisions that impact lives… in real-time
Safety is essential
Goal of today’s talk (in pictures)
“OK, David, I’m sold on why this frequent and safe deployment thing is important. But what does it look like in practice?”
CI/CD pipeline - The main vehicle for everything we’re sharing today
It’s all about feedback
30 seconds - quick overview of this.
The model goes through different stages
Each of them solves a different problem, which we’ll talk about next
Generalizable approach: we can see it working for classifiers, regression models, deep learning models, NLP models, etc.
Snowflake
Every dataset is unique, non-reproducible, hand-cleaned with TLC
Challenge
Brittle glue code in ML
Unit tests
At lower levels, check edge cases, add more tests for all that
At higher levels, check happy path and integration
Skip if people get CI pipeline
Deployment
Provisioning
Configuration
Deploying your app
Tracer bullet
Deploying a simple thing is easier than a complex thing
Focus on deploying first. Focus on deployment pipeline. Don’t get distracted. We can come back to tuning models later
Benefits
Monitoring === important source of feedback
Find out when models are getting stale / dangerous
LIME - Local Interpretable Model-Agnostic Explanations
Caveat: monitoring ML metrics can be challenging because labels take time to arrive.
Training-serving skew: where the data seen at serving time differs in some way from the data used to train the model, leading to reduced prediction quality.
Talk about just the first bullet
Pyception (Anaconda 2018 video) - a battle between data scientists and software engineers
Generalizable approach: we can see it working for classifiers, regression models, deep learning models, NLP models, etc.
Data / feature-related tests
Test that the distributions of each feature match your expectations. One example might be to test that Feature A takes on values 1 to 5, or that the two most common values of Feature B are "Harry" and "Potter" and they account for 10% of all values. This test can fail due to real external changes, which may require changes in your model.
Test that a model does not contain any features that have been manually determined as unsuitable for use. A feature might be unsuitable when it’s been discovered to be unreliable, overly expensive, etc. Tests are needed to ensure that such features are not accidentally included (e.g. via copy-paste) into new models.
Test that your system maintains privacy controls across its entire data pipeline. While strict access control is typically maintained on raw data, ML systems often export and transform that data during training. Test to ensure that access control is appropriately restricted across the entire pipeline.
Test all code that creates input features, both in training and serving. It can be tempting to believe feature creation code is simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital.
Model-related tests
Test that every model specification undergoes a code review and is checked in to a repository
Test the relationship between offline proxy metrics and the actual impact metrics. For example, how does a one-percent improvement in accuracy or AUC translate into effects on metrics of user satisfaction, such as click through rates? This can be measured in a small scale A/B experiment using an intentionally degraded model.
Test the impact of each tunable hyperparameter. Methods such as a grid search [6] or a more sophisticated hyperparameter search strategy [7] not only improve predictive performance, but also can uncover hidden reliability issues. For example, it can be surprising to observe the impact of massive increases in data parallelism on model accuracy.
Test the effect of model staleness. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? All models need to be updated eventually to account for changes in the external world; a careful assessment is important to guide such decisions.
Test against a simpler model as a baseline. Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.
Test model quality on important data slices. Slicing a data set along certain dimensions of interest provides fine-grained understanding of model performance. For example, important slices might be users by country or movies by genre. Examining sliced data avoids having fine-grained performance issues masked by a global summary metric.
Test the model for implicit bias. This may be viewed as an extension of examining important data slices, and may reveal issues that can be root-caused and addressed. For example, implicit bias might be induced by a lack of sufficient diversity in the training data.
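To make the first of these data / feature tests concrete: a sketch of feature-distribution checks in pytest style, assuming a features DataFrame is provided by a fixture; the feature names and thresholds mirror the examples above.

# test_feature_distributions.py - illustrative feature distribution checks
import pandas as pd

def test_feature_a_takes_values_one_to_five(features: pd.DataFrame):
    # Feature A is expected to take on values 1 to 5.
    assert features["feature_a"].between(1, 5).all()

def test_feature_b_top_two_values_share(features: pd.DataFrame):
    # The two most common values of Feature B should account for >= 10% of rows.
    top_two_share = features["feature_b"].value_counts(normalize=True).head(2).sum()
    assert top_two_share >= 0.10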
ML Infrastructure tests
Test the reproducibility of training. Train two models on the same data, and observe any differences in aggregate metrics, sliced metrics, or example-by-example predictions. Large differences due to non-determinism can exacerbate debugging and troubleshooting.
Unit test model specification code. Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Useful assertions include testing that training results in decreased loss and that a model can restore from a checkpoint after a mid-training job crash.
Integration test the full ML pipeline. A good integration test runs all the way from original data sources, through feature creation, to training, and to serving. An integration test should run both continuously as well as with new releases of models or servers, in order to catch problems well before they reach production.
Test model quality before attempting to serve it. Useful tests include testing against data with known correct outputs and validating the aggregate quality, as well as comparing predictions to a previous version of the model.
Test that a single example or training batch can be sent to the model, and changes to internal state can be observed from training through to prediction. Observing internal state on small amounts of data is a useful debugging strategy for issues like numerical instability.
Test models via a canary process before they enter production serving environments. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. This includes testing that a model can be loaded into the production serving binaries and perform inference on production input data at all. It also includes a canary process, in which a new version is tested on a small trickle of live data.
Test how quickly and safely a model can be rolled back to a previous serving version. A model “roll back” procedure is useful in cases where upstream issues might result in unexpected changes to model quality. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system.
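A sketch of the training-reproducibility test described above: train twice on the same data with the same seed and compare example-by-example predictions; train_model and load_training_data are hypothetical helpers.

# test_reproducibility.py - illustrative training-reproducibility check
import numpy as np
from model import train_model, load_training_data  # hypothetical project modules

def test_training_is_reproducible():
    X, y = load_training_data()
    model_a = train_model(X, y, seed=42)
    model_b = train_model(X, y, seed=42)
    # Predictions should match (or be within a small tolerance).
    np.testing.assert_allclose(model_a.predict(X), model_b.predict(X), rtol=1e-6)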
Monitoring tests
Test for upstream instability in features, both in training and serving. Upstream instability can create problems both at training and serving (inference) time. Training time instability is especially problematic when models are updated or retrained frequently. Serving time instability can occur even when the models themselves remain static. As examples, what alert would fire if one datacenter stops sending data? What if an upstream signal provider did a major version upgrade?
Test that data invariants hold in training and serving inputs. For example, test if Feature A and Feature B should always have the same number of non-zero values in each example, or that Feature C is always in the range (0, 100) or that class distribution is about 10:1.
Test that your training and serving features compute the same values. The codepaths that actually generate input features may differ for training and inference time, due to tradeoffs for flexibility vs. efficiency and other concerns. This is sometimes called “training/serving skew” and requires careful monitoring to detect and avoid.
Test for model staleness. For models that continually update, this means monitoring staleness throughout the training pipeline, to be able to determine in the case of a stale model where the pipeline has stalled. For example, if a daily job stopped generating an important table, what alert would fire?
Test for NaNs or infinities appearing in your model during training or serving. Invalid numeric values can easily crop up in your learning model, and knowing that they have occurred can speed diagnosis of the problem.
Test for dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage. The computational performance (as opposed to predictive quality) of an ML system is often a key concern at scale, and should be monitored via specialized regression testing. Dramatic regressions and slow regressions over time may require different kinds of monitoring.
Test for regressions in prediction quality on served data. For many systems, monitoring for nonzero bias can be an effective canary for identifying real problems, though it may also result from changes in the world.
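Finally, a sketch in the same spirit for two of the monitoring tests above (data invariants and NaN/infinity detection) on a batch of serving inputs; the feature names and ranges are placeholders.

# monitor_invariants.py - illustrative data-invariant and NaN checks on serving inputs
import numpy as np
import pandas as pd

def check_serving_batch(batch: pd.DataFrame):
    alerts = []
    # Invariant: feature_c must stay within the range (0, 100).
    if not batch["feature_c"].between(0, 100, inclusive="neither").all():
        alerts.append("feature_c out of expected range (0, 100)")
    # Invariant: no NaNs or infinities anywhere in the numeric columns.
    numeric = batch.select_dtypes(include=[np.number])
    if not np.isfinite(numeric.to_numpy()).all():
        alerts.append("NaN or infinity detected in serving inputs")
    return alerts  # a non-empty list should fire an alert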