How to deploy machine learning models to
production (frequently and safely)
2
hello pycon
David Tan
@davified
Developer @ ThoughtWorks
3
About us
@thoughtworks
https://www.thoughtworks.com/intelligent-empowerment
1. First, a story about all
of us...
5
6
Temperature check: who has...
● trained a ML model before?
● deployed a ML model for fun?
● deployed a ML model at work?
● an automated deployment pipeline for ML models?
7
The million-dollar question
How can we reliably and repeatably take our models
from our laptop to production?
8
What today’s talk is about
Share principles and practices that can
make it easier for teams to iteratively deploy better ML
products
Share about what to strive towards, and
how to strive towards it
9
Standing on the shoulders of giants
● @jezhumble
● @davefarley77
● @mat_kelcey
● @codingnirvana
● @kief
10
The stack for today’s demo
11
Demo
2. Why deploy
frequently and safely?
14
Why deploy?
Until the model is in production,
it creates value for no one except ourselves
15
● Iteratively improve our model (training with new {data, hyperparameters, features})
● Correct any biases
● Model decay
● If it’s hard, do it more often
Why deploy frequently?
16
Why deploy safely?
One of these things is not like the others
17
Why deploy safely?
● ML models affect decisions that impact lives… in real-time
● Hippocratic oath for us: Do no harm.
● Safety enables us to iteratively improve ML products that better serve
people
18
Machine learning is only one part of the problem/solution
Source: Hidden Technical Debt in Machine Learning Systems (Google, 2015)
[Diagram: Finding the right business problem to solve → Collecting data / data engineering → Training ML models → Deploying and monitoring ML models (focus of this talk)]
19
Goal of today’s talk
[Diagram: Notebook / playground :-( → PROD :-) (maybe), via a Continuous Delivery loop: Experiment / Develop → commit and push → Test → Deploy → Monitor]
4. So, how do we get there?
Challenges (and solutions from Continuous Delivery practices)
21
Our story’s main characters
Mario the data scientist
Luigi the engineer
local → PROD
Key concept: CI/CD Pipeline
[Pipeline diagram: Local env → push → Version control → (trigger) Run unit tests → Train and evaluate model → Deploy candidate model to STAGING → (manual trigger) Deploy model to PROD, with feedback flowing back at every stage; models are stored in a model repository, and training reads from a data / feature repository]
Source: Continuous Delivery (Jez Humble, Dave Farley)
local → PROD
#1: Automated configuration management
Challenge
● Snowflake (dev)
environments
● “Works on my machine!”
Solution
● Single-command setup
● Version control all dependencies, configuration
Benefits
● Enable experimentation by all teammates
● Production-like environment == discover potential
deployment issues early on
Pipeline: dev
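As one hedged illustration of "version control all dependencies, configuration", a small Python check can fail the build when the running environment drifts from a pinned requirements.txt. This is a sketch, not part of the demo; it assumes Python 3.8+ (for importlib.metadata) and exact package==version pins.

# check_env.py: fail fast when the environment drifts from the pinned requirements.
# Assumes Python 3.8+ (importlib.metadata) and exact package==version pins.
from importlib.metadata import PackageNotFoundError, version

def check_pinned(requirements_path="requirements.txt"):
    problems = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # only exact pins are checked in this sketch
            name, expected = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                problems.append(f"{name} is not installed (expected {expected})")
                continue
            if installed != expected:
                problems.append(f"{name}: installed {installed}, pinned {expected}")
    return problems

if __name__ == "__main__":
    issues = check_pinned()
    if issues:
        raise SystemExit("Environment drift detected:\n" + "\n".join(issues))
    print("Environment matches the pinned requirements.")

Running something like this as the first step of a single-command setup script surfaces "works on my machine" problems before any training starts.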
24
#1: Automated environment configuration management (Demo)
local → PROD
#2: Test pyramid
Solution
● Testing strategy
● Test every method
Benefits
● Fast feedback
● Safety harness allows team to boldly try new things /
refactor
Challenge
● How can I ensure my
changes haven’t broken
anything?
● How can I enforce the
“goodness” of our
models?
[Test pyramid: unit tests → narrow/broad integration tests → ML metrics tests (automated), with manual tests at the top]
Pipeline: dev
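To make the unit-test and ML-metrics layers of the pyramid concrete, here is a minimal pytest-style sketch. It is illustrative only: build_features is a made-up feature function, scikit-learn's iris dataset stands in for the project's own data, and the 0.9 threshold is an arbitrary agreed number, not something from the demo.

# test_pyramid_sketch.py: illustrative tests; build_features is a made-up feature
# function and the iris dataset stands in for the project's own data.
import pytest
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def build_features(raw_price, rolling_mean):
    # Example feature: relative deviation from a rolling mean.
    return (raw_price - rolling_mean) / rolling_mean

def test_build_features_handles_known_values():
    # Unit-test layer: fast, no data files or trained model needed.
    assert build_features(110.0, 100.0) == pytest.approx(0.1)

def test_model_meets_minimum_accuracy():
    # ML-metrics-test layer: slower, guards the "goodness" of the model.
    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    assert model.score(X_val, y_val) >= 0.9  # the threshold here is an agreed, illustrative number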
28
#2: Test pyramid (Demo)
local → PROD
#3: Continuous integration (CI) pipeline for automated testing
Solution
● CI/CD pipeline: automates unit tests → train → test →
deploy (to staging)
● Every code change is tested (assuming tests exist)
● Source code as the only source of software/models
Benefits
● Fast feedback
Challenge
● Not everyone runs the tests; “goodness” checks
are done manually.
● We could deploy {bugs,
errors, bad models} to
production
Pipeline: dev → VCS → unit tests → train & test
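A real CI server (GoCD, Jenkins, GitLab CI, and so on) expresses these stages in its own configuration format. As a tool-agnostic sketch of the same idea, the stage ordering can be outlined in Python; the commands below are placeholders, not the demo's actual scripts.

# ci_pipeline_sketch.py: a tool-agnostic stage runner; the commands are placeholders
# for whatever your project actually uses (pytest, a training script, gcloud, ...).
import subprocess
import sys

STAGES = [
    ("unit tests",        ["python", "-m", "pytest", "tests/unit"]),
    ("train & evaluate",  ["python", "train_and_evaluate.py"]),
    ("deploy to staging", ["./deploy.sh", "staging"]),
]

def run_pipeline():
    for name, command in STAGES:
        print(f"=== stage: {name} ===")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast: later stages never see a change that broke an earlier one.
            sys.exit(f"Stage '{name}' failed; stopping the pipeline.")
    print("All stages passed; the candidate model is in staging.")

if __name__ == "__main__":
    run_pipeline()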
30
#3: CI pipeline (Demo)
local → PROD
#4: Artifact versioning
Challenge
● How can we revert to
previous models?
● Retraining == time-consuming
● Manual renaming/redeployments of old models (if we still have them)
Solution
● Build your binaries once
● Each artifact is tagged with metadata (training data,
hyperparameters, datetime)
Benefits
● Save on build times
● Confidence in artifact increases down the pipeline
● Metadata enables reproducibility
Pipeline: dev → VCS → unit tests → train & test → version artifact
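A lightweight way to tag each artifact with metadata is to write a small JSON file alongside the serialized model. The sketch below assumes joblib for serialization; the field names are illustrative rather than a fixed schema.

# save_versioned_model.py: a minimal sketch of "build once, tag with metadata".
# joblib and the metadata fields are illustrative choices, not a fixed schema.
import hashlib
import json
import os
from datetime import datetime, timezone

import joblib

def save_model_with_metadata(model, training_data_path, hyperparameters, out_dir="artifacts"):
    os.makedirs(out_dir, exist_ok=True)
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    model_path = os.path.join(out_dir, f"model-{version}.joblib")
    joblib.dump(model, model_path)

    with open(training_data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    metadata = {
        "version": version,
        "model_path": model_path,
        "training_data": {"path": training_data_path, "sha256": data_hash},
        "hyperparameters": hyperparameters,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(os.path.join(out_dir, f"model-{version}.json"), "w") as f:
        json.dump(metadata, f, indent=2)
    return model_path, metadata

The same metadata later answers "which data and hyperparameters produced the model currently in production?" without retraining anything.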
local → PROD
#5: Continuous delivery (CD) pipeline for automated deployment
Solution
● Automated deployments triggered by pipeline
● Single-command deployment to staging/production
● Eliminate manual deployments
Benefits
● More rehearsal == More confidence
● Disaster recovery: (single-command) deployment of last
good model in production
Challenge
● Deployments are scary
● Manual deployments ==
potential for mistakes
Pipeline: dev → VCS → unit tests → train & test → version artifact → deploy-staging
33
#5: CD pipeline for automated deployment (Demo)
# Deploy model (the actual model)
gcloud beta ml-engine versions create \
  $VERSION_NAME --model $MODEL_NAME \
  --origin $DEPLOYMENT_SOURCE \
  --runtime-version=1.5 \
  --framework $FRAMEWORK \
  --python-version=3.5
34
#5: CD pipeline for automated deployment (Demo)
# Deploy to prod
gcloud ml-engine versions set-default \
  $version_to_deploy_to_prod \
  --model=$MODEL_NAME
local → PROD
#6: Canary releases + monitoring
Solution
● Request shadowing pattern (credit: @codingnirvana)
Benefits
● Confidence increases along the pipeline, backed by metrics
● Monitoring in production == Important source of feedback
Challenge
● How can I know if I’m
deploying a better /
worse model?
● Deployment to
production may not
work as expected
Pipeline: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod
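The request shadowing pattern can be sketched as a thin serving layer that always answers from the production model but also sends a copy of each request to the canary and logs both predictions for offline comparison. Flask and the two predict functions below are assumptions for illustration, not the demo's actual serving stack.

# shadowing_sketch.py: answer from the production model, shadow each request to the canary.
# Flask and both predict functions are illustrative assumptions.
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
log = logging.getLogger("shadowing")

def predict_production(features):
    return sum(features)         # placeholder for the model serving live traffic

def predict_canary(features):
    return sum(features) * 1.01  # placeholder for the candidate model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    live_prediction = predict_production(features)
    try:
        shadow_prediction = predict_canary(features)
        # Only the live prediction is returned; the shadow result is logged so the two
        # models can be compared offline before the canary is promoted.
        log.info("live=%s canary=%s features=%s", live_prediction, shadow_prediction, features)
    except Exception:
        log.exception("Canary failed; the user response is unaffected.")
    return jsonify({"prediction": live_prediction})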
36
#6: Canary releases + monitoring (Demo)
ML App
local → PROD
#7: Start simple (tracer bullet)
Solution
● Start with simple model + simple features
● Create solid pipeline first
● But, not simpler than what is required (and, don’t take
expensive shortcuts)
Benefits
● Discover integration issues/requirements sooner
● Demonstrate working software to stakeholders in less time
Challenge
● Complex models == longer time to develop / debug
● Getting all the “right” features == weeks / months
Pipeline: dev
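A tracer-bullet model can be almost trivially simple; the point is that it exercises every stage of the pipeline end to end. A hedged sketch, with a made-up price series and scikit-learn's DummyRegressor standing in for a real model.

# tracer_bullet_sketch.py: the simplest model that can flow through the whole pipeline.
# The price series is made up; DummyRegressor stands in for a real (later) model.
import numpy as np
from sklearn.dummy import DummyRegressor

def train_tracer_model(prices):
    # One trivial feature, one naive model: predict the mean of the observed prices.
    X = np.arange(len(prices)).reshape(-1, 1)
    return DummyRegressor(strategy="mean").fit(X, prices)

if __name__ == "__main__":
    model = train_tracer_model([100.0, 101.5, 99.8, 102.2])
    print(model.predict([[4]]))  # one boring, end-to-end prediction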
38
#7: Start simple (tracer bullet) (Demo)
Pipeline (demo): dev → run-unit-tests → train-and-evaluate-model → deploy
local → PROD
#8: Collect more and better data with every release
Solution
● Think about how you can collect labels (immediately or
eventually) after serving predictions (credit: @mat_kelcey)
● Create bug reports for clients
● Complete the data pipeline cycle
● Caution: watch out for attempts to game your ML system
Benefits
● More and better data. Nuff said.
Challenge
● Data collection is hard
● Garbage in, garbage out
Pipeline: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod → deploy-prod
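Closing the data loop usually starts with logging every served prediction under an identifier, so that the true outcome can be joined back in later as a label. A minimal sketch; the CSV files and column layout are assumptions, not the demo's storage.

# feedback_loop_sketch.py: log predictions now, attach observed labels later.
# The CSV files and column layout are illustrative assumptions.
import csv
import json
import uuid
from datetime import datetime, timezone

PREDICTIONS_LOG = "predictions.csv"

def log_prediction(features, prediction):
    prediction_id = str(uuid.uuid4())
    with open(PREDICTIONS_LOG, "a", newline="") as f:
        csv.writer(f).writerow(
            [prediction_id, datetime.now(timezone.utc).isoformat(),
             json.dumps(features), prediction]
        )
    return prediction_id  # hand this back to the caller / store it with the request

def record_outcome(prediction_id, observed_label, outcomes_path="outcomes.csv"):
    # Later (immediately or eventually) the real outcome arrives and becomes a label.
    with open(outcomes_path, "a", newline="") as f:
        csv.writer(f).writerow([prediction_id, observed_label])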
local → PROD
#9: Build cross-functional teams
Solution
● Build cross functional teams (data scientist, data engineer,
software engineer, UX, BA)
Benefits
● Fewer nails (because not everyone is a hammer)
● Improve empathy + reduce silos == productivity
Challenge
● How can we do all of the
above?
Pipeline: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod → deploy-prod
local → PROD
#10: Kaizen mindset
Solution
● Kaizen == 改善 == change for better
● Go through deployment health checklists as a team
Benefits
● Iteratively get to good
Challenge
● How can we do all of the
above?
Pipeline: dev → VCS → unit tests → train & test → version artifact → deploy-staging → deploy-canary-prod → deploy-prod
43
#10: Kaizen - Health checklists
❏ General software engineering practices
❏ Source control (e.g. git)
❏ Unit tests
❏ CI pipeline to run automated tests
❏ Automated deployments
❏ Data / feature-related tests
❏ Test all code that creates input features, both in training and serving
❏ ...
❏ Model-related tests
❏ Test against a simpler model as a baseline
❏ ...
Source: A rubric for ML production systems (Google, 2016)
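The "test against a simpler model as a baseline" item can be automated as a test that blocks the pipeline when the candidate fails to beat a trivial model. Iris and the two models below are stand-ins for the project's own evaluation data, baseline, and candidate.

# test_against_baseline.py: one checklist item expressed as an executable test.
# Iris, DummyClassifier and LogisticRegression are stand-ins for the project's own
# evaluation data, simple baseline, and candidate model.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_candidate_beats_simple_baseline():
    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # If this fails, the extra complexity of the candidate is not paying for itself.
    assert candidate.score(X_val, y_val) > baseline.score(X_val, y_val)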
44
#10: Kaizen - Health checks
● How much calendar time to deploy a model from staging to production?
● How much calendar time to add a new feature to the production model?
● How comfortable does your team feel about iteratively deploying
models?
45
Conclusion
A generalizable approach for deploying ML models frequently and safely
[Pipeline diagram: Local env → push → Version control → (trigger) Run unit tests → Train and evaluate model → Deploy candidate model to STAGING → (manual trigger) Deploy model to PROD, with feedback flowing back at every stage; models are stored in a model repository, and training reads from a data / feature repository]
Credit: Continuous Delivery (Jez Humble, Dave Farley)
48
Solve the right problem
We don’t have a machine learning problem.
We have a {business, data, software delivery, ML, UX}
problem
49
Solve the right problem
01 Data collection
02 Machine learning
03 Deployment and monitoring ← focus of today’s talk
50
How to deploy models to prod {frequently, safely, repeatably, reliably}?
1. Automate configuration management
2. Think about your test pyramid
3. Set up a continuous integration (CI) pipeline
4. Version your artifacts (i.e. models)
5. Automate deployments
6. Try canary releases
7. Start simple (tracer bullet)
8. Collect more and better data with every release
9. Build cross-functional teams
10. Kaizen / continuous improvement
THANK YOU
52
We’re hiring!
● Software Developers
(>= junior-level devs
welcome)
● UX Designer
● Senior Information
Security Consultant
53
Resources for further reading
● Visibility and monitoring for machine learning (12-min video)
● Using continuous delivery with machine learning models to tackle fraud
● What’s your ML Test Score? A rubric for ML production systems (Google)
● Rules of Machine Learning (Google)
● Continuous Delivery (Jez Humble, Dave Farley)
● Why you need to improve your training data and how to do it
Backup materials /
miscellaneous stuff
This section is for placing any slides / ideas that may eventually make it to the
actual presentation
55
Detailed outline
In the talk, we will show how we constructed our CI/CD/data pipelines, which consists of the following tasks
- Data pipeline
- Get data
- Transform/preprocess data
- Write to feature “repository”
- Local/dev
- Flesh out dev workflow. How can devs experiment / train / debug models?
- ML
- Get a slice of data
- Train model
- Evaluate model
- Web service
- CI - build and test stage
- Train and evaluate model on more data
- CI - deploy stage
- If tests pass, automatically deploy/promote artifact to staging
- Artifact should contain metadata that can help devs decide whether this new model is better than the older model that’s
in production (e.g. precision, accuracy, RMSE, training data-related metadata)
- Manual (one-click) deploy to production
- CI - Monitoring
- Monitor model's predicted values against real bitcoin values (and against existing model in production)
- Canary deployments / dark launches / request shadowing
- Kill switch / rollback: rollback to last known “good” model
58
TODOs
● Goal of presentation
○ How to continuously, quickly and safely deploy machine learning models in production
○ Patterns for deploying ML models
■ Data to read → Model to train → artefacts to promote → deploy → monitoring
● Format
○ Run each step manually
○ Talk about what each step is doing / trying to achieve
○ Demo the “CI” version (Just Push)
● Build demo app
● Collect learnings
59
Sketch out deployment pipeline
● Simple version (train val deploy)
● More complicated version (+ AB testing)
Principles first
Supported with tools and tech
60
Target audience of the talk:
● ML enthusiasts; people who’ve been training/evaluating ML models in jupyter notebooks but
who cannot go beyond that because (i) they can’t get data or (ii) they’re not familiar with
deploying web services
61
Potentially useful libraries
● ModelDB: Model repository + monitoring tool
62
Deployment checklist (link)
This checklist should result in scripts/procedures needed to reliably and repeatedly deploy the
application into the production environment
• The steps required to deploy the application for the first time
• How to smoke-test the application and any services it uses as part of the deployment process
• The steps required to back out the deployment should it go wrong
• The steps required to back up and restore the application’s state
• The steps required to upgrade the application without destroying the application’s state
• The steps to restart or redeploy the application should it fail
• The location of the logs and a description of the information they contain
• The methods of monitoring the application
• The steps to perform any data migrations that are necessary as part of the release
• An issue log of problems from previous deployments, and their solutions
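The smoke-test item often ends up as a tiny script run immediately after each deployment. A hedged sketch using the requests library; the URL, payload, and response shape are assumptions about a generic prediction service, not the demo's API.

# smoke_test_sketch.py: a post-deployment smoke test; the URL, payload and response
# shape are assumptions about a generic prediction service.
import sys

import requests

def smoke_test(base_url):
    response = requests.post(
        f"{base_url}/predict",
        json={"features": [1.0, 2.0, 3.0]},
        timeout=10,
    )
    assert response.status_code == 200, f"unexpected status: {response.status_code}"
    assert "prediction" in response.json(), "response is missing the 'prediction' field"

if __name__ == "__main__":
    smoke_test(sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8080")
    print("Smoke test passed.")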
63
Deployment checklist: II
• An asset and configuration management strategy.
• A description of the technology used for deployment. This should be agreed upon by both the operations and
development teams.
• A plan for implementing the deployment pipeline.
• An enumeration of the environments available for acceptance, capacity, integration, and user acceptance testing,
and the process by which builds will be moved through these environments.
• Requirements for monitoring the application, including any APIs or services the application should use to notify
the operations team of its state.
• Description of the integration with any external systems. At what stage and how are they tested as part of a
release? How do the operations personnel communicate with the provider in the event of a problem?
• Details of logging so that operations personnel can determine the application’s state and identify any error
conditions.
• The service-level agreements for the software, which will determine whether the application will require
techniques like failover and other high-availability strategies.
• How the initial deployment to production works.
❏ General software engineering practices
❏ Source control (e.g. git)
❏ Unit tests
❏ CI/CD pipeline
❏ Run automated tests
❏ Automated, or one-step manual deployments
❏ An ability to conduct experiments comparing different system versions
❏ Data / feature-related tests
❏ Test that the distributions of each feature match your expectations
❏ Test that a model does not contain any features that have been manually
determined as unsuitable for use
❏ Test that your system maintains privacy controls across its entire data
pipeline
❏ Test all code that creates input features, both in training and serving
❏ Model-related tests
64
Model deployment readiness checklist (source)
65
Model deployment readiness checklist (source)
❏ ML Infrastructure tests
❏ Test the reproducibility of training
❏ Unit test model specification code
❏ Integration test the full ML pipeline
❏ Test model quality before attempting to serve it
❏ Test that a single example or training batch can be sent to the model, and changes to internal state can
be observed from training through to prediction
❏ Test models via a canary process before they enter production serving environments
❏ Test how quickly and safely a model can be rolled back to a previous serving version
❏ Monitoring tests
❏ Test for upstream instability in features, both in training and serving.
❏ Test that data invariants hold in training and serving inputs.
❏ Test that your training and serving features compute the same values (i.e. training-serving skew)
❏ Test for model staleness
❏ Test for NaNs or infinities appearing in your model during training or serving
❏ Test for dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage
❏ Test for regressions in prediction quality on served data
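The training-serving skew item is one of the easier ones to automate: run the training-time and serving-time feature code on the same raw record and assert they agree. A sketch; both feature builders below are hypothetical stand-ins for the real code paths.

# test_training_serving_skew.py: a sketch; both feature builders are hypothetical
# stand-ins for the real training pipeline and serving code paths.
import math

def build_features_training(record):
    return [record["price"] / record["rolling_mean"], math.log(record["volume"])]

def build_features_serving(record):
    return [record["price"] / record["rolling_mean"], math.log(record["volume"])]

def test_training_and_serving_features_compute_the_same_values():
    raw_record = {"price": 105.0, "rolling_mean": 100.0, "volume": 2500.0}
    training_features = build_features_training(raw_record)
    serving_features = build_features_serving(raw_record)
    for train_value, serve_value in zip(training_features, serving_features):
        assert math.isclose(train_value, serve_value), (training_features, serving_features)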
66
Model deployment readiness checklist (source)
67
Source: https://www.safaribooksonline.com/library/view/strata-data-conference/9781492025955/video318956.html
Editor’s Notes
  1. I’m David and here’s Ramsey, and we’re going to share about how you can deploy ML models to production frequently and safely. Note to self: A talk is more about telling a story around a topic Changing people's perspective Inspiring them to try something else and giving them the tools for that.” Empathize with audience. Don’t preach
  2. Note: use “we”, rather than “you”. We got an idea (e.g. NLP sentiment analysis), followed an ML tutorial, and built a model. Then we were asked to deploy it. (click) “You want me to .. what?” We’re bombarded with questions: How do I deploy? How do I load new data? How do I call .predict() without hitting shift+enter? How do I vectorize user input strings before passing them to the model? We’re stumped. We don’t know where to start. We give up.
  3. Before we go on, we want to take a quick temperature check
  4. Bear this question in mind throughout the talk
  5. Most of these are not ideas that Ramsey and I thought of. They are practices that these smart folks came up with, and that have been tried and tested at our clients.
  6. We built a sample app: what it does, why we chose this stack / data source, and how you can use it. To make this tangible, we had to pick a stack. But focus on the patterns, not our implementation.
  7. We built a demo so that we would have code to illustrate some points, but we ran out of time. So for the last few points, we'll talk about the concepts and how we would implement them.
  8. Just read the title. Don’t talk too much here.
  9. Use fraud detection as an example. Share about tracer bullet idea here
  10. In other programming languages / frameworks, when we build something, we can share a link on Twitter and the rest of the world can use it. In ML, in my experience, people just share screenshots of the loss curve (insert picture) or some object detection bounding boxes (insert pictures). This is the problem facing many of us today. We have tons of ML tutorials for local environments / Jupyter notebooks, but very little (or nothing) about serving those models or the continuous delivery/evolution of these models. Until something is in production, it creates value for no one except ourselves.
  11. Model decay (our model can get stale / dangerous). Deploying frequently allows us to make iterative improvements to our model (training with new {data, hyperparameters, features}).
  12. Cars, phones, and IKEA chairs go through multiple rounds of testing. Why should ML models be any different? The irony is that ML has already started to impact all of our lives, yet testing and safety are things we rarely talk about in ML.
  13. ML models affect decisions that impact lives… in real-time Safety is essential
  14. Goal of today’s talk (in pictures)
  15. “OK, David, I’m sold on why this frequent and safe deployment thing is important. But what does it look like in practice?”
  16. CI/CD pipeline: the main vehicle for everything we’re sharing today. It’s all about feedback. 30 seconds: quick overview of this. The model goes through different stages, and each of them solves a different problem, which we’ll talk about next. Generalizable approach: we can see it working for classifiers, regression models, deep learning models, NLP models, etc.
  17. Snowflake: every dataset is unique, non-reproducible, hand-cleaned with TLC.
  18. Challenge: brittle glue code in ML. Unit tests: at the lower levels, check edge cases and add more tests for all of that; at the higher levels, check the happy path and integration.
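As an example of those low-level tests, feature-creation helpers (the typical glue code) are cheap to unit test against edge cases. A minimal sketch with pytest, assuming a hypothetical clean_title() preprocessing function that is not part of the demo; its module path and behaviour are assumptions.

```python
# test_feature_engineering.py: illustrative unit tests; clean_title() is a hypothetical helper
import pytest

from my_project.features import clean_title  # hypothetical: lowercases, strips punctuation and whitespace


def test_clean_title_happy_path():
    assert clean_title("  Hello, World! ") == "hello world"


@pytest.mark.parametrize("edge_case", ["", "   ", None])
def test_clean_title_handles_empty_and_missing_values(edge_case):
    # Edge cases: empty strings, whitespace-only strings, and missing values all normalise to ""
    assert clean_title(edge_case) == ""
```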
  19. Skip if people get CI pipeline
  20. Deployment: provisioning, configuration, and deploying your app.
  21. Tracer bullet: deploying a simple thing is easier than deploying a complex thing. Focus on deploying first. Focus on the deployment pipeline. Don’t get distracted. We can come back to tuning models later.
  22. Benefits: monitoring is an important source of feedback. Find out when models are getting stale / dangerous. LIME: Local Interpretable Model-Agnostic Explanations. Caveat: monitoring ML metrics can be challenging because labels take time to arrive.
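For the LIME point above, a minimal sketch of explaining a single prediction with the lime package; the toy classifier, generated dataset, and feature names below are placeholders for illustration rather than the talk's fraud-detection demo.

```python
# explain_prediction.py: illustrative use of LIME on a toy classifier; features/classes are assumptions
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the served model: any fitted classifier with predict_proba works
X_train, y_train = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=["amount", "merchant_risk", "account_age_days", "num_prior_chargebacks"],  # hypothetical
    class_names=["legit", "fraud"],
    mode="classification",
)

# Explain one prediction (in practice: a live request flagged as suspicious by monitoring)
explanation = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(explanation.as_list())  # top feature contributions for this single prediction
```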
  23. Training-serving skew: where the data seen at serving time differs in some way from the data used to train the model, leading to reduced prediction quality.
  24. Talk about just the first bullet
  25. Pyception (Anaconda 2018 video) - a battle between data scientists and software engineers
  26. Generalizable approach: we can see it working for classifiers, regression models, deep learning models, NLP models, etc.
  27. Data / feature-related tests:
    • Test that the distributions of each feature match your expectations. One example might be to test that Feature A takes on values 1 to 5, or that the two most common values of Feature B are "Harry" and "Potter" and they account for 10% of all values. This test can fail due to real external changes, which may require changes in your model.
    • Test that a model does not contain any features that have been manually determined as unsuitable for use. A feature might be unsuitable when it’s been discovered to be unreliable, overly expensive, etc. Tests are needed to ensure that such features are not accidentally included (e.g. via copy-paste) into new models.
    • Test that your system maintains privacy controls across its entire data pipeline. While strict access control is typically maintained on raw data, ML systems often export and transform that data during training. Test to ensure that access control is appropriately restricted across the entire pipeline.
    • Test all code that creates input features, both in training and serving. It can be tempting to believe feature creation code is simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital.
    Model-related tests:
    • Test that every model specification undergoes a code review and is checked in to a repository.
    • Test the relationship between offline proxy metrics and the actual impact metrics. For example, how does a one-percent improvement in accuracy or AUC translate into effects on metrics of user satisfaction, such as click-through rates? This can be measured in a small-scale A/B experiment using an intentionally degraded model.
    • Test the impact of each tunable hyperparameter. Methods such as a grid search [6] or a more sophisticated hyperparameter search strategy [7] not only improve predictive performance, but also can uncover hidden reliability issues. For example, it can be surprising to observe the impact of massive increases in data parallelism on model accuracy.
    • Test the effect of model staleness. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? All models need to be updated eventually to account for changes in the external world; a careful assessment is important to guide such decisions.
    • Test against a simpler model as a baseline. Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost-to-benefit tradeoffs of more sophisticated techniques.
    • Test model quality on important data slices. Slicing a data set along certain dimensions of interest provides fine-grained understanding of model performance. For example, important slices might be users by country or movies by genre. Examining sliced data avoids having fine-grained performance issues masked by a global summary metric.
    • Test the model for implicit bias. This may be viewed as an extension of examining important data slices, and may reveal issues that can be root-caused and addressed. For example, implicit bias might be induced by a lack of sufficient diversity in the training data.
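One of the model-related checks in these notes, testing model quality on important data slices, can be written as a plain test against the evaluation output. A minimal sketch, assuming a hypothetical CSV of per-row labels and predictions with a country column produced by the evaluation step; the path, column names, and accuracy floor are illustrative.

```python
# test_model_quality_by_slice.py: illustrative slice test; the evaluation file and columns are assumptions
import pandas as pd
from sklearn.metrics import accuracy_score

MINIMUM_SLICE_ACCURACY = 0.70  # arbitrary floor for illustration


def test_accuracy_holds_on_every_country_slice():
    # In a real pipeline this file would be produced by the evaluation step of the CI/CD pipeline
    results = pd.read_csv("reports/evaluation_results.csv")  # columns: country, label, prediction
    for country, slice_df in results.groupby("country"):
        slice_accuracy = accuracy_score(slice_df["label"], slice_df["prediction"])
        # A per-slice floor keeps a global summary metric from masking problems on one slice
        assert slice_accuracy >= MINIMUM_SLICE_ACCURACY, (
            f"accuracy on slice country={country} dropped to {slice_accuracy:.2f}"
        )
```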
  28. (Same notes as the previous slide.)
  29. ML Infrastructure tests:
    • Test the reproducibility of training. Train two models on the same data, and observe any differences in aggregate metrics, sliced metrics, or example-by-example predictions. Large differences due to non-determinism can exacerbate debugging and troubleshooting.
    • Unit test model specification code. Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Useful assertions include testing that training results in decreased loss and that a model can restore from a checkpoint after a mid-training job crash.
    • Integration test the full ML pipeline. A good integration test runs all the way from original data sources, through feature creation, to training, and to serving. An integration test should run both continuously as well as with new releases of models or servers, in order to catch problems well before they reach production.
    • Test model quality before attempting to serve it. Useful tests include testing against data with known correct outputs and validating the aggregate quality, as well as comparing predictions to a previous version of the model.
    • Test that a single example or training batch can be sent to the model, and changes to internal state can be observed from training through to prediction. Observing internal state on small amounts of data is a useful debugging strategy for issues like numerical instability.
    • Test models via a canary process before they enter production serving environments. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. This includes testing that a model can be loaded into the production serving binaries and perform inference on production input data at all. It also includes a canary process, in which a new version is tested on a small trickle of live data.
    • Test how quickly and safely a model can be rolled back to a previous serving version. A model “roll back” procedure is useful in cases where upstream issues might result in unexpected changes to model quality. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system.
    Monitoring tests:
    • Test for upstream instability in features, both in training and serving. Upstream instability can create problems both at training and serving (inference) time. Training time instability is especially problematic when models are updated or retrained frequently. Serving time instability can occur even when the models themselves remain static. As examples, what alert would fire if one datacenter stops sending data? What if an upstream signal provider did a major version upgrade?
    • Test that data invariants hold in training and serving inputs. For example, test if Feature A and Feature B should always have the same number of non-zero values in each example, or that Feature C is always in the range (0, 100), or that the class distribution is about 10:1.
    • Test that your training and serving features compute the same values. The codepaths that actually generate input features may differ for training and inference time, due to tradeoffs for flexibility vs. efficiency and other concerns. This is sometimes called “training/serving skew” and requires careful monitoring to detect and avoid.
    • Test for model staleness. For models that continually update, this means monitoring staleness throughout the training pipeline, to be able to determine in the case of a stale model where the pipeline has stalled. For example, if a daily job stopped generating an important table, what alert would fire?
    • Test for NaNs or infinities appearing in your model during training or serving. Invalid numeric values can easily crop up in your learning model, and knowing that they have occurred can speed diagnosis of the problem.
    • Test for dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage. The computational performance (as opposed to predictive quality) of an ML system is often a key concern at scale, and should be monitored via specialized regression testing. Dramatic regressions and slow regressions over time may require different kinds of monitoring.
    • Test for regressions in prediction quality on served data. For many systems, monitoring for nonzero bias can be an effective canary for identifying real problems, though it may also result from changes in the world.
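Several of the monitoring checks in these notes (data invariants, NaNs/infinities) can be expressed as one small check function shared by the training and serving code paths, which also helps guard against training/serving skew. A minimal sketch with pandas and numpy; the column names and ranges mirror the examples in the notes and are otherwise illustrative.

```python
# data_invariants.py: illustrative invariant checks shared by training and serving; columns are assumptions
import numpy as np
import pandas as pd


def check_invariants(batch: pd.DataFrame) -> None:
    """Raise if a batch of inputs (training or serving) violates the expected invariants."""
    # No NaNs or infinities anywhere in the numeric columns
    numeric = batch.select_dtypes(include=[np.number])
    assert np.isfinite(numeric.to_numpy()).all(), "found NaN or infinite values"

    # feature_c stays within its expected 0 to 100 range
    assert batch["feature_c"].between(0, 100).all(), "feature_c out of expected range"

    # feature_a and feature_b are non-zero together in each row (a simple per-example invariant)
    assert ((batch["feature_a"] != 0) == (batch["feature_b"] != 0)).all(), (
        "feature_a/feature_b invariant broken"
    )
```

Calling the same function from both the training pipeline and the serving code path means an upstream data change trips the same alarm in both places.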