Continuous Intelligence: Moving Machine Learning into Production Reliably

A workshop by Danilo Sato, Christoph Windheuser, Emily Gorcenski, and Arif Wider, given at Strata Data Conference 2019 in London.

Abstract:
So you want to include a machine learning component in your IT systems? The process is a little more involved than clicking through an AI tutorial on your laptop. It's not just the first working model that you need to consider; you also need to think about integration, scaling, and testing. What's more, post-launch, you'll want to continuously adapt your model to respond to the changing environment.

ThoughtWorks pioneered continuous delivery—a set of tools and processes that ensure that software under development can be reliably released to production at any time and with high frequency.

Danilo Sato and Christoph Windheuser demonstrate how to apply continuous delivery to machine learning—what’s known as continuous intelligence. In a live scenario, you’ll change a machine learning model in a development environment, test its new performance, and, depending on the outcome, automatically deploy the new model into a production environment. The tech stack for this scenario will be Python, DVC (Data Science Version Control), and GoCD.


Slides:

  1. Continuous Intelligence: Moving Machine Learning Applications into Production Reliably. Christoph Windheuser, Danilo Sato, Emily Gorcenski, Arif Wider, ThoughtWorks Inc. A workshop on why and how to apply Continuous Delivery to Machine Learning (CD4ML). ©ThoughtWorks 2019, Strata Data Conference London, April 30, 2019
  2. Structure of Today's Workshop: Introduction to the Topic; Exercise 1: Setup; Exercise 2: Deployment Pipeline; Break; Exercise 3: ML Pipeline; Exercise 4: Tracking Experiments; Exercise 5: Model Monitoring
  3. 5000+ technologists with 40 offices in 14 countries. Partner for technology-driven business transformation. join.thoughtworks.com
  4. #1 in Agile and Continuous Delivery. 100+ books written.
  6. TECHNIQUES: Continuous delivery for machine learning (CD4ML) models, blip #8, Assess.
  7. Continuous Intelligence Cycle (diagram).
  8. PRODUCTIONIZING ML IS HARD. CD4ML isn't a technology or a tool; it is a practice and a set of principles. Quality is built into software, and improvement is always possible. But machine learning systems have unique challenges: unlike deterministic software, it is difficult, or even impossible, to understand the behavior of data-driven intelligent systems. This poses a huge challenge when it comes to deploying machine learning systems in accordance with CD principles. Production systems should be: reproducible, testable, auditable, continuously improving. How do we apply decades of software delivery experience to intelligent systems?
  9. PRODUCTIONIZING ML IS HARD (continued). Production systems should be reproducible, testable, auditable, and continuously improving, but machine learning is non-deterministic, hard to test, hard to explain, and hard to improve.
  10. MANY SOURCES OF CHANGE: Data + Model + Code. Data: schema, sampling over time, volume, ... Model: research and experiments, training on new data, performance, ... Code: new features, bug fixes, dependencies, ... (Icons created by Noura Mbarki and I Putu Kharismayadi from Noun Project.)
  11. "Continuous Delivery is the ability to get changes of all types — including new features, configuration changes, bug fixes and experiments — into production, or into the hands of users, safely and quickly in a sustainable way." (Jez Humble & Dave Farley)
  12. PRINCIPLES OF CONTINUOUS DELIVERY: create a repeatable, reliable process for releasing software; automate almost everything; build quality in; work in small batches; keep everything in source control; done means "released"; improve continuously.
  13. WHAT DO WE NEED IN OUR STACK? Doing CD with Machine Learning is still a hard problem. We need: discoverable and accessible data; version control and artifact repositories; continuous delivery orchestration to combine pipelines; infrastructure for multiple environments and experiments; model performance assessment tracking; model monitoring and observability.
  14. PUTTING EVERYTHING TOGETHER: discoverable and accessible data feeds data science and model building, which consumes training data and produces source code + executables and a model + parameters, all supported by CD tools and repositories.
  15. (adds) Model evaluation against test data.
  16. (adds) Productionize model.
  17. (adds) Integration testing.
  18. (adds) Deployment.
  19. (adds) Monitoring, which feeds production data back into the cycle.
  20. PUTTING EVERYTHING TOGETHER: the complete picture from slide 19.
  21. WHAT WE WILL USE IN THIS WORKSHOP: there are many options for tools and technologies to implement CD4ML.
  22. THE MACHINE LEARNING PROBLEM WE ARE EXPLORING TODAY
  23. A REAL BUSINESS PROBLEM: retail / supply chain. Loss of sales, opportunity cost, stock waste, discounting. Requires accurate demand forecasting. Typical challenges: predictions are inaccurate; development takes a long time; difficult to adapt to the pace of market change.
  24. SALES FORECASTING FOR A GROCERY RETAILER. Task: predict how many of each product will be purchased in each store on a given date. Make predictions based on data from: 4,000 items; 50 stores; 125,000,000 sales transactions; 4.5 years of data.
  25. THE SIMPLIFIED WEB APPLICATION. As a buyer, I want to be able to choose a product and predict how many units the product will sell at a future date.
  26. EXERCISE 1: SETUP. https://github.com/ThoughtWorksInc/continuous-intelligence-workshop. Click on instructions → 1-setup.md and follow the steps to set up your local development environment. User ID assignment sheet: http://bit.ly/cd4ml-strata19
  27. DEPLOYMENT PIPELINE: automates the process of building, testing, and deploying applications to production. Application code lives in a version control repository; a container image is the deployment artifact; the container is deployed to production servers.
  28. GoCD: an open source Continuous Delivery server to model and visualise complex workflows.
  29. ANATOMY OF A GOCD PIPELINE (diagram of a pipeline group).
  30. EXERCISE 2: DEPLOYMENT PIPELINE. https://github.com/ThoughtWorksInc/continuous-intelligence-workshop. Click on instructions → 2-deployment-pipeline.md and follow the steps to set up your deployment pipeline. GoCD URL: https://gocd.cd4ml.net
  31. BUT NOW WHAT? Once your model is in production: How do we retrain the model more often? How do we deploy the retrained model to production? How do we make sure we don't break anything when deploying? How do we make sure that our modeling approach or parameterization is still the best fit for the data? How do we monitor our model "in the wild"?
  32. BASIC DATA SCIENCE WORKFLOW: gather data and extract features; separate into training and validation sets; train the model and evaluate its performance.
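The basic workflow on that slide can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available, with synthetic data standing in for the real sales dataset; a DecisionTreeRegressor stands in for the decision tree model trained later in the workshop.

```python
# Minimal sketch of gather -> split -> train -> evaluate, with synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Gather data and extract features (synthetic stand-in: 2 features, 1 target)
rng = np.random.default_rng(42)
X = rng.random((500, 2))
y = X[:, 0] * 10 + X[:, 1] * 3 + rng.normal(0, 0.1, 500)

# Separate into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model and evaluate performance
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_val, model.predict(X_val))
print(f"validation MAE: {mae:.3f}")
```

With the real data, the "gather and extract features" step would load the sales transactions instead of generating random numbers; the split/train/evaluate shape stays the same.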
  33. SALES FORECAST MODEL TRAINING PROCESS: download_data.py fetches the raw data; splitter.py separates it into training and validation data; decision_tree.py trains the model and saves model.pkl; evaluation.py scores it and writes metrics.json.
  34. CHALLENGE 1: THESE ARE LARGE FILES! (The same pipeline: the raw data, the training/validation data, and model.pkl are large artifacts.)
  35. CHALLENGE 2: AD-HOC MULTI-STEP PROCESS (the same pipeline: each script must be run by hand, in the right order).
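The last two steps of the pipeline on slide 33 produce model.pkl and metrics.json, which can be sketched with nothing but the standard library. A plain dict stands in for the trained estimator and a temporary directory stands in for the repository layout; the real decision_tree.py and evaluation.py scripts would pickle an actual model and compute real scores.

```python
# Sketch of the final pipeline steps: decision_tree.py writes model.pkl,
# evaluation.py writes metrics.json. A dict stands in for the estimator.
import json
import pickle
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp())  # stand-in for the repo's data directory

# decision_tree.py: serialize the trained model
model = {"algorithm": "decision_tree", "max_depth": 5}  # stand-in object
with open(out / "model.pkl", "wb") as f:
    pickle.dump(model, f)

# evaluation.py: score the model on validation data and record the metrics
metrics = {"r2_score": 0.87}  # stand-in for a computed score
with open(out / "metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# A later pipeline stage (or GoCD) can load both artifacts back
with open(out / "model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored["algorithm"])  # decision_tree
```

Because both artifacts are plain files, they are exactly the kind of large, step-produced outputs that the next slide's tool (dvc) is built to version.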
  36. SOLUTION: DVC (Data Science Version Control). dvc is git porcelain for storing large files using cloud storage; dvc connects model training steps to create reproducible workflows. (Diagram: branches master, change-max-depth, and try-random-forest, each versioning decision_tree.py and model.pkl via model.pkl.dvc.)
  37. ANATOMY OF A DVC COMMAND: dvc run -d src/download_data.py -o data/raw/store47-2016.csv python src/download_data.py. This runs a command and creates a .dvc file; the .dvc file points to the dependencies. The output files are versioned and stored in the cloud by running dvc push. When you use the output files (store47-2016.csv) as dependencies for the next step, a dependency between the steps is automatically created. You can re-execute an entire pipeline with one command: dvc repro.
  38. EXERCISE 3: MACHINE LEARNING PIPELINE. https://github.com/ThoughtWorksInc/continuous-intelligence-workshop. Click on instructions → 3-machine-learning-pipeline.md and follow the steps on your local development environment and in GoCD to create your machine learning pipeline.
  39. HOW DO WE TRACK EXPERIMENTS? We need to track the scientific process and evaluate our models: Which experiments and hypotheses are being explored? Which algorithms are being used in each experiment? Which version of the code was used? How long does each experiment take to run? What parameters and hyperparameters were used? How fast are my models learning? How do we compare results from different runs?
  40. MLflow: an open source platform for managing the end-to-end machine learning lifecycle.
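To make the tracking questions on slide 39 answerable, every training run needs to be recorded. The sketch below is a dependency-free stand-in for an experiment tracker: it writes each run's parameters, metrics, and code version to a JSON file, which is roughly the information MLflow persists for you via calls like mlflow.log_param and mlflow.log_metric. The mlruns-sketch directory and the git_sha field are illustrative names, not part of any real tool.

```python
# Dependency-free stand-in for experiment tracking: one JSON file per run.
import json
import time
import uuid
from pathlib import Path

TRACKING_DIR = Path("mlruns-sketch")  # hypothetical local run store

def log_run(params, metrics, git_sha="unknown"):
    """Persist one experiment run so it can be compared with others later."""
    run = {
        "run_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "git_sha": git_sha,   # which version of the code was used
        "params": params,     # parameters and hyperparameters
        "metrics": metrics,   # how well the model performed
    }
    TRACKING_DIR.mkdir(exist_ok=True)
    path = TRACKING_DIR / f"{run['run_id']}.json"
    path.write_text(json.dumps(run, indent=2))
    return path

# Two runs of the same experiment with different hyperparameters
p1 = log_run({"model": "decision_tree", "max_depth": 5}, {"r2": 0.82})
p2 = log_run({"model": "decision_tree", "max_depth": 10}, {"r2": 0.87})

# Comparing results across runs is now just reading the files back
runs = [json.loads(p.read_text()) for p in (p1, p2)]
best = max(runs, key=lambda r: r["metrics"]["r2"])
print(best["params"]["max_depth"])  # 10
```

A real tracker adds a UI on top of exactly this kind of record, which is what Exercise 4 explores with MLflow.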
  41. EXERCISE 4: TRACKING EXPERIMENTS. https://github.com/ThoughtWorksInc/continuous-intelligence-workshop. Click on instructions → 4-tracking-experiments.md and follow the steps to track ML training in MLflow. MLflow URL: https://mlflow.cd4ml.net
  42. HOW TO LEARN CONTINUOUSLY? We need to capture production data to improve our models: track model usage; track model inputs to find training-serving skew; track model outputs; track model interpretability outputs to identify potential bias or overfitting; track model fairness to understand how the model behaves along dimensions that could introduce unfair bias.
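One way to capture the signals listed on that slide is to emit each prediction as a structured log event. This is a minimal stdlib sketch; the field names and the example product and store IDs are made up for illustration. In the workshop's setup, a collector such as Fluentd would ship these JSON lines to Elasticsearch, where Kibana can visualise them.

```python
# Emit each prediction as one structured JSON log line for a log collector.
import json
import datetime

def log_prediction(product_id, store_id, prediction, model_version):
    """Emit one prediction event. Inputs are logged too, so training-serving
    skew can later be detected by comparing them with the training data."""
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,  # which deployed model answered
        "inputs": {"product_id": product_id, "store_id": store_id},
        "output": prediction,
    }
    print(json.dumps(event))  # stdout is picked up by the log collector
    return event

# Hypothetical IDs, purely for illustration
event = log_prediction(product_id=103665, store_id=44,
                       prediction=12.0, model_version="abc123")
```

Logging the model version alongside inputs and outputs is what lets a dashboard compare the behavior of two deployed models side by side.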
  43. EFK STACK: monitoring and observability infrastructure. Fluentd: open source data collector for unified logging. Elasticsearch: open source search engine. Kibana: open source web UI to explore and visualise data.
  44. Kibana: an open source UI that makes it easy to explore and visualise the data indexed in Elasticsearch.
  45. EXERCISE 5: MODEL MONITORING. https://github.com/ThoughtWorksInc/continuous-intelligence-workshop. Click on instructions → 5-model-monitoring.md and follow the steps to log prediction events. Kibana URL: https://kibana.cd4ml.net
  46. SUMMARY: WHAT HAVE WE LEARNED?
  47. CD4ML: Proper data and model versioning tools enable reproducible work to be done in parallel. There is no need to hand-maintain complex data processing and model training scripts. We can then put data science work into a Continuous Delivery workflow. Result: continuous, on-demand AI development and deployment, from research to production, with a single command. Benefit: production AI systems that are always as smart as your data science team.
  48. THANK YOU! Danilo Sato (dsato@thoughtworks.com), Christoph Windheuser (cwindheu@thoughtworks.com), Emily Gorcenski (egorcens@thoughtworks.com), Arif Wider (awider@thoughtworks.com). join.thoughtworks.com
  49. WHAT DO WE NEED IN OUR STACK? Doing CD with Machine Learning is still a hard problem. We need: model performance assessment tracking; business value assessment; version control (for code, models, and data); deployment, monitoring, and logging.
  50. VERSION CONTROL & COLLABORATION. Our code, data, and models should be versioned and shareable without unnecessary work. What are the challenges? Data and models can be very large; data can vary invisibly; data scientists need to share work, and it must be repeatable. What does the ideal solution look like? Large artifacts stored in arbitrary storage, linked to the source repo; data scientists can encode their work process and repeat it in one step. What solutions are out there now? Storage (S3, HDFS, etc.); Git LFS; shell scripts; dvc; Pachyderm; JupyterHub.
  51. MODEL PERFORMANCE TRACKING. We should be able to scale model development to try multiple modeling approaches simultaneously. What are the challenges? Hyperparameter tuning and model selection are hard; tracking performance depends on other moving parts (e.g. the data). What does the ideal solution look like? Links models to specific training sets and parameter sets; is diffable, so runs can be compared; allows visualization of results. What solutions are out there now? dvc; MLflow; no shortage of options here.
