CD in Machine Learning Systems

This talk focuses on the techniques, metrics and different tests (code, model, infra and features/data) that help developers of machine learning systems achieve CD.

CD in Machine Learning Systems

  1. 1. CD in Machine Learning systems Juan López @juaneto
  2. 2. Goals and structure
  3. 3. Continuous deployment What it is and why everybody wants it Idea Develop Deploy in prod
  4. 4. Continuous deployment What it is and why everybody wants it Idea Develop Deploy in prod ● New features on the fly.
  5. 5. Continuous deployment What it is and why everybody wants it Idea Develop Deploy in prod ● New features on the fly. ● Quality goes up (smaller changes).
  6. 6. Continuous deployment What it is and why everybody wants it Idea Develop Deploy in prod ● New features on the fly. ● Quality goes up (smaller changes). ● Faster development.
  7. 7. Continuous deployment What it is and why everybody wants it Idea Develop Deploy in prod ● New features on the fly. ● Quality goes up (smaller changes). ● Faster development. ● Experimentation.
  8. 8. Continuous deployment Idea Develop Deploy in prod What it is and why everybody wants it ● New features on the fly. ● Quality goes up (smaller changes). ● Faster development. ● Experimentation. ● Innovation.
  9. 9. So… we want to reduce the gap between having a new idea and having that idea running in production.
  10. 10. Machine learning Where do we use it? Not only hype
  11. 11. Machine learning Where do we use it? Not only hype ● Image recognition ● Recommendations ● Predictions ● etc.
  12. 12. Machine learning What is it?
  13. 13. Machine learning What is it? ● Subset of artificial intelligence.
  14. 14. Machine learning What is it? ● Subset of artificial intelligence. ● Statistical models that systems use to effectively perform a specific task.
  15. 15. Machine learning What is it? ● Subset of artificial intelligence. ● Statistical models that systems use to effectively perform a specific task. ● It doesn't use explicit instructions, relying on patterns and inference instead.
  16. 16. So… we want to reduce the gap between having a new idea and having that idea running in production.
  17. 17. How do we achieve CD?
  18. 18. How do we achieve CD? "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction", Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley (Google, Inc.), 2017.
  19. 19. How do we achieve CD?
  20. 20. How do we achieve CD?
  21. 21. Machine Learning Systems
  22. 22. Code | Data | Model | Production Monitoring
  23. 23. Code (Code | Data | Model | Production Monitoring)
  24. 24. Code Apply the best practices for writing your code. Code is always code
  25. 25. Code Apply the best practices for writing your code. Code is always code ● Not only model. Complex systems.
  26. 26. Code Apply the best practices for writing your code. Code is always code ● Not only model. Complex systems. ● Extreme programming.
  27. 27. Code Apply the best practices for writing your code. Code is always code ● Not only model. Complex systems. ● Extreme programming. ● Quality gates.
  28. 28. Code Apply the best practices for writing your code. Code is always code ● Not only model. Complex systems. ● Extreme programming. ● Quality gates. ● Feature toggles.
  29. 29. Code Apply the best practices for writing your code. Code is always code ● Not only model. Complex systems. ● Extreme programming. ● Quality gates. ● Feature toggles. ● Test Pyramid, from top to bottom: manual session-based testing, automated GUI tests, automated API tests, automated component tests, automated integration tests, automated unit tests. * Vishal Naik (ThoughtWorks Insights)
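To make the "code is always code" point concrete, here is a minimal sketch of the kind of fast, automated unit test that sits at the base of that pyramid and runs on every commit; the scale_minmax feature helper is a hypothetical example, not something from the talk.

```python
# test_features.py: hypothetical unit tests for a feature-engineering helper.
# Fast, automated unit tests like these form the base of the test pyramid.
import math


def scale_minmax(values):
    """Hypothetical helper: scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if math.isclose(lo, hi):
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_scale_minmax_bounds():
    scaled = scale_minmax([3.0, 7.0, 11.0])
    assert min(scaled) == 0.0 and max(scaled) == 1.0


def test_scale_minmax_constant_input():
    # Degenerate input must not crash with a division by zero.
    assert scale_minmax([5.0, 5.0]) == [0.0, 0.0]
```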
  30. 30. Code pipeline: Build → Test (continuous integration) → Acceptance test → Deploy to staging (continuous delivery) → Deploy to prod → Smoke test (continuous deployment)
  31. 31. Unlike in traditional software systems, the "behavior of ML systems is not specified directly in code but is learned from data".
  32. 32. Unlike in traditional software systems, the "behavior of ML systems is not specified directly in code but is learned from data". So our tests depend on the data sets used to train our models.
  33. 33. Data (Code | Data | Model | Production Monitoring)
  34. 34. Data pipeline
  35. 35. Ingest
  36. 36. Ingest ● Data lake
  37. 37. Ingest ● Data lake ● Know your sources. Data Catalog.
  38. 38. Ingest ● Data lake ● Know your sources. Data Catalog. ● Have a schema. Govern your data.
  39. 39. Ingest ● Data lake ● Know your sources. Data Catalog. ● Have a schema. Govern your data. ● Watch for silent failures.
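One way to catch silent ingest failures is to validate every incoming batch against the declared schema before it lands in the data lake. A minimal sketch, assuming a hypothetical batch with user_id, event_ts and amount columns:

```python
# Hypothetical ingest-time schema check: fail loudly instead of letting a
# malformed batch slip silently into the data lake.
import pandas as pd

EXPECTED_SCHEMA = {                 # assumed columns and dtypes
    "user_id": "int64",
    "event_ts": "datetime64[ns]",
    "amount": "float64",
}


def validate_batch(df: pd.DataFrame) -> None:
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    if (df["amount"] < 0).any():    # assumed business rule
        raise ValueError("Negative amounts in batch")
```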
  40. 40. Data wrangling/munging
  41. 41. Data wrangling/munging ● Datamart (not data warehouse).
  42. 42. Data wrangling/munging ● Datamart (not data warehouse). ● Be careful with data cooking: if your features are bad, everything is bad.
  43. 43. Data wrangling/munging ● Datamart (not data warehouse). ● Be careful with data cooking: if your features are bad, everything is bad. ● Data cleaning.
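In the same spirit, a hedged sketch of a cleaning step feeding the datamart; the key columns and the 99.9th-percentile clipping threshold are illustrative assumptions, not from the talk:

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: duplicates, missing keys, extreme outliers."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["user_id", "amount"])      # assumed key columns
    # Clip extreme outliers rather than silently keeping them.
    upper = df["amount"].quantile(0.999)
    return df.assign(amount=df["amount"].clip(upper=upper))
```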
  44. 44. Get training data
  45. 45. Get training data ● Data scientists. Make their life easier.
  46. 46. Get training data ● Data scientists. Make their life easier. ● Big data. Importance-weighted sampling.
  47. 47. Get training data ● Data scientists. Make their life easier. ● Big data. Importance-weighted sampling. ● Data security.
  48. 48. Get training data ● Data scientists. Make their life easier. ● Big data. Importance-weighted sampling. ● Data security. ● Versioning data.
  49. 49. Get training data ● Data scientists. Make their life easier. ● Big data. Importance-weighted sampling. ● Data security. ● Versioning data. ● Training/Serving Skew.
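Two of these points fit in a few lines of code: importance-weighted sampling when the full table is too big to train on, and versioning the resulting training set so a model can always be traced back to its data. The column names and weights below are assumptions:

```python
import hashlib

import pandas as pd


def sample_training_data(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    # Importance-weighted sampling: rare positives (assumed "label" column)
    # are drawn with ten times the probability of negatives.
    weights = df["label"].map({1: 10.0, 0: 1.0})
    return df.sample(n=n, weights=weights, random_state=seed)


def dataset_version(df: pd.DataFrame) -> str:
    # Cheap data versioning: hash a canonical serialisation of the sample.
    payload = df.to_csv(index=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```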
  50. 50. "All models are wrong". Common aphorism in statistics.
  51. 51. "All models are wrong". Common aphorism in statistics. "All models are wrong, some are useful". George Box.
  52. 52. "All models are wrong". Common aphorism in statistics. "All models are wrong, some are useful". George Box. "All models are wrong, some are useful for a short period of time". TensorFlow's team.
  53. 53. Model (Code | Data | Model | Production Monitoring)
  54. 54. First of all
  55. 55. First of all ● Design & evaluate the reward function.
  56. 56. First of all ● Design & evaluate the reward function. ● Define errors & failure.
  57. 57. First of all ● Design & evaluate the reward function. ● Define errors & failure. ● Ensure mechanisms for user feedback.
  58. 58. First of all ● Design & evaluate the reward function. ● Define errors & failure. ● Ensure mechanisms for user feedback. ● Try to tie model changes to a clear metric of the subjective user experience.
  59. 59. First of all ● Design & evaluate the reward function. ● Define errors & failure. ● Ensure mechanisms for user feedback. ● Try to tie model changes to a clear metric of the subjective user experience. ● Objective vs. many metrics.
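One way to read "objective vs. many metrics" is to optimise a single objective while only watching several secondary metrics around it. A sketch under that assumption (the metric choices are illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss, precision_score, roc_auc_score


def evaluate(y_true, y_prob, threshold=0.5):
    """One optimisation objective plus secondary metrics that are only watched."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "objective_auc": roc_auc_score(y_true, y_prob),  # the number we optimise
        "log_loss": log_loss(y_true, y_prob),            # watched, not optimised
        "precision": precision_score(y_true, y_pred),    # watched, not optimised
    }
```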
  60. 60. Model pipeline
  61. 61. Code new model candidate
  62. 62. Code new model candidate ● Code is code.
  63. 63. Code new model candidate ● Code is code. ● Run tests in your pipeline.
  64. 64. Code new model candidate ● Code is code. ● Run tests in your pipeline. ● New version of the model.
  65. 65. Training model
  66. 66. Training model ● Feature engineering (unbalanced data, unknown unknowns, etc.).
  67. 67. Training model ● Feature engineering (unbalanced data, unknown unknowns, etc.). ● Be critical with your features: data dependencies cost more than code dependencies.
  68. 68. Training model ● Feature engineering (unbalanced data, unknown unknowns, etc.). ● Be critical with your features: data dependencies cost more than code dependencies. ● Training/serving skew.
  69. 69. Training model ● Feature engineering (unbalanced data, unknown unknowns, etc.). ● Be critical with your features: data dependencies cost more than code dependencies. ● Training/serving skew. ● Deterministic training dramatically simplifies testing and debugging.
  70. 70. Training model ● Feature engineering (unbalanced data, unknown unknowns, etc.). ● Be critical with your features: data dependencies cost more than code dependencies. ● Training/serving skew. ● Deterministic training dramatically simplifies testing and debugging. ● Tune hyperparameters.
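A minimal sketch of the last two bullets: pin the random seeds so training is deterministic, and tune hyperparameters with a plain grid search. The model family and parameter grid are assumptions:

```python
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

SEED = 42
random.seed(SEED)       # deterministic training: pin every source of randomness
np.random.seed(SEED)


def train(X, y):
    """Hyperparameter tuning over a small, assumed grid; reproducible by seed."""
    grid = GridSearchCV(
        RandomForestClassifier(random_state=SEED),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        scoring="roc_auc",
        cv=5,
    )
    grid.fit(X, y)
    return grid.best_estimator_, grid.best_params_
```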
  71. 71. Model competition [Diagram: candidate models Model 1, Model 2, …, Model n compete against the model currently in PRODUCTION (Model in PRO)]
  72. 72. Model performance
  73. 73. Model performance ● Test performance with production data.
  74. 74. Model performance ● Test performance with production data. ● Check your reward functions and failures, e.g. the ROC curve.
  75. 75. Model performance ● Test performance with production data. ● Check your reward functions and failures, e.g. the ROC curve. ● Be careful. Satisfy a baseline of quality in all data slices.
  76. 76. Model performance ● Test performance with production data. ● Check your reward functions and failures, e.g. the ROC curve. ● Be careful. Satisfy a baseline of quality in all data slices. ● Baseline of accuracy.
  77. 77. Model performance ● Test performance with production data. ● Check your reward functions and failures, e.g. the ROC curve. ● Be careful. Satisfy a baseline of quality in all data slices. ● Baseline of accuracy. ● Feedback loop.
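The "baseline of quality in all data slices" point can be enforced directly in the pipeline: compute the reward metric per slice and fail the candidate model if any slice drops below the baseline. A sketch with an assumed slice column and AUC threshold:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

AUC_BASELINE = 0.75      # assumed minimum quality gate


def check_slices(results: pd.DataFrame, slice_col: str = "country") -> None:
    """Fail the pipeline if any data slice falls below the AUC baseline."""
    for value, part in results.groupby(slice_col):
        if part["y_true"].nunique() < 2:
            continue     # AUC is undefined for single-class slices
        auc = roc_auc_score(part["y_true"], part["y_score"])
        if auc < AUC_BASELINE:
            raise AssertionError(f"{slice_col}={value}: AUC {auc:.3f} < {AUC_BASELINE}")
```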
  78. 78. Model champion [Diagram: Model 1 emerges as the champion and replaces the model currently in PRODUCTION (Model in PRO)]
  79. 79. Deploy champion model
  80. 80. Deploy champion model ● Shadow traffic.
  81. 81. Deploy champion model ● Shadow traffic. ● Test the models with real data.
  82. 82. Deploy champion model ● Shadow traffic. ● Test the models with real data. ● Canary releases.
  83. 83. Deploy champion model ● Shadow traffic. ● Test the models with real data. ● Canary releases. ● A/B tests.
  84. 84. Deploy champion model ● Shadow traffic. ● Test the models with real data. ● Canary releases. ● A/B tests. ● Rollbacks.
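A hedged sketch of how canary releasing a champion model can look at the serving layer; the request and model interfaces are hypothetical, and shadow traffic is described in the docstring rather than implemented:

```python
import random


def route(request, champion, challenger, canary_fraction=0.05):
    """Hypothetical canary routing: a small slice of live traffic hits the
    new model, and a bad canary can be rolled back by setting the fraction
    to zero. Shadow traffic would instead score every request with the
    challenger, log its prediction, and still return the champion's answer."""
    if random.random() < canary_fraction:
        return {"model": "challenger", "prediction": challenger.predict(request)}
    return {"model": "champion", "prediction": champion.predict(request)}
```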
  85. 85. Monitoring ...because shit happens
  86. 86. Monitoring (Code | Data | Model | Production Monitoring)
  87. 87. Monitoring
  88. 88. Monitoring ● Create a dashboard with clear and useful information.
  89. 89. Monitoring ● Create a dashboard with clear and useful information. ● Schema changes.
  90. 90. Monitoring ● Create a dashboard with clear and useful information. ● Schema changes. ● Infra monitoring (training speed, serving latency, RAM usage, etc).
  91. 91. Monitoring ● User feedback.
  92. 92. Monitoring ● User feedback. ● Stale models.
  93. 93. Monitoring ● User feedback. ● Stale models. ● Feedback loop.
  94. 94. Monitoring ● User feedback. ● Stale models. ● Feedback loop. ● Errors (model, APIs, etc.).
  95. 95. Monitoring ● User feedback. ● Stale models. ● Feedback loop. ● Errors (model, APIs, etc.). ● Silent failures.
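Stale models and silent failures often show up first as a drift in the score distribution. A minimal sketch of such a check, using the population stability index as an assumed drift measure and threshold:

```python
import numpy as np


def prediction_drift(train_scores, live_scores, bins=10, threshold=0.2):
    """Compare the live score distribution against the training-time one.

    Returns the population stability index (PSI) and whether it crossed the
    assumed alert threshold, which would suggest a stale model or a silently
    broken upstream feed."""
    edges = np.histogram_bin_edges(train_scores, bins=bins)
    expected, _ = np.histogram(train_scores, bins=edges)
    actual, _ = np.histogram(live_scores, bins=edges)
    expected = np.clip(expected / expected.sum(), 1e-6, None)
    actual = np.clip(actual / actual.sum(), 1e-6, None)
    psi = float(np.sum((actual - expected) * np.log(actual / expected)))
    return psi, psi > threshold
```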
  96. 96. Conclusions
  97. 97. ● Code is always code ● Objective-driven modeling ● Know your data ● Clear metrics for complex systems
  98. 98. Juan López @juaneto Thank you
