Okay, I have my great model in a notebook, so what next? Most machine learning courses and resources prepare us well to implement machine learning algorithms and build more or less complex models. In most cases, however, the model is only a small part of a larger system, and deploying and maintaining it turns out in practice to be time-consuming and error-prone. The problem grows when there is not one model to productionize but several. Although more tools and platforms for streamlining this process appear every year, the topic still receives relatively little attention.
In my presentation I will describe the approaches, good practices, and Google Cloud Platform tools and services that we use at Sotrender to efficiently train and productionize our ML models, which analyze social media data. I will discuss which aspects of DevOps we pay attention to when building products based on ML models (MLOps), and how they can be easily implemented with Google Cloud Platform in your startup or company.
Presentation by Maciej Pieńkosz of Sotrender at Data Science Summit 2020
3. Our models
1. Sentiment
2. Hatespeech
3. Topic modelling
4. Keyphrase extractor
5. NER (brands and products)
6. Image Tagger
7. Text Extractor
8. Logo Detector
9. Post Classifier
10. ….
4. ML models lifecycle
1. Planning and project setup
2. Data collection and labeling
3. Modeling and exploration
4. Model training and refinement
5. Testing and evaluation
6. Model deployment
7. Ongoing model maintenance and monitoring
https://www.jeremyjordan.me/ml-projects-guide/
5. Modeling with AI Notebooks
1. We use Google Cloud Platform as our cloud provider
2. AI Platform Notebooks is used for initial data exploration and modeling
3. To start, we favor faster, simpler model architectures that can be easily built, validated, iterated on, and eventually deployed (usually on CPU)
4. Experiment tracking: MLflow
https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html
https://cloud.google.com/ai-platform-notebooks?hl=id
6. Structuring training code
• Disadvantages of notebooks:
– You pay for the whole time the notebook is running
– Code quality is usually lower
– Hard to parametrize, unit test, and review
• After the initial experimentation phase, we give the model training code more structure:
– Refactor the codebase into Python packages and modules and move it to a git repository (GitLab)
– Add tests (more on this later)
– Wrap the code in a Docker container
– Use the dedicated AI Platform Training service to train in the cloud
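As a rough illustration of the "hard to parametrize" point, training code that has been moved out of a notebook can expose its settings through a command-line entry point. The sketch below is only an assumption about what such an entry point might look like; the argument names, defaults, and the placeholder train function are hypothetical and not taken from the Sotrender codebase.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse training settings from the command line."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative).")
    parser.add_argument("--data-dir", default="data", help="Directory with the training data")
    parser.add_argument("--model-dir", default="models", help="Where to write model files")
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


def train(data_dir, model_dir, learning_rate, epochs):
    """Placeholder for the real training loop; returns a summary of the run."""
    # The actual model code would live in its own module of the package.
    return {
        "data_dir": data_dir,
        "model_dir": model_dir,
        "learning_rate": learning_rate,
        "epochs": epochs,
    }


def main(argv=None):
    """Entry point: parse the arguments, run training, print a JSON summary."""
    args = parse_args(argv)
    summary = train(args.data_dir, args.model_dir, args.learning_rate, args.epochs)
    print(json.dumps(summary))
    return summary
```

In a package, main() would typically be called from an `if __name__ == "__main__":` guard or a console entry point. Because it accepts an explicit argument list, e.g. main(["--epochs", "3"]), it can also be unit tested and reused as the container's entry point.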
https://www.jeremyjordan.me/ml-projects-guide/
7. AI Platform Training with custom containers
• Advantages:
– Develop locally, train in the cloud
– Pay only for the time of training
– Broad configuration options
– Job statuses and logs for historical runs are available in the dashboard
– Easy integration with hyperparameter tuning
Training job Dockerfile
Cloud training script
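A training job with a custom container is usually submitted through the gcloud CLI. The sketch below only builds the command line in Python and prints it; it is an assumption about how such a script could look, not a copy of the script from the slides. The job name, image URI, and region are made-up values, and the flags (--region, --master-image-uri) should be checked against the gcloud version in use.

```python
import shlex


def build_submit_command(job_name, image_uri, region, trainer_args=()):
    """Build a `gcloud ai-platform jobs submit training` command for a custom container."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region", region,
        "--master-image-uri", image_uri,
    ]
    if trainer_args:
        # Everything after the bare "--" is passed to the container's entry point.
        command.append("--")
        command.extend(trainer_args)
    return command


# Hypothetical values, for illustration only.
command = build_submit_command(
    job_name="sentiment_train_001",
    image_uri="gcr.io/example-project/sentiment-trainer:latest",
    region="europe-west1",
    trainer_args=["--epochs", "3"],
)
print(shlex.join(command))
```

To actually submit the job, the list could be passed to subprocess.run(command, check=True) on a machine where gcloud is installed and authenticated.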
8. Google Storage for Models and Datasets
• We use Google Cloud Storage as the primary store for models and datasets
• One bucket per model
• We follow a unified bucket and directory structure, the same for every model:
– Raw data
– Combined datasets, with predefined splits
– Model files
• Documentation in the knowledge base (Confluence)
• Dedicated systems such as DVC or Quilt are also an option
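One way to keep such a layout consistent is to generate every storage path from a single helper instead of typing paths by hand. The following is a minimal sketch under assumed conventions: the bucket prefix, directory names, and file extension are hypothetical and only mirror the three groups listed above (raw data, combined datasets with splits, model files).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelBucket:
    """Unified storage layout for one model (directory names are assumptions)."""
    model_name: str
    bucket_prefix: str = "example-ml"  # hypothetical naming convention

    @property
    def root(self):
        # One bucket per model.
        return f"gs://{self.bucket_prefix}-{self.model_name}"

    def raw_data(self, filename):
        return f"{self.root}/raw/{filename}"

    def dataset(self, version, split):
        # Combined datasets are stored with predefined splits.
        if split not in ("train", "dev", "test"):
            raise ValueError(f"unknown split: {split}")
        return f"{self.root}/datasets/{version}/{split}.csv"

    def model_file(self, version, filename):
        return f"{self.root}/models/{version}/{filename}"


bucket = ModelBucket("sentiment")
print(bucket.dataset("v1", "train"))
```

Because every model uses the same class, training and serving code can locate data and model files for any model by name alone.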
9. Additional training tips
• Consider having two validation sets, training-dev and test-dev, to distinguish overfitting errors from distribution shift
• Establish human-level performance for your task
• Evaluate model performance on important data slices
• Do hyperparameter tuning; use open source packages, e.g. hyperopt
• Develop a systematic way of analyzing model errors
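Two of the tips above can be made concrete with a few lines of Python. The sketch assumes that training-dev is drawn from the same distribution as the training data (but not trained on) and that test-dev comes from the target distribution; under that assumption the gap between training and training-dev error points to overfitting, and the gap between training-dev and test-dev error points to distribution shift. The slice names, labels, and error values are made up.

```python
from collections import defaultdict


def error_gaps(train_error, training_dev_error, test_dev_error):
    """Split the train-to-test-dev error gap into its two likely sources."""
    return {
        "overfitting": training_dev_error - train_error,
        "distribution_shift": test_dev_error - training_dev_error,
    }


def accuracy_by_slice(examples):
    """Return accuracy per slice for (slice_name, label, prediction) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, label, prediction in examples:
        total[slice_name] += 1
        correct[slice_name] += int(label == prediction)
    return {name: correct[name] / total[name] for name in total}


# Made-up examples: (slice, true label, predicted label).
examples = [
    ("short_posts", "pos", "pos"),
    ("short_posts", "neg", "neg"),
    ("short_posts", "neg", "pos"),
    ("short_posts", "pos", "pos"),
    ("with_emoji", "pos", "neg"),
    ("with_emoji", "neg", "neg"),
]
print(accuracy_by_slice(examples))
print(error_gaps(train_error=0.02, training_dev_error=0.03, test_dev_error=0.10))
```

A model can look fine on aggregate accuracy while one slice lags far behind, which is why the per-slice view is worth reporting next to the overall number.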
Recommended resources:
• https://www.coursera.org/learn/machine-learning-projects
• https://www.deeplearning.ai/machine-learning-yearning/
https://towardsdatascience.com/some-strategies-for-machine-learning-projects-5f2f32c34635
10. Model deployment
• Your options:
– Online
– Batch (offline)
• Our approach is to deploy models as services
– Easy to integrate
– Easy to use by other teams
• We serve them as REST services with Flask (or, most recently, FastAPI)
• We wrap them in Docker containers so they can be easily deployed to the cloud and served with Cloud Run
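The idea of a model exposed as a REST service can be sketched without any dependencies. Note that this example deliberately swaps Flask/FastAPI for the standard library's http.server so that it runs as-is; it is not the framework used in the talk. The /predict path, the JSON shape, and the toy predict function are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text):
    """Stand-in for a real model: a toy keyword-based sentiment rule."""
    return {"label": "positive" if "good" in text.lower() else "negative"}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            text = str(payload["text"])
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected JSON body with a 'text' field")
            return
        body = json.dumps(predict(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port=8080):
    """Create (but do not start) the HTTP server; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling make_server().serve_forever() starts the service locally. Inside a container it would need to bind to 0.0.0.0 and read the port from the environment rather than use the fixed values shown here.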
https://mlinproduction.com/batch-inference-vs-online-inference/
Online inference
Batch inference
11. Cloud Deployment: Cloud Run
• We use Cloud Run to deploy our model services
• We use Cloud Build to delegate the build process to GCP
• GCP has a dedicated service for serving models, AI Platform Prediction, but we use Cloud Run:
– It is more flexible for us: we can set up any environment and add any dependencies
– AI Platform Prediction has limits on model size
– We can add extra endpoints to services (e.g. /explain)
Service Dockerfile
Cloud deployment script
13. Delivery pipeline automation (CI/CD)
• Implemented in GitLab CI/CD
• Pipeline stages shown in the diagram: push, download files, build image, run tests, run static analysis, push image to registry, code review, canary rollout, deploy
14. Testing and evaluation
• Unit and integration tests for:
– Input pipelines
– Preprocessing functions
• “Regression” tests for:
– Performance on validation data
– Predictions on some important, hand-picked examples
– Performance on data slices
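The "regression" tests listed above can be written as ordinary unit tests that run in the CI pipeline. This is a minimal sketch using the standard unittest module; the stand-in model, the hand-picked examples, the validation set, and the accuracy threshold are all hypothetical values chosen so the example runs on its own.

```python
import unittest


def predict_sentiment(text):
    """Stand-in for the trained model under test (hypothetical)."""
    return "positive" if "love" in text.lower() else "negative"


# Hand-picked examples the model must keep getting right (made up here).
GOLDEN_EXAMPLES = [
    ("I love this brand", "positive"),
    ("Worst delivery ever", "negative"),
]

# Small labelled validation set and an agreed minimum accuracy (assumed values).
VALIDATION_SET = [
    ("Love it", "positive"),
    ("We love the new logo", "positive"),
    ("Never again", "negative"),
    ("Terrible support", "negative"),
]
MIN_ACCURACY = 0.75


class ModelRegressionTests(unittest.TestCase):
    def test_golden_examples(self):
        # Each hand-picked example is reported separately if it fails.
        for text, expected in GOLDEN_EXAMPLES:
            with self.subTest(text=text):
                self.assertEqual(predict_sentiment(text), expected)

    def test_validation_accuracy_does_not_regress(self):
        hits = sum(predict_sentiment(text) == label for text, label in VALIDATION_SET)
        self.assertGreaterEqual(hits / len(VALIDATION_SET), MIN_ACCURACY)
```

In a real project the stand-in function would load the candidate model, and the threshold would be set from the performance of the currently deployed version.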
15. Monitoring
• System-level metrics:
– Resource consumption (RAM, CPU), health checks, status codes, latency, etc.
• Data-level metrics:
– Prediction distributions, input data distributions
– System performance against real-time labels (collected automatically or manually)
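Monitoring prediction distributions can start with something as simple as comparing the share of each predicted label in a recent window against a reference window. The sketch below uses total variation distance for that comparison; this metric and the 0.2 alert threshold are assumptions chosen for illustration, not the method described in the talk.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each predicted label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(reference, current):
    """Half the sum of absolute differences between two label distributions."""
    labels = set(reference) | set(current)
    return 0.5 * sum(abs(reference.get(name, 0.0) - current.get(name, 0.0)) for name in labels)


def drift_alert(reference_labels, current_labels, threshold=0.2):
    """Return (alert, distance); alert is True when the distance exceeds the threshold."""
    distance = total_variation_distance(
        label_distribution(reference_labels),
        label_distribution(current_labels),
    )
    return distance > threshold, distance


# Made-up prediction logs: a balanced reference week and a skewed current week.
reference_week = ["pos"] * 50 + ["neg"] * 50
current_week = ["pos"] * 80 + ["neg"] * 20
print(drift_alert(reference_week, current_week))
```

The same comparison can be applied to binned input features to watch for changes in the input data distribution.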
https://mlinproduction.com/
16. Streamlit
• https://www.streamlit.io/
• An easy tool for creating simple web data products directly in Python
• You can use it to create demos, share your work, showcase your models' behaviour, and debug
• Very intuitive, no web skills required
https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace