Training and deploying ML models with Google Cloud Platform
In this presentation, Maciej presented some approaches, good practices, and Google Cloud components that we use at Sotrender to effectively train and deploy our machine learning models, which are used to analyze Social Media data. Maciej also discussed which aspects of DevOps we focus on when developing machine learning models (MLOps), and how these ideas can be easily implemented in your company or startup using Google Cloud Platform.
Presentation by Maciej Pieńkosz from Sotrender at Data Science Summit 2020
1. Training and deploying ML models with
Google Cloud Platform
Maciej Pieńkosz
Data Science Summit 2020
2. ML in Sotrender
• Sotrender – a platform for analyzing communication on your
Social Media
• Our models:
– Sentiment
– Hatespeech
– Topic modelling
– Keyphrase extractor
– NER (brands and products)
– Image Tagger
– Logo Detector
– ….
4. Modeling with AI Notebooks
1. AI Platform Notebooks is used for initial data exploration and modeling
2. It lets us quickly start working on a new problem without worrying about infrastructure
3. To start, we favor faster, simpler model architectures that can be easily built,
validated, iterated, and eventually deployed (usually on CPU)
4. Experiment tracking: MLflow
https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html
https://cloud.google.com/ai-platform-notebooks?hl=id
5. Structuring training code
• Notebooks disadvantages:
– You pay for the whole time the notebook is running
– Code quality is usually lower
– Hard to parametrize, unit test, and review
• After initial experimentation phase, we try to give more structure to
the model training code:
– Refactor codebase to Python packages and modules and move
to git repository (Gitlab)
– Add tests (more on it later)
– Wrap code into a Docker container
– Use dedicated AI Platform Training service to train in the cloud
https://www.jeremyjordan.me/ml-projects-guide/
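One way the "refactor and parametrize" step above can look in practice is a plain Python entry point with command-line arguments instead of notebook cells; the training logic here is a placeholder, and the flag names are illustrative:

```python
import argparse


def train(lr: float, epochs: int) -> dict:
    # Placeholder for the real training loop; returns a fake loss curve.
    history = [1.0 / (lr * (epoch + 1)) for epoch in range(epochs)]
    return {"final_loss": history[-1]}


def parse_args(argv=None):
    # Parametrized entry point: easy to test, review, and run in a container.
    parser = argparse.ArgumentParser(description="Train sentiment model")
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=3)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(train(args.lr, args.epochs))
```

Because `parse_args` accepts an explicit argument list, the entry point can be unit-tested without spawning a process.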
6. AI Platform Training with custom containers
• Advantages:
– Develop locally, train in the cloud
– We are billed only for the actual training time
– Broad hardware configuration options (e.g. GPU type)
– Job statuses and logs for historical runs are available in the dashboard
– Support for hyperparameter tuning almost out of the box
Training job Dockerfile
Cloud training script
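Submitting a custom-container training job is typically done with the `gcloud ai-platform jobs submit training` command. A small hypothetical helper assembling that invocation (the bucket, project, and region values are placeholders):

```python
from datetime import datetime


def build_submit_command(job_dir: str, image_uri: str, region: str = "europe-west1"):
    # Hypothetical helper: builds the gcloud CLI call for a custom-container
    # training job. Job names must be unique, so a timestamp is appended.
    job_name = "train_" + datetime.utcnow().strftime("%Y%m%d_%H%M%S")
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region", region,
        "--master-image-uri", image_uri,
        "--job-dir", job_dir,
    ]
```

The resulting list can be passed to `subprocess.run` from a deployment script, keeping job submission reproducible and reviewable.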
7. Google Storage for Models and Datasets
• We use Google Storage as the primary store for models
and datasets
• One bucket per model
• We follow strict, unified bucket, directory and file
structure, same for every model
– Raw data
– Combined datasets, with predefined splits
– Model files
• Documentation in Knowledge Base (Confluence)
• Dedicated systems such as DVC or Quilt can also be used
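A strict, unified per-model layout like the one above can be captured in a small helper so every model's bucket is built the same way (the bucket naming scheme and directory names here are hypothetical):

```python
def model_paths(model_name: str, version: str) -> dict:
    # Hypothetical layout: one bucket per model, same structure everywhere.
    bucket = f"gs://models-{model_name}"
    return {
        "raw_data": f"{bucket}/raw_data/",
        "datasets": f"{bucket}/datasets/{version}/",   # combined, pre-split datasets
        "model_files": f"{bucket}/models/{version}/",  # trained model artifacts
    }
```

Centralizing the convention in code keeps training and serving jobs from drifting apart in where they read and write artifacts.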
8. Model deployment
• Your options:
– Online
– Batch (offline)
• Our approach is to deploy models as services
– Easy to integrate
– Easy to use by other teams
• We serve them as REST services with Flask (or, most
recently, FastAPI)
• We wrap them in Docker containers so they can be
easily deployed to the cloud and served with Cloud Run
https://mlinproduction.com/batch-inference-vs-online-inference/
Online inference
Batch inference
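A minimal sketch of such a Flask-based model service; the model itself is replaced by a trivial stub, and the route and payload shape are illustrative:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_sentiment(text: str) -> str:
    # Stub standing in for a real sentiment model (assumption for illustration).
    return "positive" if "good" in text.lower() else "negative"


@app.route("/predict", methods=["POST"])
def predict():
    # Online inference: one JSON request in, one prediction out.
    payload = request.get_json(force=True)
    return jsonify({"sentiment": predict_sentiment(payload["text"])})
```

Packaged in a Docker image, the same app can run locally for development and on Cloud Run in production.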
9. Cloud Deployment: Cloud Run
• We use Cloud Run to deploy our model services
• Cloud Build for delegating build process to GCP
• GCP has a dedicated service for serving models, AI
Platform Prediction, but we use Cloud Run
– It is more flexible for us; we can set up any
environment and add any dependencies
– AI Platform Prediction has limits regarding model
size
– We can add additional endpoints (e.g.
/explain) to services
Service Dockerfile
Cloud deployment script
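Deployment to Cloud Run is a single `gcloud run deploy` call. A hypothetical helper assembling it (service name, image, and region are placeholders):

```python
def build_deploy_command(service: str, image_uri: str, region: str = "europe-west1"):
    # Hypothetical helper: builds the gcloud CLI call that deploys a
    # container image as a managed Cloud Run service.
    return [
        "gcloud", "run", "deploy", service,
        "--image", image_uri,
        "--region", region,
        "--platform", "managed",
    ]
```

As with training jobs, keeping the invocation in a script makes deployments repeatable and easy to wire into CI/CD.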
11. Delivery pipeline automation (CI/CD)
• Configured for the models that we use on bigger, production scale
• Implemented in Gitlab CI/CD
Pipeline stages: push → download files → build image → run tests → run static analysis → push image to registry → code review → canary rollout → deploy
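In GitLab CI/CD, a pipeline like the one sketched above lives in a `.gitlab-ci.yml` file. A hypothetical fragment (job names, images, and scripts are illustrative only; `$CI_REGISTRY_IMAGE` and `$CI_COMMIT_SHA` are built-in GitLab CI variables):

```yaml
stages:
  - build
  - test
  - release
  - deploy

build-image:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .

run-tests:
  stage: test
  script:
    - docker run $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA pytest

push-image:
  stage: release
  script:
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

canary-deploy:
  stage: deploy
  when: manual   # manual gate approximating the review / canary rollout step
  script:
    - gcloud run deploy model-service --image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA --region europe-west1
```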
12. Monitoring
• System-level metrics:
– Resource consumption (RAM, CPU), healthchecks, status codes, latency, etc.
• Data-level metrics:
– Prediction distributions, input data distributions
– Model performance against real-time labels (collected automatically or manually)
https://mlinproduction.com/
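Comparing prediction distributions over time can be done with a simple drift check. A minimal pure-Python sketch (the distance threshold and label names are illustrative):

```python
from collections import Counter


def label_distribution(predictions):
    # Normalize raw prediction labels into a probability distribution.
    total = len(predictions)
    return {label: count / total for label, count in Counter(predictions).items()}


def total_variation(p, q):
    # Total variation distance between two label distributions:
    # 0.0 means identical, 1.0 means completely disjoint.
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)
```

A monitoring job might compute this between a reference window (e.g. last month) and the current window, and alert when the distance crosses a threshold.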
13. Streamlit
• https://www.streamlit.io/
• An easy tool for creating simple web data products directly in Python
• You can use it to create demos, share your work, showcase your model's behaviour, and debug
• Very intuitive; no web development skills required
https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace