Machine Learning Model Training and Deployment with Google Cloud Platform

Training and deploying machine learning
models using GCP
Maciej Pieńkosz
Data Science Summit 2020

Our models
1. Sentiment
2. Hatespeech
3. Topic modelling
4. Keyphrase extractor
5. NER (brands and products)
6. Image Tagger
7. Text Extractor
8. Logo Detector
9. Post Classifier
10. ….

ML models lifecycle
1. Planning and project setup
2. Data collection and labeling
3. Modeling and exploration
4. Model training and refinement
5. Testing and evaluation
6. Model deployment
7. Ongoing model maintenance and monitoring
https://www.jeremyjordan.me/ml-projects-guide/

Modeling with AI Notebooks
1. We use Google Cloud Platform as our cloud provider
2. AI Platform Notebooks is used for initial data exploration and modeling
3. To start, we favor faster, simpler model architectures that can be easily built,
validated, iterated on, and eventually deployed (usually on CPU)
4. Experiment tracking: MLflow
https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html
https://cloud.google.com/ai-platform-notebooks?hl=id

Structuring training code
• Notebook disadvantages:
– You pay for the whole time the notebook is running
– Code quality is usually lower
– Hard to parametrize, unit test, and review
• After the initial experimentation phase, we give the model training
code more structure:
– Refactor the codebase into Python packages and modules and move
it to a git repository (GitLab)
– Add tests (more on this later)
– Wrap the code in a Docker container
– Use the dedicated AI Platform Training service to train in the cloud
https://www.jeremyjordan.me/ml-projects-guide/
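To illustrate the move from a notebook to a parametrized, testable module, here is a minimal sketch of a training entry point. The flag names, the `TrainConfig` class, and the placeholder `train` function are assumptions made for illustration; they are not taken from the talk.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical training parameters; the real set depends on the model."""
    data_path: str
    model_dir: str
    learning_rate: float
    epochs: int


def parse_args(argv=None) -> TrainConfig:
    """Turn command-line flags into a typed config object."""
    parser = argparse.ArgumentParser(description="Illustrative training entry point")
    parser.add_argument("--data-path", required=True, help="Location of the training data")
    parser.add_argument("--model-dir", required=True, help="Where to write model files")
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args(argv)
    return TrainConfig(args.data_path, args.model_dir, args.learning_rate, args.epochs)


def train(config: TrainConfig) -> dict:
    """Placeholder for the real training loop; returns a small run summary."""
    return {"epochs_run": config.epochs, "model_dir": config.model_dir}


def main(argv=None) -> dict:
    """Entry point that a container command could call with its arguments."""
    return train(parse_args(argv))
```

Because `parse_args` accepts an explicit argument list, the same code can be called from a unit test, from a local shell, or from a container entry point without modification.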

AI Platform Training with custom containers
• Advantages:
– Develop locally, train in the cloud
– Pay only for the time of training
– Broad configuration options
– Job statuses and logs for historical runs
are available in the dashboard
– Easy integration with hyperparameter
tuning
Training job Dockerfile
Cloud training script

Google Storage for Models and Datasets
• We use Google Cloud Storage as the primary store for models
and datasets
• One bucket per model
• We follow a unified bucket and directory structure,
the same for every model
– Raw data
– Combined datasets, with predefined splits
– Model files
• Documentation in Knowledge Base (Confluence)
• Alternatively, one can use dedicated systems such as DVC or Quilt
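A unified layout is easier to keep consistent when paths are built in one place. The sketch below shows one possible way to do that; the directory names (`raw`, `datasets`, `models`), the split names, and the `ModelBucket` class are hypothetical, since the slide does not give the actual structure.

```python
from dataclasses import dataclass

# Hypothetical split names; the slide only says the splits are predefined.
SPLITS = ("train", "dev", "test")


@dataclass(frozen=True)
class ModelBucket:
    """Builds gs:// URIs for one model's bucket using a single shared layout."""
    bucket: str

    def _uri(self, *parts: str) -> str:
        return "/".join([f"gs://{self.bucket}", *parts])

    def raw_data(self, filename: str) -> str:
        return self._uri("raw", filename)

    def dataset(self, version: str, split: str) -> str:
        if split not in SPLITS:
            raise ValueError(f"unknown split {split!r}, expected one of {SPLITS}")
        return self._uri("datasets", version, split)

    def model_file(self, version: str, filename: str) -> str:
        return self._uri("models", version, filename)
```

For example, `ModelBucket("sentiment").dataset("v1", "train")` yields `gs://sentiment/datasets/v1/train`, and every other model bucket resolves paths the same way.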

Additional training tips
• Consider having two validation sets: training-dev and
test-dev, to distinguish between overfitting errors and
distribution shift
• Establish human performance for your task
• Evaluate your model performance on important data
slices
• Do hyperparameter tuning; use open-source packages,
e.g. hyperopt
• Develop a systematic way of analyzing model errors
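The two-validation-set tip can be made concrete with a small comparison of error rates. This sketch assumes that training-dev is held out from the same distribution as the training data and that test-dev comes from the target distribution; the function name and the `tolerance` threshold are illustrative choices, not values from the talk.

```python
def diagnose(train_error: float, train_dev_error: float,
             test_dev_error: float, tolerance: float = 0.02) -> dict:
    """Split the total error gap into an overfitting part and a shift part."""
    # Same distribution, unseen examples: a large gap suggests overfitting.
    overfitting_gap = train_dev_error - train_error
    # Same model, different distribution: a large gap suggests distribution shift.
    shift_gap = test_dev_error - train_dev_error

    findings = []
    if overfitting_gap > tolerance:
        findings.append("overfitting")
    if shift_gap > tolerance:
        findings.append("distribution shift")
    return {
        "overfitting_gap": overfitting_gap,
        "shift_gap": shift_gap,
        "findings": findings,
    }
```

With errors of 2% on training, 10% on training-dev, and 11% on test-dev, the function reports overfitting; with 2%, 3%, and 15% it reports distribution shift instead.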
Recommended resources:
• https://www.coursera.org/learn/machine-learning-projects
• https://www.deeplearning.ai/machine-learning-yearning/
https://towardsdatascience.com/some-strategies-for-machine-learning-projects-5f2f32c34635
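As a rough picture of what hyperparameter tuning does, the sketch below implements plain random search using only the standard library. It is a stand-in for illustration and is not hyperopt, which offers more sophisticated search strategies; the search space and `toy_objective` are made up for the example.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample parameters at random and keep the set with the lowest loss."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    rng = random.Random(seed)  # seeded so runs are reproducible
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_objective(params):
    """Hypothetical loss surface; a real objective would train and validate a model."""
    return (params["learning_rate"] - 0.1) ** 2 + 0.01 * abs(params["depth"] - 5)


# Each entry maps a parameter name to a sampler that draws one value.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
    "depth": lambda rng: rng.randint(2, 10),
}
```

Calling `random_search(toy_objective, space, n_trials=100)` returns the best parameter set found and its loss; in practice the objective would return a validation metric from an actual training run.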

Model deployment
• Your options:
– Online
– Batch (offline)
• Our approach is to deploy models as services
– Easy to integrate
– Easy to use by other teams
• We serve them as REST services with Flask (or, most
recently, FastAPI)
• We wrap them in Docker containers so they can be
easily deployed to the cloud and served with Cloud Run
https://mlinproduction.com/batch-inference-vs-online-inference/
Figure: online inference vs. batch inference
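The shape of a model-as-a-service can be sketched without any framework. The example below uses the standard library's WSGI support as a stand-in for Flask or FastAPI; the `/predict` and `/health` routes, the JSON payload format, and the keyword-based `predict` function are assumptions for illustration only. Reading the port from the `PORT` environment variable follows the Cloud Run container convention.

```python
import io
import json
import os
from wsgiref.simple_server import make_server


def predict(text: str) -> dict:
    """Hypothetical stand-in for a trained model: a simple keyword rule."""
    return {"label": "positive" if "good" in text.lower() else "negative"}


def _respond(start_response, status: str, payload: dict):
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def app(environ, start_response):
    """WSGI application exposing GET /health and POST /predict."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")
    if method == "GET" and path == "/health":
        return _respond(start_response, "200 OK", {"status": "ok"})
    if method == "POST" and path == "/predict":
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
            text = json.loads(environ["wsgi.input"].read(length))["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            return _respond(start_response, "400 Bad Request",
                            {"error": "expected a JSON object with a 'text' string"})
        return _respond(start_response, "200 OK", predict(text))
    return _respond(start_response, "404 Not Found", {"error": "not found"})


def call_app(method: str, path: str, body: bytes = b""):
    """In-process test client: returns (status, decoded JSON) without a network."""
    captured = {}
    environ = {"REQUEST_METHOD": method, "PATH_INFO": path,
               "CONTENT_LENGTH": str(len(body)), "wsgi.input": io.BytesIO(body)}
    chunks = app(environ, lambda status, headers: captured.update(status=status))
    return captured["status"], json.loads(b"".join(chunks))


def serve():
    """Blocking server loop; the port comes from the PORT environment variable."""
    port = int(os.environ.get("PORT", "8080"))
    with make_server("0.0.0.0", port, app) as server:
        server.serve_forever()
```

`call_app("POST", "/predict", b'{"text": "Good product"}')` exercises the service in-process, which keeps integration tests fast; `serve()` would be the command run inside the container.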

Cloud Deployment: Cloud Run
• We use Cloud Run to deploy our model services
• Cloud Build delegates the build process to GCP
• GCP has a dedicated service for serving models, AI
Platform Prediction, but we use Cloud Run
– It is more flexible for us: we can set up any
environment and add any dependencies
– AI Platform Prediction has limits on model
size
– We can add additional endpoints to services
(e.g. /explain)
Service Dockerfile
Cloud deployment script

Cloud Run (continued)
• Useful features out of the box
– Autoscaling
– Multiple Revisions (versions), easy Rollback
– Traffic management
– Multiple Namespaces (dev, prod)
– Resource Monitoring

Delivery pipeline automation (CI/CD)
• Implemented in GitLab CI/CD
• Pipeline stages, in the order they appear on the slide diagram, triggered by a push:
– Download files
– Build image
– Run tests
– Run static analysis
– Push image to registry
– Code review
– Canary rollout
– Deploy

Testing and evaluation
• Unit and integration tests for:
– Input pipelines
– Preprocessing functions
• “Regression” tests for:
– Performance on validation data
– Predictions on some important, hand-picked examples
– Performance on data slices
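The "regression" checks above can be expressed as ordinary functions that a test runner calls after each training run. This is a minimal sketch using accuracy as the metric; the thresholds, slice labels, and function names are assumptions, and a real suite would use whichever metric matters for the model.

```python
from collections import defaultdict


def accuracy(y_true, y_pred) -> float:
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_slice(y_true, y_pred, slices) -> dict:
    """Group examples by slice label and compute accuracy within each group."""
    if len(slices) != len(y_true):
        raise ValueError("slices must have one label per example")
    grouped = defaultdict(lambda: ([], []))
    for t, p, s in zip(y_true, y_pred, slices):
        grouped[s][0].append(t)
        grouped[s][1].append(p)
    return {s: accuracy(ts, ps) for s, (ts, ps) in grouped.items()}


def regression_failures(y_true, y_pred, slices, min_overall, min_per_slice) -> list:
    """Return human-readable messages for every threshold that is not met."""
    failures = []
    overall = accuracy(y_true, y_pred)
    if overall < min_overall:
        failures.append(f"overall accuracy {overall:.2f} < {min_overall:.2f}")
    for name, acc in sorted(accuracy_by_slice(y_true, y_pred, slices).items()):
        if acc < min_per_slice:
            failures.append(f"slice {name!r}: accuracy {acc:.2f} < {min_per_slice:.2f}")
    return failures


def failed_examples(predict_fn, examples) -> list:
    """Check hand-picked (input, expected) pairs; return the ones that fail."""
    failures = []
    for text, expected in examples:
        got = predict_fn(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures
```

A CI job can then assert that both `regression_failures(...)` and `failed_examples(...)` return empty lists, so a drop on a single slice blocks the release even when overall accuracy looks fine.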

Monitoring
• System-level metrics:
– Resource consumption (RAM, CPU), health checks, status codes, latency, etc.
• Data-level metrics:
– Prediction distributions, input data distributions
– System performance against real-time labels (collected automatically or manually)
https://mlinproduction.com/
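One simple way to watch prediction distributions is to compare the label frequencies in a recent window against a reference window. The sketch below uses total variation distance for that comparison; the choice of distance, the `0.1` threshold, and the function names are assumptions for illustration, not the monitoring setup described in the talk.

```python
from collections import Counter


def label_distribution(labels) -> dict:
    """Relative frequency of each predicted label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p: dict, q: dict) -> float:
    """Half the L1 distance between two discrete distributions (0 = same, 1 = disjoint)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def prediction_drift(reference_labels, live_labels, threshold: float = 0.1):
    """Return (distance, alert) where alert is True if the distance exceeds the threshold."""
    distance = total_variation_distance(
        label_distribution(reference_labels), label_distribution(live_labels)
    )
    return distance, distance > threshold
```

If the reference window is half positive and the live window is 80% positive, the distance is about 0.3 and the alert fires. The same comparison can be applied to binned input features to track input data distributions.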

Streamlit
• https://www.streamlit.io/
• An easy tool for creating simple web data products directly in Python
• You can use it to create demos, share your work, showcase your models' behaviour, and debug
• Very intuitive, no web skills required
https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace
