This fourth meetup presents good practices and tips for deploying a recommender system in production. We will cover a wide range of the day-to-day work of machine learning engineers and DevOps: from test-driven development to continuous integration and cloud architecture design. We will see how machine learning, and recommender systems in particular, differ from traditional software development, how this impacts deployment pipelines, and what tools you can use to solve these problems.
9. Recommender Systems – Previous Meetups
(1) We started with the dataset.
(2) Then we trained models.
(3) We selected the best one.
(4) And now, we want to deploy the model to production!
10. Recommender Systems Datasets – Previous Meetups
Explicit feedback (users’ ratings) vs implicit feedback (users’ clicks):

                    Explicit feedback                  Implicit feedback
Example domains     Movies, TV shows, Music            Marketplaces, Businesses
Example data type   Like/Dislike, Stars                Clicks, Play-time, Purchases
Complexity          Clean, Costly, Easy to interpret   Dirty, Cheap, Difficult to interpret
13. Recommender Systems Models – Previous Meetups
[Diagram: example of the model we want to deploy, built from user embeddings, item embeddings, and model weights]
14. Recommender Systems Models – This Meetup
[Diagram: the model combines Alice’s user embedding, the Star Wars item embedding, and the model weights to predict a 4.5/5 rating]
15. Software vs ML
Traditional Software         Machine Learning Software
Stateless                    Stateful
Explicit specifications      No specifications
Rule-based logic from code   Model-based logic from data
18. Key Objectives
Highly scalable, highly available
Large number of users and items
No downtime
Continuous user updates
New user/item interactions (e.g. ratings, clicks, watch)
Frequent item updates and new items
New items are added continuously
Non-trivial model
Can’t be built in e.g. SQL or ElasticSearch
21. Assumptions
Not Too Big
Can be trained on a big enough ML server
No need for distributed ML
Less than 1TB of data, so less than 1B items
Model Updates Can Be Delayed
You do not need to update the entire model several times per hour in production
(Adding new items in real time may be supported though)
Recommendations Are Not Stored
You do not need to save recommendations
26. Cloud Infrastructure – Plan
Users Embeddings Storage
Items Embeddings Storage
Model Weights Storage
27. Users Embeddings Storage
User Embeddings
Binary blob <1KB
Keep changing
Need only one per user request
Key-Value Store
Fetch the user embeddings from the cloud at each request
An atomic key-value store like Redis will handle concurrency for free
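A minimal sketch of this pattern, using a plain dict in place of a real Redis client and a hypothetical packing scheme for the <1KB embedding blob (with redis-py you would call `r.set` / `r.get` on the same binary values):

```python
import struct

# Stand-in for a Redis client: with redis-py you would use r.set / r.get
# on the same binary blobs.
kv_store = {}

def save_user_embedding(user_id, embedding):
    # pack the float32 embedding into a compact binary blob (<1KB)
    blob = struct.pack(f"{len(embedding)}f", *embedding)
    kv_store[f"user_emb:{user_id}"] = blob

def load_user_embedding(user_id):
    blob = kv_store[f"user_emb:{user_id}"]
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

save_user_embedding("alice", [0.1, -0.5, 0.25, 1.0])
print(load_user_embedding("alice"))
```

The key prefix `user_emb:` and the float32 layout are illustrative choices, not a prescribed schema.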
28. Items Embeddings – The Big Issue
Network Problem
If the items data is stored in a database like SQL,
and the model is too complex to be expressed in SQL:
then you need to fetch 100% of the items data from the DB to your compute instance
...at each request!
Rule of Thumb
1M items * 1KB each → 1GB total data
(1B items * 1KB each → 1TB total data)
30. Network Issue – Solution
[Diagram: a candidates-generation step pre-selects items from the Items DB, the model scores these candidates using the Users DB, and a Top-K step turns “What should I watch?” into “Star Wars!”]
31. Items Candidate Generation
Goal: pre-select thousand(s) of items for your model, without needing to see all embeddings
Model-Free
E.g. kNN item-item with pre-computed kNN tables
Easy with an ML-ready DB like Spark
Doable in ElasticSearch or even SQL
Model-Based
E.g. linear matrix factorization, then a smart implementation of Top-K
The model has to be monotonic (w.r.t. items dimensions), otherwise you can’t rely on a pre-computed index
Ref: ElasticSearch with word2vec embeddings http://bit.ly/2Wciike (WIP without efficient Top-K [issue#42326])
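A minimal sketch of the model-free variant, assuming a hypothetical pre-computed item-item kNN table: the candidates are the union of the neighbors of the user’s recently seen items.

```python
# Hypothetical pre-computed item-item kNN table (item -> nearest items),
# e.g. built offline in Spark and served from a DB.
KNN_TABLE = {
    "star_wars": ["empire_strikes_back", "return_of_the_jedi", "dune"],
    "amelie": ["delicatessen", "the_artist"],
    "dune": ["blade_runner", "star_wars"],
}

def generate_candidates(recent_items, max_candidates=1000):
    """Union of the kNN lists of recently seen items, excluding those items."""
    seen = set(recent_items)
    candidates = []
    for item in recent_items:
        for neighbor in KNN_TABLE.get(item, []):
            if neighbor not in seen and neighbor not in candidates:
                candidates.append(neighbor)
    return candidates[:max_candidates]

print(generate_candidates(["star_wars", "dune"]))
# ['empire_strikes_back', 'return_of_the_jedi', 'blade_runner']
```

The model then only has to score these thousand(s) of candidates instead of all items.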
35. Throughput Problem and Solution
Throughput Problem
You can’t get thousands of arbitrary items embeddings at each request
The physical limit is DB CPU, not network speed
Hosting a local Redis co-located on the same physical machine as your process does not help
Throughput Solution (DevOps Nightmares!)
🎃🧛🕷 You need to keep items embeddings in memory of your processes 🕷🧛🎃
37. Items Embeddings Storage
Cloud Storage
Read once when spawning your processes
Fully updated every time you deploy a new model
<1M items → static file storage works well (e.g. AWS S3, Google Storage)
>1B items → big data storage (AWS RedShift, Google BigQuery, Hadoop/HDFS), updating will be “interesting”
In-Memory Replica
Loading pre-fork enables copy-on-write → can share read-only data with many processes for free
Otherwise shared memory, but with concurrency issues
>1B items → cannot load everything at init, and requires cache-like mechanisms
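A minimal sketch of the pre-fork copy-on-write pattern (POSIX only; the dict of lists is a hypothetical stand-in for a large read-only numpy array): the table is loaded once at module level, before workers are forked, so all workers share the same physical memory pages as long as nobody writes to them.

```python
import multiprocessing as mp

# Loaded ONCE, pre-fork: forked workers share these pages copy-on-write.
ITEM_EMBEDDINGS = {f"item_{i}": [float(i)] * 4 for i in range(100_000)}

def worker(item_id, out_queue):
    # The forked worker only READS the shared table: no pages are copied.
    out_queue.put(sum(ITEM_EMBEDDINGS[item_id]))

ctx = mp.get_context("fork")  # 'fork' enables copy-on-write (POSIX only)
q = ctx.Queue()
procs = [ctx.Process(target=worker, args=(i, q)) for i in ("item_1", "item_2")]
for p in procs:
    p.start()
scores = sorted(q.get() for _ in procs)
for p in procs:
    p.join()
print(scores)  # [4.0, 8.0]
```

This is exactly why writable item updates are painful: a single write to the table would start copying pages in every worker.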
39. Model Weights Storage
Cloud Storage
Model weights are typically <1MB
Model weights are not updated often
Static file storage works well
In-Memory Replica
Small enough that any strategy will work (duplication, copy-on-write, shared memory, …)
43. Infrastructure
[Diagram: request flow through the serving architecture]
1. User request: “What should I watch?”
2. Fetch the user embeddings from Redis
3. kNN candidates generation against the items kNN DB
4. Look up the candidates’ item embeddings in RAM (replicated from S3)
5. Model predictions using the model weights in RAM (replicated from S3)
6. Top K selection
7. Answer: “Star Wars!”
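The seven steps above can be sketched end to end (all names and stores are hypothetical stand-ins: a dict for Redis, in-process dicts for the S3-replicated embeddings and weights, and a toy dot-product model):

```python
# Hypothetical stand-ins for the real stores in the diagram.
USER_EMBS_REDIS = {"alice": [1.0, 0.0]}             # step 2: Redis
ITEM_EMBS_RAM = {                                   # step 4: RAM, from S3
    "star_wars": [0.9, 0.1],
    "amelie": [0.1, 0.9],
    "dune": [0.7, 0.3],
}
MODEL_WEIGHTS_RAM = [1.0, 1.0]                      # step 5: RAM, from S3
KNN_DB = {"alice": ["star_wars", "amelie", "dune"]} # step 3: items kNN DB

def recommend(user_id, top_k=1):
    user_emb = USER_EMBS_REDIS[user_id]             # 2: fetch user embedding
    candidates = KNN_DB[user_id]                    # 3: kNN candidates
    scores = {}
    for item in candidates:
        item_emb = ITEM_EMBS_RAM[item]              # 4: in-memory lookup
        # 5: toy model: weighted dot product of user and item embeddings
        scores[item] = sum(w * u * i for w, u, i
                           in zip(MODEL_WEIGHTS_RAM, user_emb, item_emb))
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]  # 6: Top K
    return top                                      # 7: answer

print(recommend("alice"))  # ['star_wars']
```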
50. Online Updates – Plan
Users Embeddings Update
Items Embeddings Update
Model Weights Update
51. User Updates – The Problem
Goal
Update user embeddings on new user/item interactions
Critical for “session-based” recommendations (e.g. anonymous browsing on a retail website)
Technical Issue
To update the user embeddings you usually need all of the user’s interactions
Some users may have interacted with thousands of items (e.g. listening history on Spotify)
53. User Updates – The Solution
Have two kinds of user updates, a quick one and a slow one:
Quick “Single-Step” Update
Do one step of online update given only the data for the new interaction
e.g. one step of stochastic gradient descent
Slow “Full” Update
Periodically schedule user updates from scratch, happening in the background
Requires a scheduler or a technology for background tasks
(hello “Discover Weekly” 👋)
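The quick “single-step” update can be sketched as one SGD step on a squared-error loss over a toy dot-product model (a hypothetical setup: only the new interaction is needed, not the user’s full history):

```python
def sgd_single_step(user_emb, item_emb, rating, lr=0.1):
    """One SGD step on (rating - <user_emb, item_emb>)^2, updating only user_emb."""
    pred = sum(u * i for u, i in zip(user_emb, item_emb))
    error = rating - pred
    # gradient of the squared error w.r.t. user_emb is -2 * error * item_emb
    return [u + lr * 2 * error * i for u, i in zip(user_emb, item_emb)]

user_emb = [0.0, 0.0]
item_emb = [1.0, 0.5]
# Alice rates the item 4.5/5: one cheap online update, no full retraining
user_emb = sgd_single_step(user_emb, item_emb, rating=4.5)
print(user_emb)  # ~[0.9, 0.45]
```

The slow “full” update then periodically recomputes the embedding from the whole history in the background.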
56. User Updates – Background Tasks
Scheduling / Task Queue Technologies
● Cron
● Celery, RabbitMQ, Redis Pub/Sub
Sharing Memory
Executing the user update requires loading all item embeddings in memory again
● Either you do not share memory, and do not update too often
● Or you do the update in your main workers
● Or your background workers are forks of your main workers (e.g. uwsgi spooling)
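A minimal stand-in for the task-queue technologies above, using only the standard library (with Celery or RabbitMQ the queue and consumer would be managed services, but the shape is the same: requests enqueue user ids, a background worker runs the slow full update):

```python
import queue
import threading

task_queue = queue.Queue()
updated = []

def full_user_update(user_id):
    # Placeholder for the slow "from scratch" re-embedding of one user.
    updated.append(user_id)

def worker():
    # Background consumer: with Celery this would be a task worker.
    while True:
        user_id = task_queue.get()
        if user_id is None:  # sentinel to stop
            break
        full_user_update(user_id)
        task_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
for uid in ["alice", "bob"]:
    task_queue.put(uid)   # enqueue slow updates without blocking requests
task_queue.put(None)
t.join()
print(updated)  # ['alice', 'bob']
```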
57. Item Updates
New Items (Cold Start)
New items don’t have any user interactions
Collaborative-Filtering models do not support adding new items
Content-Based models do support adding new items
Update Items from New Interactions
SVD-based Matrix Factorization supports restricted forms of online-update
Otherwise heuristics might work, such as a single step of gradient descent
The benefits are typically not worth the DevOps trouble
58. Item Updates
Update Cloud Storage
Typically replace the entire file, or use rsync for smarter in-place updates
Update In-Memory Replica
Not possible with “copy-on-write sharing” → re-deploy all your workers
Doable with shared memory, but the small benefits might not be worth the DevOps trouble
59. Model Updates
Goal
Update the model weights based on all new user/item interactions
Deploy new model weights without downtime
Real-Time?
As for items embeddings, everything is much easier if model weights are read-only
There are only small benefits to real-time updates of the model weights
You probably want to retrain your model from scratch once in a while and re-deploy
(hello again “Discover Weekly” 👋)
63. Testing and CI/CD – Plan
Unit-Testing and Advanced Tests
Versioning
Orchestration
64. Unit-Tests for ML
Difference with Traditional Software
No way to programmatically express the specifications of ML software
Cf. “Software 2.0” by Andrej Karpathy http://bit.ly/2Ni1apj
Small Specific Unit-Tests
Make unit-tests absurdly easy to pass. If they fail, you must be sure you have a bug
Test your code, not the generalization ability of your model:
● well-known algorithm? → there are maths and proofs
● heuristic? → expected to fail, this shouldn’t make CI fail
E.g. test that your model can successfully overfit train data when you remove all regularization
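A minimal sketch of such an overfitting test (a hypothetical one-parameter model trained by gradient descent; with regularization removed it must drive the train error to ~0, otherwise the training code has a bug):

```python
def fit_scale(xs, ys, steps=500, lr=0.05, l2=0.0):
    """Fit y ~ w * x by gradient descent on squared error + l2 * w^2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        w -= lr * grad
    return w

def test_can_overfit_train_data():
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # perfectly linear data
    w = fit_scale(xs, ys, l2=0.0)              # all regularization removed
    train_mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # absurdly easy to pass: if this fails, you must be sure there is a bug
    assert train_mse < 1e-6, train_mse

test_can_overfit_train_data()
print("ok")
```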
67. Advanced Tests for ML
Goal
Test the generalization ability of your models
Synthetic Datasets > Real Datasets
Full control of the assumptions
You can make them easy enough so that if the test fails, something is wrong
(disclaimer: this talk is for ML Engineers & Ops 👷♀ not ML Researchers 👩🎓)
Compare Against Baselines
Hard to find good performance thresholds for the tests to pass
Easy to test that your model performs better than simple baselines
70. Example: Model Test
def test_model(self):
    """Test that the model performs better than a constant baseline."""
    # generate sample data
    dataset = self._sample_synthetic_dataset()
    # train/valid split
    train_data, valid_data = self._trainvalid_split(dataset)
    # fit our model
    model = self._get_model()
    model.fit(train_data)
    model_valid_cost = self._get_valid_scores(model, valid_data)
    # fit a dummy baseline model
    cst_model = ConstantByUserModel()
    cst_model.fit(train_data)
    cst_model_valid_cost = self._get_valid_scores(cst_model, valid_data)
    # test that our model is better
    assert model_valid_cost < cst_model_valid_cost
71. Versioning – Databases vs ML Data Objects
Database                                           ML Data Objects
Persistent                                         Ephemeral
Slow incremental updates                           Frequent drastic changes from scratch
Foreign keys and locks to prevent corrupted data   Object dependencies not expressed programmatically
72. Versioning
ML Objects Identifiers
ML objects keep changing and depend on each other (e.g. datasets, models weights, items weights)
Store unique identifiers for all ML objects (data hash or unique id)
Keep track of identifiers of dependencies to prevent corrupted data
Versioning
Version ML objects to keep track of architecture changes
Match code version against data version
e.g. weights for a 3-layer neural net must never be loaded by code expecting a 2-layer architecture
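A minimal sketch of such identifiers (a hypothetical scheme: a content hash over the serialized object plus an architecture version plus the identifiers of its dependencies, so stale code/data combinations are detectable):

```python
import hashlib
import json

def ml_object_id(payload, *, version, dependencies=()):
    """Unique id = hash of content + architecture version + dependency ids."""
    h = hashlib.sha256()
    h.update(json.dumps(payload, sort_keys=True).encode())
    h.update(str(version).encode())
    for dep_id in sorted(dependencies):
        h.update(dep_id.encode())
    return h.hexdigest()[:12]

dataset_id = ml_object_id({"n_items": 1000}, version="v1")
weights_id = ml_object_id({"layers": 3}, version="v1",
                          dependencies=[dataset_id])

# Changing an upstream object or the architecture changes every downstream
# id, so mismatched code/data combinations can be rejected at load time.
assert weights_id != ml_object_id({"layers": 2}, version="v1",
                                  dependencies=[dataset_id])
print(dataset_id, weights_id)
```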
74. Orchestration
Manual Workflow
At first, execute your workflow manually or with CI/CD triggers
E.g. bash scripts, python scripts, Gitlab triggers, cron jobs
Dedicated Workflow Orchestrator and Jobs DAGs
When you have dozens of ML jobs that depend on each other, you will end up making mistakes:
● making the pipeline fail
● or worse: deploying corrupted data
Workflow Management System: Airflow, Luigi, Oozie, ...
WMS tutorial: http://bit.ly/2MUM0r9
75. Microservices and Kubernetes
Microservices
● All microservices must be safe to replicate/delete for scaling up/down to adjust for variable load
→ the source of truth for data cannot be a microservice
→ it has to be managed in the cloud
● Fine-tune liveness probes (and readiness probes) to allow slow init of containers
[new in Kubernetes 1.16: startup probes]
● Memory is expensive
→ split microservices in a way you can still share memory between workers and avoid duplication
● Microservices for RecSys require lots of memory and compute
→ favor a few big nodes over many small ones
79. Microservices and Kubernetes
Model Split-Brain Problem
● No-downtime updates require blue/green or canary deployments
● Your architecture has to serve two models at once
● And you should not expect to update both the code and the data at the same time
89. ML Data File Format
Cross-language
HDF5 has great integration in python/pandas
Protobuf has great integration in Go, but is very slow in python (as of today)
In Python: pickle+gzip
Can pickle numpy arrays efficiently
Only uses the standard library
Faster to read/write than HDF5
Alternative: a high-level library like sklearn’s joblib
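A minimal sketch of the pickle+gzip approach, using only the standard library (an `array` of floats stands in for a numpy array here):

```python
import array
import gzip
import os
import pickle
import tempfile

def save_ml_object(path, obj):
    # gzip compresses redundant embedding data well; pickle serializes
    # numpy arrays efficiently (an array.array stands in for one here)
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_ml_object(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

embeddings = {"star_wars": array.array("f", [0.9, 0.1, 0.25])}
path = os.path.join(tempfile.gettempdir(), "item_embs.pkl.gz")
save_ml_object(path, embeddings)
restored = load_ml_object(path)
print(restored["star_wars"][2])  # 0.25
```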
90. Cloud Storage – Experiments Results
NoSQL ML Experiments Management Comparison
Comparison with code samples: http://bit.ly/2JtZg3M
              Storage      Language          Scaling   UI
MLflow        file-based   Python, R, Java   Small     internal
DVC           file-based   Shell             Small     no
Sacred        MongoDB      Python            Large     external
TensorBoard   file-based   TF, PyTorch       Medium    great
91. Cloud Storage – Large ML Files
Static Storage for Saved Models and Datasets
File-based static storage
● either cloud (AWS S3, Google Storage, etc.)
● or NAS virtual file system
Limited metadata support (e.g. tags)
Requires strict file naming conventions
92. Tools
Not much 🤷♂
In Python
● Use numpy structured arrays http://bit.ly/2BP1hU3
● With Intel CPUs, use intel-numpy-mkl https://dockr.ly/321u2Yp
● Use scipy sparse matrices http://bit.ly/36bJwMt
● Pandas (but may be unnecessarily slow vs pure numpy)
● Stay tuned! github.com/Crossing-Minds
93. Real-Time Deployment – Conclusion
● Replicate items embeddings and model weights in memory
● Re-deploy to update (read-only for easy memory sharing)
● User embeddings can be managed in a cloud key-value store
● Proper identification, versioning, and naming convention for ML data files
● Manual workflow, then managed by DAGs of jobs