1. Monitoring Models in Production
Keeping track of complex models in a complex world
Jannes Klaas
2. About me
• International Business @ RSM
• Financial Economics @ Oxford Saïd
• Course developer, machine learning @ Turing Society
• Author of “Machine Learning for Finance”, out in July
• ML consultant for non-profits / impact investors
• Previously Urban Planning @ IHS Rotterdam & destroyer of my startup
3. The life and times of an ML practitioner
“We send you the data, you send us back a model, then we take it from there” – Consulting clients
“Define an approach, evaluate on a common benchmark, and publish” – Academia
4. Repeat after me
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
5. Machine learning 101
Estimate some function y = f(x) using (x, y) pairs
The estimated function hopefully represents the true relationship between x and y
The model is a function of the data
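A minimal illustration of this setup (my own example; the deck names no library), using scikit-learn to estimate f from (x, y) pairs:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.random.rand(100, 1)                    # observed inputs
y = 3 * x[:, 0] + 0.1 * np.random.randn(100)  # noisy samples of the true relationship y = 3x
model = LinearRegression().fit(x, y)          # the model is a function of this data
print(model.coef_)                            # hopefully close to the true coefficient 3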
7. Problems you encounter in production
• The world changes; your training data might no longer depict the real world
• Your model inputs might change
• There might be unintended bugs and side effects in complex models
• Models influence the world they try to model
• Model decay: your model usually becomes worse over time
8. Are models a liability after shipping?
No, the real world is the perfect training environment
Datasets are only an approximation of the real world
Active learning on real-world examples can greatly reduce your data needs
9. Online learning
• Update the model continuously as new data streams in (see the sketch below)
• Good if you have a continuous stream of ground truth as well
• Needs more monitoring to ensure the model does not go off track
• Can be expensive for big models
• Might need separate training / inference hardware
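A minimal sketch of the pattern with scikit-learn’s partial_fit; the estimator choice and the stream_of_labeled_batches() helper are assumptions, not part of the deck:

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()     # linear model fit incrementally by SGD
classes = np.array([0, 1])  # all labels must be declared on the first call

for X_batch, y_batch in stream_of_labeled_batches():  # hypothetical ground-truth stream
    model.partial_fit(X_batch, y_batch, classes=classes)
    # monitoring hook goes here: compare live metrics against a held-out benchmark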
11. Model monitoring vs Ops monitoring
• Model monitoring tracks model behavior
• Inherently stochastic
• Can be driven by user behavior
• Almost certainly looking for unknown unknowns
• Few established guidelines on what to monitor
12. Monitoring inputs
• E.g. images arriving at the model are very small, very dark, high contrast, etc.
• More similar to ops monitoring, as there can be obvious failures
• Monitor classic stats and compare them to the training data (see the sketch below):
• Means
• Standard deviations
• Correlations
• KL divergence between training & live data
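A minimal sketch of those comparisons; the train_images / live_images arrays and the 32-bin histogram are assumptions:

import numpy as np
import scipy.stats

def input_stats(batch):
    # batch: images as an array of shape (n, height, width, channels)
    return batch.mean(), batch.std()

train_mean, train_std = input_stats(train_images)  # computed once, offline
live_mean, live_std = input_stats(live_images)     # recomputed per monitoring window

# KL divergence between pixel-intensity histograms of training and live data
train_hist, edges = np.histogram(train_images, bins=32, density=True)
live_hist, _ = np.histogram(live_images, bins=edges, density=True)
kl = scipy.stats.entropy(live_hist + 1e-9, train_hist + 1e-9)  # epsilon guards empty bins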
13. Output monitoring
Harder: people might just upload more plane images one day
Monitoring the prediction distribution is surprisingly helpful
Monitor confidence (highest probability – lowest probability), as sketched below
Compare against other model predictions
Compare against ground truth
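A minimal sketch of both checks; probs is assumed to be the collected softmax output of the live model:

import numpy as np

# probs: (n_samples, n_classes) softmax outputs gathered from live traffic
confidence = probs.max(axis=1) - probs.min(axis=1)  # the slide’s confidence measure

pred_classes = probs.argmax(axis=1)
counts = np.bincount(pred_classes, minlength=probs.shape[1])
pred_distribution = counts / counts.sum()  # compare this to the training label distribution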
14. Ground truth
• In the absence of a ground truth signal, ground truth needs to be established manually
• Can be done by data scientists themselves with good UI design
• Yields extra insights: ‘Our model does worse when Instagram filters are applied’ / ‘Many users take sideways pictures’
• Prioritize low-confidence predictions for active learning
• Sample randomly for monitoring (both sketched below)
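A minimal sketch of the two sampling strategies, reusing the confidence scores from the previous sketch; the batch size of 100 is an arbitrary assumption:

import numpy as np

labeling_queue = np.argsort(confidence)[:100]  # least-confident examples first, for active learning
monitor_sample = np.random.choice(len(confidence), size=100, replace=False)  # unbiased sample for monitoring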
16. Alerting / Monitoring is a UI/UX problem
• The terms might be very hard to explain or interpret
• Who here would comfortably write down the formula for KL divergence (shown below) and explain what it means?
• Key metrics are different depending on the use case
• Non-data-scientists might have to make decisions based on alerts
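For reference, the formula in question, for discrete distributions P (e.g. live data) and Q (e.g. training data):

D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}

It is zero exactly when the two distributions agree and grows as P puts mass where Q has little, which is why it flags live data drifting away from the training data.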
18. Alerting Example
• Detected KL(train ‖ live) = 1.56394694, which is out of bounds
• Detected model output distribution significantly different from training data
• Detected an unexpected number of pictures classified as Pugs
19. Model accountability
• Who made the decision?
• Model versioning: all versions need to be retained
• On which grounds was the decision made?
• All input data needs to be retained and must be linked to a transaction ID
• Why was the decision made?
• Use tools like LIME to interpret models (see the sketch below)
• Still a long way to interpretable deep models, but we are getting there
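A minimal sketch of explaining a single image prediction with LIME’s image explainer; using model.predict as the classifier function is an assumption about the serving setup:

from lime import lime_image

explainer = lime_image.LimeImageExplainer()
# image: one image as a numpy array; model.predict maps a batch of images to class probabilities
explanation = explainer.explain_instance(image, model.predict, top_labels=5, num_samples=1000)
# Recover the regions that drove the top predicted class
img, mask = explanation.get_image_and_mask(explanation.top_labels[0], positive_only=True, num_features=5)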
21. Large impact effects…
• … are hard to monitor
• … are not what data scientists are trained for
• … only show with large scale deployment
• … are time delayed
• … are influenced by exogenous factors, too
24. A simple monitoring system with Flask
(Diagram) The user sends an image to a Keras + Flask service, which returns the classification and stores each transaction in a transaction DB. The Keras + Flask service forwards image + classification to a SciKit + Flask monitoring service, the transaction DB provides it with benchmark data, and the monitoring service sends alerts to the data scientist. A sketch of the stored transaction record follows.
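A minimal sketch of the transaction record such a system might store per request; every field name, the MODEL_VERSION constant, the store_image() helper, and the MongoDB-style collection are assumptions:

import time
import uuid

record = {
    "transaction_id": uuid.uuid4().hex,  # links the input data to the decision (slide 19)
    "model_version": MODEL_VERSION,      # identifies which retained model version decided
    "input_ref": store_image(image),     # hypothetical helper that retains the raw input
    "predictions": data["predictions"],
    "timestamp": time.time(),
}
transactions.insert_one(record)          # e.g. a pymongo collection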
25. Bare Bones Flask Serving
import flask
from keras.applications import ResNet50
from keras.applications.imagenet_utils import decode_predictions

app = flask.Flask(__name__)
model = ResNet50(weights="imagenet")  # example model, loaded once at startup; the deck does not name one

@app.route("/predict", methods=["POST"])  # route name is illustrative
def predict():
    data = {"success": False}
    image = flask.request.files["image"].read()
    image = prepare_image(image, target=(224, 224))  # decode/resize/preprocess helper defined elsewhere
    preds = model.predict(image)
    # decode_predictions yields (imagenet_id, label, probability) triples
    results = decode_predictions(preds)
    data["predictions"] = []
    for (_, label, prob) in results[0]:
        r = {"label": label, "probability": float(prob)}
        data["predictions"].append(r)
    data["success"] = True
    return flask.jsonify(data)
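Exercising the endpoint from a client, assuming the app runs on Flask’s default port and uses the illustrative /predict route above; the filename is also an assumption:

import requests

with open("dog.jpg", "rb") as f:
    resp = requests.post("http://localhost:5000/predict", files={"image": f})
print(resp.json())  # {"success": true, "predictions": [...]}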
26. Statistical monitoring with SciKit
import numpy as np
import scipy.stats

# pk: live prediction distribution; qk: benchmark distribution from training data
# Given a second distribution, scipy.stats.entropy returns the KL divergence D(pk ‖ qk)
ent = scipy.stats.entropy(pk, qk, base=2)
if ent > threshold:
    abs_diff = np.abs(pk - qk)
    worst_offender = lookup[np.argmax(abs_diff)]  # class label with the largest deviation
    max_deviation = np.max(abs_diff)
    alert(model_id, ent, worst_offender, max_deviation)  # notify the data scientist
27. Data science teams should own the whole process
Define approach → Feature engineering → Train model → Deploy → Monitor
28. Unsolved challenges
• Model versioning
• Dataset versioning
• Continuous integration for data scientists
• Communication and understanding of model metrics in the org
• Managing higher-order effects
29. Recommended reading
• Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
• Breck et al. (2016), What’s your ML Test Score? A rubric for ML production systems. https://ai.google/research/pubs/pub45742
• How Zendesk Serves TensorFlow Models in Production. https://medium.com/zendesk-engineering/how-zendesk-serves-tensorflow-models-in-production-751ee22f0f4b
• Machine Learning for Finance ;) https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-finance