Full Stack Deep Learning Testing & Deployment
1. Full Stack Deep Learning
Josh Tobin, Sergey Karayev, Pieter Abbeel
Testing & Deployment
2. Full Stack Deep Learning
[Infrastructure & tooling landscape diagram: "all-in-one" platforms alongside per-area tooling — Data (Labeling, Storage, Processing, Versioning, Database), Development (Software Engineering, Frameworks, Experiment Management, Hyperparameter Tuning, Distributed Training, Resource Management), Training/Evaluation, and Deployment (CI / Testing, Web, Hardware / Mobile, Interchange, Monitoring). This lecture covers the Testing & Deployment pieces.]
Testing & Deployment - Overview
3. Full Stack Deep Learning
Module Overview
• Concepts
• Project structure
• ML Test Score: a rubric for production readiness
• Infrastructure/Tooling
• CI/Testing
• Docker in more depth
• Deploying to the web
• Monitoring
• Deploying to hardware/mobile
3
4. Full Stack Deep Learning
4
Training System
- Process raw data
- Run experiments
- Manage results

Prediction System
- Process input data
- Construct networks with trained weights
- Make predictions

Serving System
- Serve predictions
- Scale to demand

Training System Tests (runs in < 1 day, nightly)
- Test the full training pipeline, starting from raw data
- Catches upstream regressions

Validation Tests (runs in < 1 hour, every code push)
- Test the prediction system on the validation set, starting from processed data
- Catches model regressions

Functionality Tests (runs in < 5 minutes, during development)
- Test the prediction system on a few important examples
- Catches code regressions

Monitoring
- Alert to downtime, errors, distribution shifts
- Catches service and data regressions

(The test suites run against training/validation data; monitoring runs against production data.)
Testing & Deployment - Project Structure
5. Full Stack Deep Learning
5
The ML Test Score:
A Rubric for ML Production Readiness and Technical Debt Reduction
Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley
Google, Inc.
ebreck, cais, nielsene, msalib, dsculley@google.com
[Paper screenshot — abstract (excerpt): Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems. The rubric focuses on issues specific to ML systems, and so does not include general software engineering best practices such as ensuring good unit test coverage and a well-defined binary release process; such strategies remain necessary as well.]
IEEE Big Data 2017
An exhaustive framework/checklist from practitioners at Google
Testing & Deployment - ML Test Score
6. Full Stack Deep Learning
6
Follow-up to previous work
from Google
Testing & Deployment - ML Test Score
7. Full Stack Deep Learning
7
[Figure from the paper: ML Systems Require Extensive Testing and Monitoring. The key consideration is that unlike a manually coded system (Traditional Software, left), ML system behavior is not easily specified in advance; it depends on dynamic qualities of the data and on various model configuration choices (Machine Learning Software, right). Overlaid on the project structure: Data → Training System → Prediction System → Serving System.]
Testing & Deployment - ML Test Score
8. Full Stack Deep Learning
Data Tests (Table I: brief listing of the seven data tests)
1. Feature expectations are captured in a schema.
2. All features are beneficial.
3. No feature’s cost is too much.
4. Features adhere to meta-level requirements.
5. The data pipeline has appropriate privacy controls.
6. New features can be added quickly.
7. All input feature code is tested.
(From the paper: training data, validation data, and vocabularies all have the potential to contain sensitive user data, so teams need to remove personally identifiable information (PII).)

Model Tests (Table II: brief listing of the seven model tests)
1. Model specs are reviewed and submitted.
2. Offline and online metrics correlate.
3. All hyperparameters have been tuned.
4. The impact of model staleness is known.
5. A simpler model is not better.
6. Model quality is sufficient on important data slices.
7. The model is tested for considerations of inclusion.

ML Infrastructure Tests (Table III: brief listing of the seven ML infrastructure tests)
1. Training is reproducible.
2. Model specs are unit tested.
3. The ML pipeline is integration tested.
4. Model quality is validated before serving.
5. The model is debuggable.
6. Models are canaried before serving.
7. Serving models can be rolled back.
(From the paper: ideally, training twice on the same data should produce two identical models; deterministic training simplifies reasoning about the whole system and aids auditability and debugging, but in practice training is often not reproducible, especially with non-convex methods.)

Monitoring Tests (Table IV: brief listing of the seven monitoring tests)
1. Dependency changes result in notification.
2. Data invariants hold for inputs.
3. Training and serving are not skewed.
4. Models are not too stale.
5. Models are numerically stable.
6. Computing performance has not regressed.
7. Prediction quality has not regressed.
Testing & Deployment - ML Test Score
9. Full Stack Deep Learning
Scoring the Test
Points Description
0 More of a research project than a productionized system.
(0,1] Not totally untested, but it is worth considering the possibility of serious holes in reliability.
(1,2] There’s been a first pass at basic productionization, but additional investment may be needed.
(2,3] Reasonably tested, but it’s possible that more of those tests and procedures may be automated.
(3,5] Strong levels of automated testing and monitoring, appropriate for mission-critical systems.
> 5 Exceptional levels of automated testing and monitoring.
Table V
The ML Test Score is computed by taking the minimum score from each of the four test areas (data, model, infrastructure, monitoring). Systems at different points in their development may reasonably aim to be at different points along this scale.
Testing & Deployment - ML Test Score
10. Full Stack Deep Learning
Results at Google
• 36 teams with ML systems in production at Google
[Figure 4 from the paper: average scores for interviewed teams — the average score for each test across the 36 systems examined.]
Testing & Deployment - ML Test Score
12. Full Stack Deep Learning
[Infrastructure & tooling landscape diagram repeated from slide 2, introducing the CI/Testing section.]
Testing & Deployment - CI/Testing
13. Full Stack Deep Learning
Testing / CI
• Unit / Integration Tests
• Tests for individual module functionality and for the whole system
• Continuous Integration
• Tests are run every time new code is pushed to the repository, before the updated model is deployed.
• SaaS for CI
• CircleCI, Travis, Jenkins, Buildkite
• Containerization (via Docker)
• A self-enclosed environment for running the tests
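To make the fast "functionality test" tier concrete, here is a hedged pytest sketch of the kind of check that runs on every push; the predictor module, Predictor class, and example files are hypothetical stand-ins, not the course codebase:

```python
# Hypothetical fast functionality test: a few important examples, runs in seconds.
from pathlib import Path

import pytest
from PIL import Image

from predictor import Predictor  # hypothetical prediction-system entry point

SUPPORT_DIR = Path(__file__).parent / "support"

EXAMPLES = [
    ("cat.png", "cat"),
    ("dog.png", "dog"),
]


@pytest.fixture(scope="module")
def predictor():
    # Load weights once per test module to keep the suite under a few minutes.
    return Predictor()


@pytest.mark.parametrize("filename,expected_label", EXAMPLES)
def test_prediction_on_important_examples(predictor, filename, expected_label):
    image = Image.open(SUPPORT_DIR / filename)
    label, confidence = predictor.predict(image)
    assert label == expected_label
    assert confidence > 0.5  # loose threshold: catches code regressions, not model quality
```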
Testing & Deployment - CI/Testing
14. Full Stack Deep Learning
14
[Project structure diagram repeated from slide 4: Data → Training System → Prediction System → Serving System, with Training System Tests, Validation Tests, Functionality Tests, and Monitoring.]
Testing & Deployment - CI/Testing
15. Full Stack Deep Learning
15
[ML Test Score tables repeated from slide 8: Data Tests, Model Tests, ML Infrastructure Tests, Monitoring Tests.]
Testing & Deployment - CI/Testing
16. Full Stack Deep Learning
CircleCI / Travis / etc
• SaaS for continuous integration
• Integrate with your repository: every push kicks off a cloud job
• Job can be defined as commands in a Docker container, and can store
results for later review
• No GPUs
• CircleCI has a free plan
Testing & Deployment - CI/Testing
17. Full Stack Deep Learning
Jenkins / Buildkite
• Nice option for running CI on your own hardware, in the cloud, or mixed
• Good option for a scheduled training test (can use your GPUs)
• Very flexible
17
18. Full Stack Deep Learning
Docker
• Before proceeding, we have to understand Docker better.
• Docker vs VM
• Dockerfile and layer-based images
• DockerHub and the ecosystem
• Kubernetes and other orchestrators
Testing & Deployment - Docker
19. Full Stack Deep Learning
19
https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b
No guest OS -> lightweight (containers share the host OS kernel instead of bundling a full operating system)
Testing & Deployment - Docker
20. Full Stack Deep Learning
• Spin up a container for
every discrete task
• For example, a web app
might have four containers:
• Web server
• Database
• Job queue
• Worker
20
https://www.docker.com/what-container
Lightweight -> heavy use (containers are cheap enough to spin up and tear down for every discrete task)
Testing & Deployment - Docker
21. Full Stack Deep Learning
Dockerfile
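The slide shows an example Dockerfile. A minimal illustrative sketch of the same idea — base image, pinned dependencies, app code, entry point — where the file names (requirements.txt, app.py) and port are assumptions:

```dockerfile
# Minimal illustrative Dockerfile: each instruction produces a cached image layer.
FROM python:3.7-slim

WORKDIR /app

# Install dependencies first so this layer is reused when only app code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code last (it changes most often).
COPY . .

EXPOSE 8000
CMD ["python", "app.py"]
```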
Testing & Deployment - Docker
22. Full Stack Deep Learning
Strong Ecosystem
• Images are easy to find, modify, and contribute back to DockerHub
• Private images are easy to store in the same place
22
https://docs.docker.com/engine/docker-overview
Testing & Deployment - Docker
23. Full Stack Deep Learning
23
https://www.docker.com/what-container#/package_software
Testing & Deployment - Docker
24. Full Stack Deep Learning
Container Orchestration
• Distributing containers onto underlying VMs
or bare metal
• Kubernetes is the open-source winner, but
cloud providers have good offerings, too.
Testing & Deployment - Docker
26. Full Stack Deep Learning
[Infrastructure & tooling landscape diagram repeated from slide 2, introducing the Web Deployment section.]
Testing & Deployment - Web Deployment
27. Full Stack Deep Learning
Terminology
• Inference: making predictions (not training)
• Model Serving: ML-specific deployment, usually focused on optimizing
GPU usage
Testing & Deployment - Web Deployment
28. Full Stack Deep Learning
Web Deployment
• REST API
• Serving predictions in response to canonically-formatted HTTP requests
• The web server is running and calling the prediction system
• Options
• Deploying code to VMs, scale by adding instances
• Deploy code as containers, scale via orchestration
• Deploy code as a “serverless function”
• Deploy via a model serving solution
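Whichever option you choose, the interface is usually the same kind of REST endpoint. A minimal hedged sketch using Flask, where the Predictor class and the JSON schema ("image_url" in, "label"/"confidence" out) are illustrative assumptions:

```python
# Minimal sketch of a REST prediction endpoint.
from flask import Flask, jsonify, request

from predictor import Predictor  # hypothetical prediction-system entry point

app = Flask(__name__)
predictor = Predictor()  # load weights once at startup, not per request


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    label, confidence = predictor.predict(payload["image_url"])
    return jsonify({"label": label, "confidence": float(confidence)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```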
Testing & Deployment - Web Deployment
29. Full Stack Deep Learning
29
[Project structure diagram repeated from slide 4: Data → Training System → Prediction System → Serving System, with Training System Tests, Validation Tests, Functionality Tests, and Monitoring.]
Testing & Deployment - Web Deployment
30. Full Stack Deep Learning
30
[ML Test Score tables repeated from slide 8: Data Tests, Model Tests, ML Infrastructure Tests, Monitoring Tests.]
Testing & Deployment - Web Deployment
31. Full Stack Deep Learning
31
DEPLOY CODE TO YOUR OWN BARE METAL
Testing & Deployment - Web Deployment
32. Full Stack Deep Learning
32
DEPLOY CODE TO YOUR OWN BARE METAL
DEPLOY CODE TO CLOUD INSTANCES
Testing & Deployment - Web Deployment
33. Full Stack Deep Learning
Deploying code to cloud instances
• Each instance has to be provisioned with dependencies and app code
• Number of instances is managed manually or via auto-scaling
• Load balancer sends user traffic to the instances
33
34. Full Stack Deep Learning
Deploying code to cloud instances
• Cons:
• Provisioning can be brittle
• Paying for instances even when not using them (auto-scaling does help)
Testing & Deployment - Web Deployment
35. Full Stack Deep Learning
35
DEPLOY DOCKER CONTAINERS TO
CLOUD INSTANCES
DEPLOY CODE TO YOUR OWN BARE METAL
DEPLOY CODE TO CLOUD INSTANCES
Testing & Deployment - Web Deployment
36. Full Stack Deep Learning
Deploying containers
• App code and dependencies are packaged into Docker containers
• Kubernetes or an alternative (e.g. AWS Fargate) orchestrates containers (DB / workers / web servers / etc.)
Testing & Deployment - Web Deployment
37. Full Stack Deep Learning
Deploying containers
• Cons:
• Still managing your own servers and paying for uptime, not compute-time
Testing & Deployment - Web Deployment
38. Full Stack Deep Learning
38
DEPLOY CODE TO YOUR OWN BARE METAL
DEPLOY CODE TO CLOUD INSTANCES
DEPLOY DOCKER CONTAINERS TO
CLOUD INSTANCES
DEPLOY SERVERLESS FUNCTIONS
Testing & Deployment - Web Deployment
39. Full Stack Deep Learning
Deploying code as serverless functions
• App code and dependencies are packaged into .zip files, with a single
entry point function
• AWS Lambda (or Google Cloud Functions, or Azure Functions) manages
everything else: instant scaling to 10,000+ requests per second, load
balancing, etc.
• Only pay for compute-time.
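A hedged sketch of what such a single-entry-point function looks like, written in the AWS Lambda handler style; the Predictor class and the event schema are illustrative assumptions:

```python
# Sketch of a serverless prediction function: one entry-point handler.
import json

from predictor import Predictor  # packaged into the deployment .zip with its weights

# Initialized outside the handler so warm invocations reuse the loaded model.
predictor = Predictor()


def handler(event, context):
    body = json.loads(event["body"])
    label, confidence = predictor.predict(body["image_url"])
    return {
        "statusCode": 200,
        "body": json.dumps({"label": label, "confidence": float(confidence)}),
    }
```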
Testing & Deployment - Web Deployment
40. Full Stack Deep Learning
40
Testing & Deployment - Web Deployment
41. Full Stack Deep Learning
Deploying code as serverless functions
• Cons:
• Entire deployment package has to fit within 500MB, <5 min execution,
<3GB memory (on AWS Lambda)
• Only CPU execution
Testing & Deployment - Web Deployment
42. Full Stack Deep Learning
42
[ML Test Score tables repeated from slide 8: Data Tests, Model Tests, ML Infrastructure Tests, Monitoring Tests.]
Testing & Deployment - Web Deployment
43. Full Stack Deep Learning
Deploying code as serverless functions
• Canarying
• Easy to have two versions of Lambda functions in production, and start
sending low volume traffic to one.
• Rollback
• Easy to switch back to the previous version of the function.
Testing & Deployment - Web Deployment
44. Full Stack Deep Learning
Model Serving
• Web deployment options specialized for machine learning models
• Tensorflow Serving (Google)
• Model Server for MXNet (Amazon)
• Clipper (Berkeley RISE Lab)
• SaaS solutions like Algorithmia
• The most important feature is batching requests for GPU inference
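As a toy illustration of the inter-request batching idea (all names here are illustrative; real model servers implement this far more carefully): request handlers enqueue a Request and wait on its event, while a background loop groups pending requests into one GPU forward pass.

```python
# Toy micro-batching loop: group queued requests into a single forward pass.
import queue
import threading
import time
from dataclasses import dataclass, field

MAX_BATCH = 32
MAX_WAIT_S = 0.01  # trade a little latency for much better GPU utilization


@dataclass
class Request:
    inputs: list
    done: threading.Event = field(default_factory=threading.Event)
    outputs: list = None


pending: queue.Queue = queue.Queue()


def batching_loop(model):
    while True:
        batch = [pending.get()]  # block until at least one request arrives
        deadline = time.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.time() < deadline:
            try:
                batch.append(pending.get(timeout=max(0.0, deadline - time.time())))
            except queue.Empty:
                break
        inputs = [x for r in batch for x in r.inputs]
        outputs = model(inputs)  # one large forward pass on the GPU
        i = 0
        for r in batch:
            r.outputs = outputs[i:i + len(r.inputs)]
            i += len(r.inputs)
            r.done.set()  # wake the waiting request handler
```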
Testing & Deployment - Web Deployment
45. Full Stack Deep Learning
Tensorflow Serving
• Able to serve Tensorflow predictions at high
throughput.
• Used by Google and their “all-in-one” machine
learning offerings
• Overkill unless you know you need to use GPU for
inference
45
[Figure 2: TFS2 (hosted model serving) architecture; §2.2.1 Inter-Request Batching]
46. Full Stack Deep Learning
MXNet Model Server
• Amazon’s answer to Tensorflow Serving
• Part of their SageMaker “all-in-one” machine learning offering.
46
47. Full Stack Deep Learning
Clipper
• Open-source model serving via
REST, using Docker.
• Nice-to-haves on top of the basics
• e.g. “state-of-the-art bandit and
ensemble methods to
intelligently select and combine
predictions”
• Probably too early to be actually
useful (unless you know why you
need it)
47
48. Full Stack Deep Learning
Algorithmia
• Some startups promise effortless model serving
48
49. Full Stack Deep Learning
Seldon
49
https://www.seldon.io/tech/products/core/
Testing & Deployment - Web Deployment
50. Full Stack Deep Learning
Takeaway
• If you are doing CPU inference, you can get away with scaling by launching more servers, or going serverless.
• The dream is deploying Docker as easily as Lambda.
• Next best thing is either Lambda or Docker, depending on your model’s needs.
• If using GPU inference, things like TF Serving and Clipper become useful through adaptive batching, etc.
Testing & Deployment - Web Deployment
52. Full Stack Deep Learning
[Infrastructure & tooling landscape diagram repeated from slide 2, introducing the Monitoring section.]
Testing & Deployment - Monitoring
53. Full Stack Deep Learning
53
[Project structure diagram repeated from slide 4: Data → Training System → Prediction System → Serving System, with Training System Tests, Validation Tests, Functionality Tests, and Monitoring.]
Testing & Deployment - Monitoring
54. Full Stack Deep Learning
54
[ML Test Score tables repeated from slide 8: Data Tests, Model Tests, ML Infrastructure Tests, Monitoring Tests.]
Testing & Deployment - Monitoring
55. Full Stack Deep Learning
Monitoring
• Alarms for when things go
wrong, and records for tuning
things.
• Cloud providers have decent
monitoring solutions.
• Anything that can be logged can be monitored (e.g. data skew)
• Will explore in lab
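As a preview of what such a check can look like, here is a hedged sketch of a simple distribution-shift monitor for one logged feature; the threshold and the alerting hook are illustrative assumptions:

```python
# Sketch: compare a production feature sample against the training distribution.
import numpy as np
from scipy.stats import ks_2samp


def check_feature_drift(train_values: np.ndarray, prod_values: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Return True (and alert) if the production distribution looks shifted."""
    statistic, p_value = ks_2samp(train_values, prod_values)
    drifted = p_value < p_threshold
    if drifted:
        print(f"ALERT: distribution shift (KS={statistic:.3f}, p={p_value:.4f})")
    return drifted


# Example: training distribution vs a shifted production sample
rng = np.random.default_rng(0)
check_feature_drift(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 1_000))
```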
Testing & Deployment - Monitoring
56. Full Stack Deep Learning
Data Distribution Monitoring
• Underserved need!
56
Domino Data Lab
57. Full Stack Deep Learning
Closing the Flywheel
• Important to monitor the business uses of the model, not just its own statistics
• Important to be able to contribute failures back to the dataset
Testing & Deployment - Monitoring
58. Full Stack Deep Learning
Closing the Flywheel
Testing & Deployment - Monitoring
59. Full Stack Deep Learning
Closing the Flywheel
Testing & Deployment - Monitoring
60. Full Stack Deep Learning
Closing the Flywheel
Testing & Deployment - Monitoring
61. Full Stack Deep Learning
Closing the Flywheel
Testing & Deployment - Monitoring
63. Full Stack Deep Learning
[Infrastructure & tooling landscape diagram repeated from slide 2, introducing the Hardware/Mobile section.]
Testing & Deployment - Hardware/Mobile
64. Full Stack Deep Learning
Problems
• Embedded and mobile frameworks are less fully featured than full
PyTorch/Tensorflow
• Have to be careful with architecture
• Interchange format
• Embedded and mobile devices have little memory and slow/expensive
compute
• Have to reduce network size / quantize weights / distill knowledge
Testing & Deployment - Hardware/Mobile
66. Full Stack Deep Learning
Tensorflow Lite
• Not all models will work, but much better than "Tensorflow Mobile" used to be
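A hedged sketch of the conversion path; the stand-in Keras model and output path are assumptions:

```python
# Convert a Keras model to Tensorflow Lite with default post-training quantization.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)  # stand-in for your own network

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```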
Testing & Deployment - Hardware/Mobile
67. Full Stack Deep Learning
PyTorch Mobile
• Optional quantization
• TorchScript: static execution graph
• Libraries for Android and iOS
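A hedged sketch of that workflow; the stand-in torchvision model and file name are assumptions:

```python
# Optional dynamic quantization, then TorchScript tracing, then save for mobile runtimes.
import torch
import torchvision

model = torchvision.models.mobilenet_v2().eval()

# Optional quantization (here: dynamic quantization of Linear layers).
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# TorchScript: a static, Python-free execution graph.
example_input = torch.rand(1, 3, 224, 224)
scripted = torch.jit.trace(quantized, example_input)
scripted.save("mobilenet_v2_mobile.pt")  # loaded from the Android / iOS libraries
```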
67
68. Full Stack Deep Learning
68
https://heartbeat.fritz.ai/core-ml-vs-ml-kit-which-mobile-machine-learning-framework-is-right-for-you-e25c5d34c765
CoreML
- Released at Apple WWDC 2017
- Inference only
- https://coreml.store/
- Supposed to work with both CoreML and MLKit
MLKit
- Announced at Google I/O 2018
- Either via API or on-device
- Offers pre-trained models, or can upload a Tensorflow Lite model
Testing & Deployment - Hardware/Mobile
69. Full Stack Deep Learning
Problems (recap of slide 64)
Testing & Deployment - Hardware/Mobile
70. Full Stack Deep Learning
Interchange Format
• Open Neural Network Exchange: open-source format for deep learning
models
• The dream is to mix different frameworks, such that frameworks that are
good for development (PyTorch) don’t also have to be good at inference
(Caffe2)
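A hedged sketch of the interchange idea — export a model defined in PyTorch and run it in a different runtime (here onnxruntime); the stand-in model and file name are assumptions:

```python
# Export a PyTorch model to ONNX, then run it with a separate inference engine.
import torch
import torchvision
import onnxruntime as ort

model = torchvision.models.resnet18().eval()
dummy_input = torch.rand(1, 3, 224, 224)

# Export the PyTorch graph to the framework-neutral ONNX format.
torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["input"], output_names=["logits"])

# Load and run it with a completely different runtime.
session = ort.InferenceSession("resnet18.onnx")
outputs = session.run(None, {"input": dummy_input.numpy()})
print(outputs[0].shape)  # (1, 1000)
```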
Testing & Deployment - Hardware/Mobile
71. Full Stack Deep Learning
Interchange Format
https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-onnx
Testing & Deployment - Hardware/Mobile
72. Full Stack Deep Learning
Problems (recap of slide 64)
Testing & Deployment - Hardware/Mobile
73. Full Stack Deep Learning
NVIDIA for Embedded
• Todo
Testing & Deployment - Hardware/Mobile
74. Full Stack Deep Learning
Problems (recap of slide 64)
Testing & Deployment - Hardware/Mobile
75. Full Stack Deep Learning
75
MobileNets
[Figure from the linked post comparing convolution blocks: standard, 1x1, Inception, MobileNet]
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Testing & Deployment - Hardware/Mobile
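A hedged sketch of the MobileNet building block — a depthwise separable convolution (per-channel 3x3 depthwise conv followed by a 1x1 pointwise conv), which uses far fewer multiply-adds than a standard 3x3 convolution over all channels. Channel sizes here are illustrative:

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # groups=in_ch -> one 3x3 filter per input channel (depthwise)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # 1x1 convolution mixes channels (pointwise)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))


block = DepthwiseSeparableConv(32, 64)
print(block(torch.rand(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```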
76. Full Stack Deep Learning
76
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Testing & Deployment - Hardware/Mobile
77. Full Stack Deep Learning
Problems (recap of slide 64)
Testing & Deployment - Hardware/Mobile
78. Full Stack Deep Learning
Knowledge distillation
• First, train a large "teacher" network on the data directly
• Next, train a smaller "student" network to match the "teacher" network's
prediction distribution
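A hedged sketch of a Hinton-style distillation loss; the temperature and mixing weight are illustrative:

```python
# Student matches the teacher's softened distribution (KL term) plus the usual hard-label loss.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.9):
    # Soft targets: teacher and student distributions at a higher temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy on the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Example shapes: batch of 8, 10 classes
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```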
78
79. Full Stack Deep Learning
Recommended case study: DistilBERT
79
https://medium.com/huggingface/distilbert-8cf3380435b5