Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
Beyond the Black Box:
Testing and Explaining ML Systems
Full Stack Deep Learning - UC Berkeley Spring 2021
Test score < target performance (in school)
2
Full Stack Deep Learning - UC Berkeley Spring 2021
Test score < target performance (real world)
3
How do we know we can trust it?
Full Stack Deep Learning - UC Berkeley Spring 2021
What’s wrong with the black box?
4
Full Stack Deep Learning - UC Berkeley Spring 2021
What’s wrong with the black box?
• Real-world distribution doesn’t always match the offline distribution

• Drift, shift, and malicious users

• Expected performance doesn’t tell the whole story

• Slices and edge cases

• Your model ≠ your machine learning system

• You’re probably not optimizing the exact metrics you care about
5
What does good test set performance tell us?
If test data and production data come from the same distribution, then in expectation the
performance of your model on your evaluation metrics will be the same
Full Stack Deep Learning - UC Berkeley Spring 2021
How bad is it, really?
6
“I think the single biggest thing holding back the
autonomous vehicle industry today is that, even if we had
a car that worked, no one would know, because no one is
confident that they know how to properly evaluate it”
Former ML engineer at an autonomous vehicle company
Full Stack Deep Learning - UC Berkeley Spring 2021
Goal of this lecture
• Introduce concepts and methods to help you, your team, and your users:

• Understand at a deeper level how well your model is performing

• Become more confident in your model’s ability to perform well in production

• Understand the model’s performance envelope, i.e., where you should expect it
to perform well and where not
7
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing

• Testing machine learning systems

• Explainable and interpretable AI

• What goes wrong in the real world?
8
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
9
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing
• Testing machine learning systems

• Explainable and interpretable AI

• What goes wrong in the real world?
10
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices

• Testing in production

• CI/CD
11
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests
• Best practices

• Testing in production

• CI/CD
12
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Basic types of software tests
• Unit tests: test the functionality of a single piece of the code (e.g., a
single function or class)

• Integration tests: test how two or more units perform when used
together (e.g., test if a model works well with a preprocessing function)

• End-to-end tests: test that the entire system performs when run together,
e.g., on real input from a real user
13
Software testing
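To make the three categories concrete, here is a minimal pytest-style sketch; `normalize` and `SentimentClassifier` are hypothetical names standing in for your own preprocessing function and model wrapper. An end-to-end test would instead hit the deployed service with a real request and check the whole response path.

```python
# test_pipeline.py -- illustrative sketch; module and class names are hypothetical
import numpy as np

from mypackage.preprocessing import normalize        # hypothetical
from mypackage.model import SentimentClassifier      # hypothetical


def test_normalize():
    """Unit test: one function, in isolation."""
    x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
    out = normalize(x)
    assert out.shape == x.shape
    assert np.all(np.isfinite(out))


def test_model_accepts_preprocessed_input():
    """Integration test: preprocessing output feeds the model without errors."""
    model = SentimentClassifier.load("tests/fixtures/tiny_model.pt")
    batch = normalize(np.random.rand(4, 10).astype(np.float32))
    preds = model.predict(batch)
    assert len(preds) == 4
```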
Full Stack Deep Learning - UC Berkeley Spring 2021
Other types of software tests
• Mutation tests

• Contract tests

• Functional tests

• Fuzz tests

• Load tests

• Smoke tests

• Coverage tests

• Acceptance tests

• Property-based tests

• Usability tests

• Benchmarking

• Stress tests

• Shadow tests

• Etc
14
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices
• Testing in production

• CI/CD
15
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Testing best practices
• Automate your tests

• Make sure your tests are reliable, run fast, and go through the same code
review process as the rest of your code

• Enforce that tests must pass before merging into the main branch

• When you find new production bugs, convert them to tests

• Follow the test pyramid
16
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Best practice: the testing pyramid
• Compared to E2E tests, unit tests are:

• Faster

• More reliable

• Better at isolating failures

• Rough split: 70/20/10
17
Software testing
https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Controversial best practice: solitary tests
• Solitary testing (or mocking) doesn’t rely on real data from other units

• Sociable testing makes the implicit assumption that other modules are
working
18
Software testing
https://martinfowler.com/bliki/UnitTest.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Controversial best practice: test coverage
• Percentage of lines of code called in at least one test

• Single metric for quality of testing

• But it doesn’t measure the right things — in particular, test quality
19
Software testing
https://martinfowler.com/bliki/TestCoverage.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Controversial best practice: test-driven development
20
Software testing
https://martinfowler.com/bliki/TestCoverage.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
21
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices

• Testing in production
• CI/CD
22
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Test in prod, really?
23
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Test in prod, really?
• Traditional view:

• The goal of testing is to prevent shipping bugs to production

• Therefore, by definition you must do your testing offline, before your system goes into
production

• However:

• “Informal surveys reveal that the percentage of bugs found by automated tests are
surprisingly low. Projects with significant, well-designed automation efforts report that
regression tests find about 15 percent of the total bugs reported (Marick, online).” —
Lessons Learned in Software Testing

• Modern distributed systems just make it worse
24
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
The case for testing in production
Bugs are inevitable, might as well set up the system so that users can
help you find them
25
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
The case for testing in production
With sufficiently advanced monitoring & enough scale, it’s a realistic
strategy to write code, push it to prod, & watch the error rates. If
something in another part of the app breaks, it’ll be apparent very quickly
in your error rates. You can either fix or roll back. You’re basically letting
your monitoring system play the role that a regression suite & continuous
integration play on other teams.
26
Software testing
https://twitter.com/sarahmei/status/868928631157870592?s=20
Full Stack Deep Learning - UC Berkeley Spring 2021
How to test in prod
27
Software testing
• Canary deployments

• A/B Testing

• Real user monitoring

• Exploratory testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Test in prod, really.
28
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
29
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices

• Testing in production

• CI/CD
30
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
CircleCI / Travis / GitHub Actions / etc.
• SaaS for continuous integration

• Integrate with your repository: every push kicks off a cloud job

• Job can be defined as commands in a Docker container, and can store
results for later review

• No GPUs

• CircleCI / GitHub Actions have free plans

• GitHub Actions is super easy to integrate — we'd start here
31
Full Stack Deep Learning - UC Berkeley Spring 2021
Jenkins / Buildkite
• Nice option for running CI on your own hardware, in the cloud, or mixed

• Good option for a scheduled training test (can use your GPUs)

• Very flexible
32
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
33
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing

• Testing machine learning systems
• Explainable and interpretable AI

• What goes wrong in the real world?
34
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
ML systems are software, so test them like software?
35
Software is…
• Code

• Written by humans to solve a
problem

• Prone to “loud” failures

• Relatively static
… But ML systems are
• Code + data

• Compiled by optimizers to satisfy a
proxy metric

• Prone to silent failures

• Constantly changing
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Common mistakes in testing ML systems
• Only testing the model, not the entire ML system

• Not testing the data

• Not building a granular enough understanding of the performance of the
model before deploying

• Not measuring the relationship between model performance metrics and
business metrics

• Relying too much on automated testing

• Thinking offline testing is enough — not monitoring or testing in prod
36
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Common mistakes in testing ML systems
• Only testing the model, not the entire ML system
• Not testing the data

• Not building a granular enough understanding of the performance of the
model before deploying

• Not measuring the relationship between model performance metrics and
business metrics

• Relying too much on automated testing

• Thinking offline testing is enough — not monitoring or testing in prod
37
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
38
[System diagram: the ML model is only one component of the ML system. Online: serving system and prediction system (the model lives inside the prediction system). Offline: storage and preprocessing system, labeling system, and training system. Prod data, model inputs, predictions, and feedback flow between them.]
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
39
[Same system diagram, highlighting infrastructure tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Infrastructure tests — unit tests for training code
• Goal: avoid bugs in your training pipelines

• How?
• Unit test your training code like you would any other code

• Add single batch or single epoch tests that check performance after
an abbreviated training run on a tiny dataset

• Run frequently during development
40
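A common single-batch test, sketched below with PyTorch and a throwaway model: if the training step is wired correctly (loss, optimizer, labels, gradients), a few hundred steps should memorize one tiny batch. Swap in your real model and training step; the architecture and threshold here are placeholders.

```python
import torch
import torch.nn as nn


def test_model_can_overfit_single_batch():
    torch.manual_seed(0)
    # Placeholder model and data; substitute your real training components.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))

    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(300):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    # If the pipeline is wired correctly, a tiny batch should be memorized.
    assert loss.item() < 0.05
```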
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
41
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
42
[Same system diagram, highlighting training tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Training tests
• Goal: make sure training is reproducible

• How?
• Pull a fixed dataset and run a full or abbreviated training run

• Check to make sure model performance remains consistent

• Consider pulling a sliding window of data

• Run periodically (nightly for frequently changing codebases)
43
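A hedged sketch of such a training test; `run_training`, the dataset path, and the tolerance are placeholders for your own pipeline entry point, and the reference metrics file is something you would commit alongside the test and refresh deliberately.

```python
# nightly_training_test.py -- sketch; run_training() stands in for your pipeline entry point
import json
from pathlib import Path

from mypipeline.train import run_training  # hypothetical

REFERENCE = json.loads(Path("tests/reference_metrics.json").read_text())


def test_abbreviated_training_is_consistent():
    metrics = run_training(
        dataset="s3://bucket/fixed-eval-snapshot",  # fixed (or sliding-window) dataset
        max_epochs=1,
        seed=0,
    )
    # Allow some slack: training is stochastic even with a fixed seed on GPUs.
    assert metrics["val_accuracy"] >= REFERENCE["val_accuracy"] - 0.02
```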
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
44
[Same system diagram, highlighting functionality tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Functionality tests: unit tests for prediction code
• Goal: avoid regressions in the code that makes up your prediction infra

• How?
• Unit test your prediction code like you would any other code

• Load a pretrained model and test prediction on a few key examples

• Run frequently during development
45
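For instance (all names hypothetical), load a pinned checkpoint once per test module and assert on a handful of canonical examples you never want to regress on:

```python
import pytest

from myservice.predictor import Predictor  # hypothetical prediction wrapper

KEY_EXAMPLES = [
    ("this product is great", "positive"),
    ("terrible, would not buy again", "negative"),
]


@pytest.fixture(scope="module")
def predictor():
    # A small pinned checkpoint committed (or downloaded) for tests.
    return Predictor.from_checkpoint("tests/fixtures/model-v12.pt")


@pytest.mark.parametrize("text,expected", KEY_EXAMPLES)
def test_key_examples(predictor, text, expected):
    assert predictor.predict(text) == expected
```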
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
46
[Same system diagram, highlighting evaluation tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Evaluation tests
• Goal: make sure a new model is ready to go into production

• How?
• Evaluate your model on all of the metrics, datasets, and slices that you
care about

• Compare the new model to the old one and your baselines

• Understand the performance envelope of the new model

• Run every time a new candidate model is created and considered for
production
47
Full Stack Deep Learning - UC Berkeley Spring 2021
Eval tests: more than just your val score
• Look at all of the metrics you care about
• Model metrics (precision, recall, accuracy, L2, etc)

• Behavioral tests

• Robustness tests

• Privacy and fairness tests

• Simulation tests
48
Full Stack Deep Learning - UC Berkeley Spring 2021
Behavioral testing metrics
• Goal: make sure the model has the invariances
we expect

• Types:

• Invariance tests: assert that change in inputs
shouldn’t affect outputs

• Directional tests: assert that change in inputs
should affect outputs

• Min functionality tests: certain inputs and
outputs should always produce a given result

• Mostly used in NLP
49
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
https://arxiv.org/abs/2005.04118
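A minimal sketch of the three test types for a sentiment model, in the spirit of CheckList; `predict_sentiment` is a hypothetical wrapper that returns a score in [0, 1], and the thresholds are illustrative.

```python
from myproject.model import predict_sentiment  # hypothetical: returns a score in [0, 1]


def test_invariance_to_names():
    # Changing a person's name should not change the prediction (much).
    a = predict_sentiment("Mark was a great driver.")
    b = predict_sentiment("Maria was a great driver.")
    assert abs(a - b) < 0.05


def test_directional_negation():
    # Adding a negation should move the prediction in the expected direction.
    assert predict_sentiment("The food was not good.") < predict_sentiment("The food was good.")


def test_minimum_functionality():
    # Certain inputs should always produce a given result.
    assert predict_sentiment("I absolutely love this!") > 0.9
```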
Full Stack Deep Learning - UC Berkeley Spring 2021
Robustness metrics
• Goal: understand the performance envelope of the model, i.e., where you
should expect the model to fail, e.g.,

• Feature importance

• Sensitivity to staleness

• Sensitivity to drift (how to measure this?)

• Correlation between model performance and business metrics (on an old
version of the model)

• Very underrated form of testing
50
Full Stack Deep Learning - UC Berkeley Spring 2021
Privacy and fairness metrics
51
Fairness Definitions Explained (https://fairware.cs.umass.edu/papers/Verma.pdf)
https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Simulation tests
• Goal: understand how the performance
of the model could affect the rest of the
system

• Useful when your model affects the
world

• Main use is AVs and robotics

• This is hard: need a model of the world,
dataset of scenarios, etc
52
Full Stack Deep Learning - UC Berkeley Spring 2021
Eval tests: more than just your val score
• Evaluate metrics on multiple slices of data
• Slices = mappings of your data to categories

• E.g., value of the “country” feature

• E.g., “age” feature within intervals {0-18, 18-30, 30-45, 45-65, 65+}

• E.g., arbitrary other machine learning model like a noisy cat classifier
53
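A sketch of slice-based evaluation with pandas and scikit-learn, assuming a validation DataFrame with binary `label` and `pred` columns plus whatever columns you slice on:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score


def metrics_by_slice(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """Assumes binary 'label' and 'pred' columns plus the column to slice on."""
    rows = []
    for value, group in df.groupby(slice_col):
        rows.append({
            slice_col: value,
            "n": len(group),
            "accuracy": accuracy_score(group["label"], group["pred"]),
            "precision": precision_score(group["label"], group["pred"]),
            "recall": recall_score(group["label"], group["pred"]),
        })
    return pd.DataFrame(rows)


# Usage sketch (val_df is your validation DataFrame):
# val_df["age_bucket"] = pd.cut(val_df["age"], bins=[0, 18, 30, 45, 65, 120])
# print(metrics_by_slice(val_df, "country"))
# print(metrics_by_slice(val_df, "age_bucket"))
```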
Full Stack Deep Learning - UC Berkeley Spring 2021
Surfacing slices
SliceFinder: clever way of computing metrics across
all slices & surfacing the worst performers
54
Automated Data Slicing for Model Validation: A Big Data - AI Integration Approach
What-if tool: visual exploration tool for model
performance slicing
The What-If Tool: Interactive Probing of Machine
Learning Models
Full Stack Deep Learning - UC Berkeley Spring 2021
Eval tests: more than just your val score
• Maintain evaluation datasets for all of the distinct data distributions you need to
measure
• Your main validation set should mirror your test distribution

• When to add new datasets?

• When you collect datasets to specify specific edge cases (e.g., “left hand turn”
dataset)

• When you run your production model on multiple modalities (e.g., “San
Francisco” and “Miami” datasets, English and French datasets)

• When you augment your training set with data not found in production (e.g., synthetic data)
55
Full Stack Deep Learning - UC Berkeley Spring 2021
Evaluation reports
56
Main eval set:
Slice      Accuracy  Precision  Recall
Aggregate  90%       91%        89%
Age <18    87%       89%        90%
Age 18-45  85%       87%        79%
Age 45+    92%       95%        90%

NYC users:
Slice           Accuracy  Precision  Recall
Aggregate       94%       91%        90%
New accounts    87%       89%        90%
Frequent users  85%       87%        79%
Churned users   50%       36%        70%
Full Stack Deep Learning - UC Berkeley Spring 2021
Deciding whether the evaluations pass
• Compare the new model to the previous model and to a fixed older model
• Set threshold on the diffs between new and old for most metrics

• Set thresholds on the diffs between slices

• Set thresholds against the fixed older model to prevent slow performance “leaks” over successive releases
57
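A minimal sketch of that pass/fail logic; the metric dictionaries and thresholds are placeholders you would tune per metric and slice.

```python
def evaluation_passes(new: dict, previous: dict, pinned: dict,
                      max_regression: float = 0.01,
                      max_regression_vs_pinned: float = 0.03) -> bool:
    """Each argument maps (slice, metric) -> value, e.g. ('age<18', 'recall'): 0.90."""
    for key, new_value in new.items():
        if key in previous and new_value < previous[key] - max_regression:
            return False  # regression vs. the model currently in production
        if key in pinned and new_value < pinned[key] - max_regression_vs_pinned:
            return False  # slow "leak" relative to a fixed older model
    return True
```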
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
58
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
59
[Same system diagram, highlighting shadow tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Shadow tests: catch production bugs before they hit users
• Goals:
• Detect bugs in the production deployment

• Detect inconsistencies between the offline and online model

• Detect issues that appear on production data

• How?
• Run new model in production system, but don’t return predictions to users

• Save the data and run the offline model on it

• Inspect the prediction distributions for old vs new and offline vs online for inconsistencies
60
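For the "inspect the prediction distributions" step, one hedged sketch: compare the shadow (online) scores with the offline model's scores on the same logged inputs, both elementwise and at the distribution level (here with SciPy's two-sample KS test). The thresholds are arbitrary examples.

```python
import numpy as np
from scipy.stats import ks_2samp


def check_offline_online_consistency(offline_scores: np.ndarray,
                                     online_scores: np.ndarray,
                                     p_threshold: float = 0.01) -> None:
    """Both arrays hold prediction scores for the same logged production inputs."""
    # Large elementwise gaps usually indicate preprocessing or serialization skew.
    max_gap = np.max(np.abs(offline_scores - online_scores))
    # A distribution-level check catches subtler skew.
    stat, p_value = ks_2samp(offline_scores, online_scores)
    assert max_gap < 1e-3, f"offline/online predictions diverge (max gap {max_gap:.4f})"
    assert p_value > p_threshold, f"prediction distributions differ (KS p={p_value:.4f})"
```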
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
61
[Same system diagram, highlighting A/B tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
AB Testing: test users’ reaction to the new model
• Goals:
• Understand how your new model affects user
and business metrics

• How?
• Start by “canarying” model on a tiny fraction of
data

• Consider using a more statistically principled
split

• Compare metrics on the two cohorts
62
https://levelup.gitconnected.com/the-engineering-problem-of-a-b-testing-ac1adfd492a8
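For a binary business metric such as conversion, comparing the two cohorts can be as simple as a two-proportion z-test; the counts below are made-up numbers, and statsmodels is just one of several libraries that provide the test.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical cohort outcomes: conversions and number of users in each arm.
conversions = [310, 355]   # [control (old model), treatment (new model)]
users = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(f"z={z_stat:.2f}, p={p_value:.3f}")
# If p is below your significance level (and the effect is in the right direction),
# the new model is improving the business metric for real users.
```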
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
63
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
64
[Same system diagram, highlighting monitoring (covered next lecture).]
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
65
[Same system diagram, highlighting labeling tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Labeling tests: get more reliable labels
• Goals:
• Catch poor-quality labels before they corrupt your model

• How?
• Train and certify the labelers

• Aggregate labels of multiple labelers

• Assign labelers a “trust score” based on how often they are wrong

• Manually spot check the labels from your labeling service

• Run a previous model on new labels and inspect the ones with biggest disagreement
66
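A sketch of the aggregation and trust-score ideas: take the majority vote per example, then score each labeler by how often they agree with that consensus.

```python
from collections import Counter, defaultdict


def aggregate_labels(labels):
    """labels: iterable of (example_id, labeler_id, label) tuples from your labeling service."""
    by_example = defaultdict(list)
    for example_id, labeler_id, label in labels:
        by_example[example_id].append((labeler_id, label))

    # Majority vote per example.
    consensus = {
        ex: Counter(l for _, l in votes).most_common(1)[0][0]
        for ex, votes in by_example.items()
    }

    # Trust score: fraction of a labeler's labels that match the consensus.
    agree, total = defaultdict(int), defaultdict(int)
    for ex, votes in by_example.items():
        for labeler_id, label in votes:
            total[labeler_id] += 1
            agree[labeler_id] += int(label == consensus[ex])
    trust = {labeler: agree[labeler] / total[labeler] for labeler in total}
    return consensus, trust
```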
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
67
[Same system diagram, highlighting expectation tests.]
Full Stack Deep Learning - UC Berkeley Spring 2021
Expectation tests: unit tests for data
• Goals:
• Catch data quality issues before they make their way into your pipeline

• How?
• Define rules about properties of each of your data tables at each stage
in your data cleaning and preprocessing pipeline

• Run them when you run batch data pipeline jobs
68
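A minimal sketch of such expectations as plain pandas assertions on a hypothetical user-events table; tools like Great Expectations (next slide) package the same idea as declarative, versioned suites.

```python
import pandas as pd


def check_expectations(df: pd.DataFrame) -> None:
    """Example rules for a hypothetical user-events table; tune to your own schema."""
    assert len(df) > 0, "table is empty"
    assert df["user_id"].notnull().all(), "null user_id values"
    assert df["user_id"].is_unique, "duplicate user_id values"
    assert df["age"].between(0, 120).all(), "age out of expected range"
    assert set(df["country"].unique()) <= {"US", "CA", "GB", "FR"}, "unexpected country code"
    assert df["signup_ts"].is_monotonic_increasing, "rows not sorted by signup time"
```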
Full Stack Deep Learning - UC Berkeley Spring 2021
What kinds of expectations?
69
https://greatexpectations.io/
Full Stack Deep Learning - UC Berkeley Spring 2021
How to set expectations?
70
• Manually

• Profile a dataset you expect to be high quality and use those thresholds

• Both
Full Stack Deep Learning - UC Berkeley Spring 2021
Challenges operationalizing ML tests
• Organizational: data science teams don’t always have the same norms
around testing & code review

• Infrastructure: cloud CI/CD platforms don’t have great support for GPUs
or data integrations

• Tooling: no standard toolset for comparing model performance, finding
slices, etc

• Decision making: Hard to determine when test performance is “good
enough”
71
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
How to test your ML system the right way
• Test each part of the ML system, not just the model

• Test code, data, and model performance, not just code

• Testing model performance is an art, not a science

• The goal of testing model performance is to build a granular understanding of how well
your model performs, and where you don’t expect it to perform well

• Build up to this gradually! Start here:

• Infrastructure tests

• Evaluation tests

• Expectation tests
72
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
73
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing

• Testing machine learning systems

• Explainable and interpretable AI
• What goes wrong in the real world?
74
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Some definitions
• Domain predictability: the degree to which it is possible to detect data
outside the model’s domain of competence
• Interpretability: the degree to which a human can consistently predict the
model's result [1]

• Explainability: the degree to which a human can understand the cause of
a decision [2]
75
Explainability
[1] Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. "Examples are not enough, learn to criticize! Criticism for interpretability." Advances in Neural Information Processing Systems (2016).
[2] Miller, Tim. "Explanation in artificial intelligence: Insights from the social sciences." arXiv Preprint arXiv:1706.07269. (2017).
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction (feature
importance, SHAP)

• Understand the contribution of training data points to the prediction
76
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models
• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction (feature
importance, SHAP)

• Understand the contribution of training data points to the prediction
77
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Interpretable families of models
• Linear regression

• Logistic regression

• Generalized linear models

• Decision trees
78
Explainability
Interpretable and explainable
(to the right user)
Not very powerful
Full Stack Deep Learning - UC Berkeley Spring 2021
What about attention?
79
Explainability
Interpretable
Not explainable
Full Stack Deep Learning - UC Berkeley Spring 2021
Attention maps are not reliable explanations
80
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
81
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one
• Understand the contribution of features to the prediction (feature
importance, SHAP)

• Understand the contribution of training data points to the prediction
82
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Surrogate models
• Train an interpretable surrogate
model on your data and your
original model’s predictions

• Use the surrogate’s interpretation
as a proxy for understanding the
underlying model
83
Explainability
Easy to use, general
If your surrogate is good, why not use it? And how do we know it decides the same way as the original model?
Full Stack Deep Learning - UC Berkeley Spring 2021
Local surrogate models (LIME)
• Pick a data point to explain the
prediction for

• Perturb the data point and query
the model for the prediction

• Train a surrogate on the perturbed
dataset
84
Explainability
Widely used in practice.
Works for all data types
Hard to define the perturbations.
Explanations are unstable and
can be manipulated
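A hand-rolled sketch of the LIME recipe for tabular data (the `lime` package implements this properly, with smarter perturbations): perturb the row, query the black-box model, and fit a distance-weighted linear surrogate whose coefficients act as the local explanation. The noise scale and kernel width are illustrative knobs.

```python
import numpy as np
from sklearn.linear_model import Ridge


def local_surrogate_explanation(predict_fn, x, n_samples=2000, scale=0.3, kernel_width=1.0):
    """predict_fn maps an (n, d) array to scores/probabilities for the class of interest."""
    rng = np.random.default_rng(0)
    # 1. Perturb the data point.
    perturbed = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # 2. Query the black-box model.
    y = predict_fn(perturbed)
    # 3. Weight samples by proximity to the original point.
    distances = np.linalg.norm(perturbed - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # 4. Fit an interpretable surrogate; its coefficients are the local explanation.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbed, y, sample_weight=weights)
    return surrogate.coef_
```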
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction
• Understand the contribution of training data points to the prediction
85
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Data viz for feature importance
86
Explainability
Partial dependence plot Individual conditional expectation
Full Stack Deep Learning - UC Berkeley Spring 2021
Permutation feature importance
• Pick a feature

• Randomize its order in the dataset
and see how that affects
performance
87
Explainability
Easy to use. Widely used in
practice
Doesn’t work for high-dim
data. Doesn’t capture feature
interdependence
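scikit-learn ships this as `permutation_importance`; a small self-contained example on a toy dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature column on the held-out set and measure the accuracy drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, mean, std in sorted(
    zip(X.columns, result.importances_mean, result.importances_std),
    key=lambda t: -t[1],
)[:5]:
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```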
Full Stack Deep Learning - UC Berkeley Spring 2021
SHAP (SHapley Additive exPlanations)
• Intuition:

• How much does the presence of
this feature affect the value of
the classifier, independent of
what other features are present?

88
Explainability
Works for a variety of data.
Mathematically principled
Tricky to implement. Still
doesn’t provide explanations
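A short sketch with the `shap` package, assuming a tree-based model (TreeExplainer is the fast path for tree ensembles); a regression target keeps the output shape simple.

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:500])  # (n_samples, n_features)

# Global view: which features move predictions, and in which direction.
shap.summary_plot(shap_values, X.iloc[:500])
```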
Full Stack Deep Learning - UC Berkeley Spring 2021
Gradient-based saliency maps
• Pick an input (e.g., image)

• Perform a forward pass

• Compute the gradient with respect
to the pixels

• Visualize the gradients
89
Explainability
Easy to use. Many variants
that produce more
interpretable results.
Difficult to know whether the
explanation is correct.
Fragile. Unreliable.
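A minimal PyTorch sketch (torchvision >= 0.13 for the `weights` argument; the image path is a placeholder): backpropagate the top class score to the input pixels and take the channel-wise max of the absolute gradient.

```python
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(weights="IMAGENET1K_V1").eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Any RGB image works for the sketch; "example.jpg" is a placeholder path.
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
img.requires_grad_(True)

# Forward pass, then backprop the top class score to the input pixels.
scores = model(img)
scores[0, scores.argmax()].backward()

# Saliency = max absolute gradient across color channels, one value per pixel.
saliency = img.grad.abs().max(dim=1)[0].squeeze()  # (224, 224), ready to visualize
```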
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
90
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction

• Understand the contribution of training data points to the prediction
91
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Prototypes and criticisms
92
Full Stack Deep Learning - UC Berkeley Spring 2021
Do you need “explainability”?
• Regulators say my model needs to be explainable

• What do they mean?

• But yeah you should probably do explainability

• Users want my model to be explainable

• Can a better product experience alleviate the need for explainability?

• How often will they be interacting with the model? If frequently, maybe interpretability is enough?

• I want to deploy my models to production more confidently

• You really want domain predictability

• However, interpretable visualizations could be a helpful tool for debugging
93
Full Stack Deep Learning - UC Berkeley Spring 2021
Can you explain deep learning models?
• Not really

• Known explanation methods aren’t faithful to original model
performance

• Known explanation methods are unreliable

• Known explanations don’t fully explain models’ decisions
94
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (https://arxiv.org/pdf/1811.10154.pdf)
Full Stack Deep Learning - UC Berkeley Spring 2021
Conclusion
• If you need to explain your model’s predictions, use an interpretable
model family

• Explanations for deep learning models can produce interesting results but
are unreliable

• Interpretability methods like SHAP, LIME, etc can help users reach the
interpretability threshold faster

• Interpretable visualizations can be helpful for debugging
95
Full Stack Deep Learning - UC Berkeley Spring 2021
Thank you!
96

More Related Content

What's hot

Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
Weaveworks
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
doppenhe
 
generative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsgenerative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language models
AdventureWorld5
 
“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...
“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...
“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...
Edge AI and Vision Alliance
 

What's hot (20)

MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Intro to LLMs
Intro to LLMsIntro to LLMs
Intro to LLMs
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
220206 transformer interpretability beyond attention visualization
220206 transformer interpretability beyond attention visualization220206 transformer interpretability beyond attention visualization
220206 transformer interpretability beyond attention visualization
 
generative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsgenerative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language models
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
MLOps in action
MLOps in actionMLOps in action
MLOps in action
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
 
“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...
“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...
“Responsible AI: Tools and Frameworks for Developing AI Solutions,” a Present...
 
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIMEUnified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023
 
IoT platforms – comparison Azure IoT vs AWS IoT
IoT platforms – comparison Azure IoT vs AWS IoTIoT platforms – comparison Azure IoT vs AWS IoT
IoT platforms – comparison Azure IoT vs AWS IoT
 

Similar to Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)

Software Project Management lecture 10
Software Project Management lecture 10Software Project Management lecture 10
Software Project Management lecture 10
Syed Muhammad Hammad
 

Similar to Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021) (20)

OOSE Unit 5 PPT.ppt
OOSE Unit 5 PPT.pptOOSE Unit 5 PPT.ppt
OOSE Unit 5 PPT.ppt
 
Oose unit 5 ppt
Oose unit 5 pptOose unit 5 ppt
Oose unit 5 ppt
 
Software Testing Future and Challenges
Software Testing Future and ChallengesSoftware Testing Future and Challenges
Software Testing Future and Challenges
 
Ch11lect1 ud
Ch11lect1 udCh11lect1 ud
Ch11lect1 ud
 
A Software Testing Intro
A Software Testing IntroA Software Testing Intro
A Software Testing Intro
 
How companies test their software before released to the digital market.pptx
How companies test their software before released to the digital market.pptxHow companies test their software before released to the digital market.pptx
How companies test their software before released to the digital market.pptx
 
Software Project Management lecture 10
Software Project Management lecture 10Software Project Management lecture 10
Software Project Management lecture 10
 
Automated testing 101
Automated testing 101Automated testing 101
Automated testing 101
 
Software engineering practices and software quality empirical research results
Software engineering practices and software quality empirical research resultsSoftware engineering practices and software quality empirical research results
Software engineering practices and software quality empirical research results
 
SplunkLive! London 2016 Splunk for Devops
SplunkLive! London 2016 Splunk for DevopsSplunkLive! London 2016 Splunk for Devops
SplunkLive! London 2016 Splunk for Devops
 
Chapter 3 Reducing Risks Using CI
Chapter 3 Reducing Risks Using CIChapter 3 Reducing Risks Using CI
Chapter 3 Reducing Risks Using CI
 
Software Testing interview - Q&A and tips
Software Testing interview - Q&A and tipsSoftware Testing interview - Q&A and tips
Software Testing interview - Q&A and tips
 
Software Engineering (Testing Overview)
Software Engineering (Testing Overview)Software Engineering (Testing Overview)
Software Engineering (Testing Overview)
 
Staroletov testing TDD BDD MBT
Staroletov testing TDD BDD MBTStaroletov testing TDD BDD MBT
Staroletov testing TDD BDD MBT
 
Continuous Deployment of your Application @JUGtoberfest
Continuous Deployment of your Application @JUGtoberfestContinuous Deployment of your Application @JUGtoberfest
Continuous Deployment of your Application @JUGtoberfest
 
An introduction to Software Testing and Test Management
An introduction to Software Testing and Test ManagementAn introduction to Software Testing and Test Management
An introduction to Software Testing and Test Management
 
SOFTWARE TESTING W1_watermark.pdf
SOFTWARE TESTING W1_watermark.pdfSOFTWARE TESTING W1_watermark.pdf
SOFTWARE TESTING W1_watermark.pdf
 
Introduction To Testing by enosislearning.com
Introduction To Testing by enosislearning.com Introduction To Testing by enosislearning.com
Introduction To Testing by enosislearning.com
 
Software engineering 20 software testing
Software engineering 20 software testingSoftware engineering 20 software testing
Software engineering 20 software testing
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
 

More from Sergey Karayev

Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
Sergey Karayev
 

More from Sergey Karayev (19)

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep Learning
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep Learning
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)

  • 1. Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel Beyond the Black Box: Testing and Explaining ML Systems
  • 2. Full Stack Deep Learning - UC Berkeley Spring 2021 Test score < target performance (in school) 2
  • 3. Full Stack Deep Learning - UC Berkeley Spring 2021 Test score < target performance (real world) 3 How do we know we can trust it?
  • 4. Full Stack Deep Learning - UC Berkeley Spring 2021 What’s wrong with the black box? 4
  • 5. Full Stack Deep Learning - UC Berkeley Spring 2021 What’s wrong with the black box? • Real-world distribution doesn’t always match the offline distribution • Drift, shift, and malicious users • Expected performance doesn’t tell the whole story • Slices and edge cases • Your model your machine learning system • You’re probably not optimizing the exact metrics you care about ≠ 5 What does good test set performance tell us? If test data and production data come from the same distribution, then in expectation the performance of your model on your evaluation metrics will be the same
  • 6. Full Stack Deep Learning - UC Berkeley Spring 2021 How bad is it, really? 6 “I think the single biggest thing holding back the autonomous vehicle industry today is that, even if we had a car that worked, no one would know, because no one is confident that they know how to properly evaluate it” Former ML engineer at autonomous vehicle company
  • 7. Full Stack Deep Learning - UC Berkeley Spring 2021 Goal of this lecture • Introduce concepts and methods to help you, your team, and your users: • Understand at a deeper level how well your model is performing • Become more confident in your model’s ability to perform well in production • Understand the model’s performance envelope, i.e., where you should expect it to perform well and where not 7
  • 8. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 8
  • 9. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 9
  • 10. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 10
  • 11. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 11 Software testing
  • 12. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 12 Software testing
  • 13. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic types of software tests • Unit tests: test the functionality of a single piece of the code (e.g., a single function or class) • Integration tests: test how two or more units perform when used together (e.g., test if a model works well with a preprocessing function) • End-to-end tests: test that the entire system performs when run together, e.g., on real input from a real user 13 Software testing
  • 14. Full Stack Deep Learning - UC Berkeley Spring 2021 Other types of software tests • Mutation tests • Contract tests • Functional tests • Fuzz tests • Load tests • Smoke tests • Coverage tests • Acceptance tests • Property-base tests • Usability tests • Benchmarking • Stress tests • Shadow tests • Etc 14 Software testing
  • 15. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 15 Software testing
  • 16. Full Stack Deep Learning - UC Berkeley Spring 2021 Testing best practices • Automate your tests • Make sure your tests are reliable, run fast, and go through the same code review process as the rest of your code • Enforce that tests must pass before merging into the main branch • When you find new production bugs, convert them to tests • Follow the test pyramid 16 Software testing
  • 17. Full Stack Deep Learning - UC Berkeley Spring 2021 Best practice: the testing pyramid • Compared to E2E tests, unit tests are: • Faster • More reliable • Better at isolating failures • Rough split: 70/20/10 17 Software testing https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
  • 18. Full Stack Deep Learning - UC Berkeley Spring 2021 Controversial best practice: solitary tests • Solitary testing (or mocking) doesn’t rely on real data from other units • Sociable testing makes the implicit assumption that other modules are working 18 Software testing https://martinfowler.com/bliki/UnitTest.html
  • 19. Full Stack Deep Learning - UC Berkeley Spring 2021 Controversial best practice: test coverage • Percentage of lines of code called in at least one test • Single metric for quality of testing • But it doesn’t measure the right things — in particular, test quality 19 Software testing https://martinfowler.com/bliki/TestCoverage.html
  • 20. Full Stack Deep Learning - UC Berkeley Spring 2021 Controversial best practice: test-driven development 20 Software testing https://martinfowler.com/bliki/TestCoverage.html
  • 21. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 21
  • 22. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 22 Software testing
  • 23. Full Stack Deep Learning - UC Berkeley Spring 2021 Test in prod, really? 23 Software testing
  • 24. Full Stack Deep Learning - UC Berkeley Spring 2021 Test in prod, really? • Traditional view: • The goal of testing is to prevent shipping bugs to production • Therefore, by definition you must do your testing offline, before your system goes into production • However: • “Informal surveys reveal that the percentage of bugs found by automated tests are surprisingly low. Projects with significant, well-designed automation efforts report that regression tests find about 15 percent of the total bugs reported (Marick, online).” — Lessons Learned in Software Testing • Modern distributed systems just make it worse 24 Software testing
  • 25. Full Stack Deep Learning - UC Berkeley Spring 2021 The case for testing in production Bugs are inevitable, might as well set up the system so that users can help you find them 25 Software testing
  • 26. Full Stack Deep Learning - UC Berkeley Spring 2021 The case for testing in production With sufficiently advanced monitoring & enough scale, it’s a realistic strategy to write code, push it to prod, & watch the error rates. If something in another part of the app breaks, it’ll be apparent very quickly in your error rates. You can either fix or roll back. You’re basically letting your monitoring system play the role that a regression suite & continuous integration play on other teams. 26 Software testing https://twitter.com/sarahmei/status/868928631157870592?s=20
  • 27. Full Stack Deep Learning - UC Berkeley Spring 2021 How to test in prod 27 Software testing • Canary deployments • A/B Testing • Real user monitoring • Exploratory testing
  • 28. Full Stack Deep Learning - UC Berkeley Spring 2021 Test in prod, really. 28 Software testing
  • 29. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 29
  • 30. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 30 Software testing
  • 31. Full Stack Deep Learning - UC Berkeley Spring 2021 CircleCI / Travis / Github Actions / etc • SaaS for continuous integration • Integrate with your repository: every push kicks off a cloud job • Job can be defined as commands in a Docker container, and can store results for later review • No GPUs • CircleCI / Github Actions have free plans • Github actions is super easy to integrate — would start here 31
  • 32. Full Stack Deep Learning - UC Berkeley Spring 2021 Jenkins / Buildkite • Nice option for running CI on your own hardware, in the cloud, or mixed • Good option for a scheduled training test (can use your GPUs) • Very flexible 32
  • 33. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 33
  • 34. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 34 ML Testing
  • 35. Full Stack Deep Learning - UC Berkeley Spring 2021 ML systems are software, so test them like software? 35 Software is… • Code • Written by humans to solve a problem • Prone to “loud” failures • Relatively static … But ML systems are • Code + data • Compiled by optimizers to satisfy a proxy metric • Prone to silent failures • Constantly changing ML Testing
  • 36. Full Stack Deep Learning - UC Berkeley Spring 2021 Common mistakes in testing ML systems • Only testing the model, not the entire ML system • Not testing the data • Not building a granular enough understanding of the performance of the model before deploying • Not measuring the relationship between model performance metrics and business metrics • Relying too much on automated testing • Thinking offline testing is enough — not monitoring or testing in prod 36 ML Testing
  • 37. Full Stack Deep Learning - UC Berkeley Spring 2021 Common mistakes in testing ML systems • Only testing the model, not the entire ML system • Not testing the data • Not building a granular enough understanding of the performance of the model before deploying • Not measuring the relationship between model performance metrics and business metrics • Relying too much on automated testing • Thinking offline testing is enough — not monitoring or testing in prod 37 ML Testing
  • 38. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 38 ML Model Prediction system Serving system Storage and preprocessing system Labeling system Model inputs Predictions Feedback Training system Prod data Offline Online
  • 39. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 39 ML Model Prediction system Serving system Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online Training system Infrastructure tests
  • 40. Full Stack Deep Learning - UC Berkeley Spring 2021 Infrastructure tests — unit tests for training code • Goal: avoid bugs in your training pipelines • How? • Unit test your training code like you would any other code • Add single batch or single epoch tests that check performance after an abbreviated training run on a tiny dataset • Run frequently during development 40
  • 41. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 41
  • 42. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 42 ML Model Prediction system Serving system Labeling system Model inputs Predictions Feedback Prod data Offline Online Training system Training tests Storage and preprocessing system
  • 43. Full Stack Deep Learning - UC Berkeley Spring 2021 Training tests • Goal: make sure training is reproducible • How? • Pull a fixed dataset and run a full or abbreviated training run • Check to make sure model performance remains consistent • Consider pulling a sliding window of data • Run periodically (nightly for frequently changing codebases) 43
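A sketch of what a periodic training test might look like; train_and_evaluate, the dataset path, the reference metric, and the tolerance are all placeholders for your own pipeline.

```python
# Nightly training test: abbreviated run on a frozen dataset, compared
# against a stored reference metric so regressions don't slip in silently.
TOLERANCE = 0.02                        # assumed acceptable accuracy drop
REFERENCE = {"val_accuracy": 0.90}      # in practice, load from versioned storage


def train_and_evaluate(dataset_path: str, max_epochs: int) -> dict:
    # Placeholder: call your real training loop on the frozen dataset and
    # return its validation metrics. A constant keeps this sketch self-contained.
    return {"val_accuracy": 0.91}


def test_training_is_reproducible():
    metrics = train_and_evaluate("tests/fixtures/frozen_dataset", max_epochs=1)
    # Performance on the frozen dataset should not regress beyond the tolerance.
    assert metrics["val_accuracy"] >= REFERENCE["val_accuracy"] - TOLERANCE
```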
  • 44. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 44 [Same system diagram, highlighting Functionality tests on the prediction system.]
  • 45. Full Stack Deep Learning - UC Berkeley Spring 2021 Functionality tests: unit tests for prediction code • Goal: avoid regressions in the code that makes up your prediction infra • How? • Unit test your prediction code like you would any other code • Load a pretrained model and test prediction on a few key examples • Run frequently during development 45
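A sketch of such tests with pytest; FakeSentimentModel and its predict() method stand in for your real model-loading and prediction code.

```python
# Functionality tests: load the model once per test module and check
# predictions on a few "golden" examples.
import pytest


class FakeSentimentModel:
    """Stand-in for a pretrained model loaded the same way the service loads it."""

    def predict(self, texts):
        # Toy logic so the sketch runs end to end; replace with real inference.
        return [0.9 if "love" in t or "great" in t else 0.1 for t in texts]


@pytest.fixture(scope="module")
def model():
    # In practice: load the pretrained artifact from disk or a model registry.
    return FakeSentimentModel()


def test_prediction_shape_and_range(model):
    probs = model.predict(["the food was great"])
    assert len(probs) == 1
    assert 0.0 <= probs[0] <= 1.0


def test_known_positive_example(model):
    # A golden example the model should never get wrong.
    assert model.predict(["I love this product"])[0] > 0.5
```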
  • 46. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 46 [Same system diagram, highlighting Evaluation tests on the prediction system.]
  • 47. Full Stack Deep Learning - UC Berkeley Spring 2021 Evaluation tests • Goal: make sure a new model is ready to go into production • How? • Evaluate your model on all of the metrics, datasets, and slices that you care about • Compare the new model to the old one and your baselines • Understand the performance envelope of the new model • Run every time a new candidate model is created and considered for production 47
  • 48. Full Stack Deep Learning - UC Berkeley Spring 2021 Eval tests: more than just your val score • Look at all of the metrics you care about • Model metrics (precision, recall, accuracy, L2, etc) • Behavioral tests • Robustness tests • Privacy and fairness tests • Simulation tests 48
  • 49. Full Stack Deep Learning - UC Berkeley Spring 2021 Behavioral testing metrics • Goal: make sure the model has the invariances we expect • Types: • Invariance tests: assert that change in inputs shouldn’t affect outputs • Directional tests: assert that change in inputs should affect outputs • Min functionality tests: certain inputs and outputs should always produce a given result • Mostly used in NLP 49 Beyond Accuracy: Behavioral Testing of NLP models with CheckList https://arxiv.org/abs/2005.04118
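A hand-rolled sketch of CheckList-style checks written as plain pytest tests (not the CheckList library API); predict_sentiment is a toy stand-in for your model's inference function.

```python
def predict_sentiment(text: str) -> float:
    # Toy placeholder so the sketch runs end to end; replace with real inference.
    return 0.2 if "not" in text.lower() else 0.9


def test_invariance_to_names():
    # Invariance: changing a person's name should not change the prediction much.
    a = predict_sentiment("Mark had a great flight.")
    b = predict_sentiment("Maria had a great flight.")
    assert abs(a - b) < 0.1


def test_directional_negation():
    # Directional: adding a negation should move the prediction the expected way.
    positive = predict_sentiment("The service was great.")
    negated = predict_sentiment("The service was not great.")
    assert positive > negated


def test_min_functionality():
    # Min functionality: an unambiguous input should always produce a given result.
    assert predict_sentiment("I absolutely loved it.") > 0.5
```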
  • 50. Full Stack Deep Learning - UC Berkeley Spring 2021 Robustness metrics • Goal: understand the performance envelope of the model, i.e., where you should expect the model to fail, e.g., • Feature importance • Sensitivity to staleness • Sensitivity to drift (how to measure this?) • Correlation between model performance and business metrics (on an old version of the model) • Very underrated form of testing 50
  • 51. Full Stack Deep Learning - UC Berkeley Spring 2021 Privacy and fairness metrics 51 Fairness Definitions Explained (https://fairware.cs.umass.edu/papers/Verma.pdf) https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html
  • 52. Full Stack Deep Learning - UC Berkeley Spring 2021 Simulation tests • Goal: understand how the performance of the model could affect the rest of the system • Useful when your model affects the world • Main use is AVs and robotics • This is hard: need a model of the world, dataset of scenarios, etc 52
  • 53. Full Stack Deep Learning - UC Berkeley Spring 2021 Eval tests: more than just your val score • Evaluate metrics on multiple slices of data • Slices = mappings of your data to categories • E.g., value of the “country” feature • E.g., “age” feature within intervals {0 - 18, 18 - 30, 30-45, 45-65, 65+} • E.g., arbitrary other machine learning model like a noisy cat classifier 53
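A sketch of computing one metric per slice with pandas and scikit-learn; the column names ("country", "label", "pred") and the 10-point "investigate" threshold are assumptions about your evaluation table.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Placeholder evaluation table: one row per example with its slice, label, prediction.
eval_df = pd.DataFrame({
    "country": ["US", "US", "FR", "FR", "FR", "BR"],
    "label":   [1, 0, 1, 1, 0, 1],
    "pred":    [1, 0, 0, 1, 1, 1],
})

aggregate = accuracy_score(eval_df["label"], eval_df["pred"])
by_slice = eval_df.groupby("country").apply(
    lambda g: accuracy_score(g["label"], g["pred"])
)

print(f"aggregate accuracy: {aggregate:.2f}")
for slice_name, acc in by_slice.items():
    flag = "  <-- investigate" if acc < aggregate - 0.1 else ""
    print(f"{slice_name}: {acc:.2f}{flag}")
```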
  • 54. Full Stack Deep Learning - UC Berkeley Spring 2021 Surfacing slices SliceFinder: clever way of computing metrics across all slices & surfacing the worst performers 54 Automated Data Slicing for Model Validation: A Big data - AI Integration Approach What-if tool: visual exploration tool for model performance slicing The What-If Tool: Interactive Probing of Machine Learning Models
  • 55. Full Stack Deep Learning - UC Berkeley Spring 2021 Eval tests: more than just your val score • Maintain evaluation datasets for all of the distinct data distributions you need to measure • Your main validation set should mirror your test distribution • When to add new datasets? • When you collect datasets to specify specific edge cases (e.g., a “left hand turn” dataset) • When you run your production model on multiple modalities (e.g., “San Francisco” and “Miami” datasets, English and French datasets) • When you augment your training set with data not found in production 55
  • 56. Full Stack Deep Learning - UC Berkeley Spring 2021 Evaluation reports 56

Main eval set:
  Slice        Accuracy   Precision   Recall
  Aggregate    90%        91%         89%
  Age <18      87%        89%         90%
  Age 18-45    85%        87%         79%
  Age 45+      92%        95%         90%

NYC users:
  Slice           Accuracy   Precision   Recall
  Aggregate       94%        91%         90%
  New accounts    87%        89%         90%
  Frequent users  85%        87%         79%
  Churned users   50%        36%         70%
  • 57. Full Stack Deep Learning - UC Berkeley Spring 2021 Deciding whether the evaluations pass • Compare the new model to the previous model and to a fixed older model • Set thresholds on the diffs between the new and previous model for most metrics • Set thresholds on the diffs between slices • Set thresholds against the fixed older model to prevent slow performance “leaks” across successive releases 57
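A sketch of how such an evaluation gate might be automated; the metric dictionaries, slice names, and thresholds are made-up placeholders.

```python
# Evaluation gate: compare a candidate model to the previous model and to a
# fixed reference model, per slice, with explicit thresholds.
CANDIDATE = {"aggregate": 0.91, "age_under_18": 0.86, "age_45_plus": 0.93}
PREVIOUS  = {"aggregate": 0.90, "age_under_18": 0.87, "age_45_plus": 0.92}
REFERENCE = {"aggregate": 0.88, "age_under_18": 0.85, "age_45_plus": 0.90}

MAX_DROP_VS_PREVIOUS = 0.02   # tolerate small regressions vs the last model
MAX_DROP_VS_REFERENCE = 0.00  # never fall below the fixed reference ("leaks")


def evaluation_passes(candidate, previous, reference) -> bool:
    for slice_name, score in candidate.items():
        if score < previous[slice_name] - MAX_DROP_VS_PREVIOUS:
            print(f"FAIL {slice_name}: regression vs previous model")
            return False
        if score < reference[slice_name] - MAX_DROP_VS_REFERENCE:
            print(f"FAIL {slice_name}: below fixed reference model")
            return False
    return True


print("pass" if evaluation_passes(CANDIDATE, PREVIOUS, REFERENCE) else "fail")
```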
  • 58. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 58
  • 59. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 59 [Same system diagram, highlighting Shadow tests on the serving system.]
  • 60. Full Stack Deep Learning - UC Berkeley Spring 2021 Shadow tests: catch production bugs before they hit users • Goals: • Detect bugs in the production deployment • Detect inconsistencies between the offline and online model • Detect issues that appear on production data • How? • Run new model in production system, but don’t return predictions to users • Save the data and run the offline model on it • Inspect the prediction distributions for old vs new and offline vs online for inconsistencies 60
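One way to inspect the offline-vs-online prediction distributions is a two-sample test; a sketch with scipy, where the logged score arrays are placeholders for predictions captured from both code paths.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
offline_scores = rng.beta(2, 5, size=5_000)          # placeholder: offline model on logged inputs
online_scores = rng.beta(2, 5, size=5_000) + 0.02    # placeholder: shadowed online model

stat, p_value = ks_2samp(offline_scores, online_scores)
print(f"KS statistic={stat:.3f}, p={p_value:.3g}")
if p_value < 0.01:
    print("Offline and online prediction distributions differ: investigate "
          "feature skew or preprocessing mismatches before promoting the model.")
```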
  • 61. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 61 [Same system diagram, highlighting A/B tests on the serving system.]
  • 62. Full Stack Deep Learning - UC Berkeley Spring 2021 AB Testing: test users’ reaction to the new model • Goals: • Understand how your new model affects user and business metrics • How? • Start by “canarying” model on a tiny fraction of data • Consider using a more statistically principled split • Compare metrics on the two cohorts 62 https://levelup.gitconnected.com/the-engineering-problem-of-a-b-testing-ac1adfd492a8
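A sketch of comparing a binary business metric (e.g., click-through) across the two cohorts with a two-proportion z-test from statsmodels; the counts are placeholders for your logged outcomes.

```python
from statsmodels.stats.proportion import proportions_ztest

clicks = [530, 585]            # successes in control vs treatment (placeholders)
impressions = [10_000, 10_000]  # users exposed in each cohort (placeholders)

stat, p_value = proportions_ztest(clicks, impressions)
print(f"z={stat:.2f}, p={p_value:.3g}")
if p_value < 0.05:
    print("Cohorts differ significantly on this metric.")
else:
    print("No significant difference detected; consider running the test longer.")
```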
  • 63. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 63
  • 64. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 64 [Same system diagram, highlighting Monitoring (next time!) on the serving system.]
  • 65. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 65 [Same system diagram, highlighting Labeling tests on the labeling system.]
  • 66. Full Stack Deep Learning - UC Berkeley Spring 2021 Labeling tests: get more reliable labels • Goals: • Catch poor-quality labels before they corrupt your model • How? • Train and certify the labelers • Aggregate labels of multiple labelers • Assign labelers a “trust score” based on how often they are wrong • Manually spot check the labels from your labeling service • Run a previous model on new labels and inspect the ones with the biggest disagreement 66
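A sketch of the aggregation and trust-score ideas above; the annotation table is a toy placeholder.

```python
from collections import Counter, defaultdict

# (example_id, labeler_id, label) - placeholder annotations
annotations = [
    ("ex1", "alice", "cat"), ("ex1", "bob", "cat"), ("ex1", "carol", "dog"),
    ("ex2", "alice", "dog"), ("ex2", "bob", "dog"), ("ex2", "carol", "dog"),
]

# Majority vote per example
votes = defaultdict(list)
for example_id, labeler, label in annotations:
    votes[example_id].append(label)
consensus = {ex: Counter(labels).most_common(1)[0][0] for ex, labels in votes.items()}

# Trust score = fraction of a labeler's labels that match the consensus
agree, total = defaultdict(int), defaultdict(int)
for example_id, labeler, label in annotations:
    total[labeler] += 1
    agree[labeler] += int(label == consensus[example_id])
trust = {labeler: agree[labeler] / total[labeler] for labeler in total}

print(consensus)   # {'ex1': 'cat', 'ex2': 'dog'}
print(trust)       # carol agrees less often -> spot-check her labels first
```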
  • 67. Full Stack Deep Learning - UC Berkeley Spring 2021 ML Models vs ML Systems 67 [Same system diagram, highlighting Expectation tests on the storage and preprocessing system.]
  • 68. Full Stack Deep Learning - UC Berkeley Spring 2021 Expectation tests: unit tests for data • Goals: • Catch data quality issues before they make their way into your pipeline • How? • Define rules about the properties of each of your data tables at each stage of your data cleaning and preprocessing pipeline • Run them whenever you run batch data pipeline jobs 68
  • 69. Full Stack Deep Learning - UC Berkeley Spring 2021 What kinds of expectations? 69 https://greatexpectations.io/
  • 70. Full Stack Deep Learning - UC Berkeley Spring 2021 How to set expectations? 70 • Manually • Profile a dataset you expect to be high quality and use those thresholds • Both
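A hand-rolled sketch of such expectations in pandas; libraries like Great Expectations (linked above) package this pattern with many built-in checks. The rules and column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 41, 17],
    "country": ["US", "FR", "BR"],
})

# Each expectation is a named boolean check on the table.
expectations = {
    "user_id has no nulls": df["user_id"].notnull().all(),
    "age within plausible range": df["age"].between(0, 120).all(),
    "country in allowed set": df["country"].isin({"US", "FR", "BR", "DE"}).all(),
    "no duplicate user_ids": df["user_id"].is_unique,
}

failures = [name for name, ok in expectations.items() if not ok]
assert not failures, f"Data expectations failed: {failures}"
```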
  • 71. Full Stack Deep Learning - UC Berkeley Spring 2021 Challenges operationalizing ML tests • Organizational: data science teams don’t always have the same norms around testing & code review • Infrastructure: cloud CI/CD platforms don’t have great support for GPUs or data integrations • Tooling: no standard toolset for comparing model performance, finding slices, etc • Decision making: Hard to determine when test performance is “good enough” 71 ML Testing
  • 72. Full Stack Deep Learning - UC Berkeley Spring 2021 How to test your ML system the right way • Test each part of the ML system, not just the model • Test code, data, and model performance, not just code • Testing model performance is an art, not a science • The goal of testing model performance is to build a granular understanding of how well your model performs, and where you don’t expect it to perform well • Build up to this gradually! Start here: • Infrastructure tests • Evaluation tests • Expectation tests 72
  • 73. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 73
  • 74. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 74 Explainability
  • 75. Full Stack Deep Learning - UC Berkeley Spring 2021 Some definitions • Domain predictability: the degree to which it is possible to detect data outside the model’s domain of competence • Interpretability: the degree to which a human can consistently predict the model's result [1] • Explainability: the degree to which a human can understand the cause of a decision [2] 75 Explainability [1] Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. "Examples are not enough, learn to criticize! Criticism for interpretability." Advances in Neural Information Processing Systems (2016). [2] Miller, Tim. "Explanation in artificial intelligence: Insights from the social sciences." arXiv preprint arXiv:1706.07269 (2017).
  • 76. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction (feature importance, SHAP) • Understand the contribution of training data points to the prediction 76 Explainability
  • 77. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction (feature importance, SHAP) • Understand the contribution of training data points to the prediction 77 Explainability
  • 78. Full Stack Deep Learning - UC Berkeley Spring 2021 Interpretable families of models • Linear regression • Logistic regression • Generalized linear models • Decision trees 78 Explainability Interpretable and explainable (to the right user) Not very powerful
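For instance, a minimal sketch of reading feature effects directly off an interpretable model with scikit-learn; the dataset is a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Coefficients of the (standardized) logistic regression are directly readable
# as effects on the log-odds of the positive class.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1]))[:5]:
    print(f"{name}: {coef:+.2f}")
```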
  • 79. Full Stack Deep Learning - UC Berkeley Spring 2021 What about attention? 79 Explainability Interpretable Not explainable
  • 80. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention maps are not reliable explanations 80 Explainability
  • 81. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 81
  • 82. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction (feature importance, SHAP) • Understand the contribution of training data points to the prediction 82 Explainability
  • 83. Full Stack Deep Learning - UC Berkeley Spring 2021 Surrogate models • Train an interpretable surrogate model on your data and your original model’s predictions • Use the surrogate’s interpretation as a proxy for understanding the underlying model 83 Explainability Easy to use, general If your surrogate is good, why not just use it? And how do we know it makes the same decisions as the original model?
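A sketch of the global-surrogate recipe with scikit-learn, including a "fidelity" check that partially answers the second concern; the black-box model and data are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)

black_box = RandomForestClassifier(random_state=0).fit(X, y)   # stand-in model
black_box_preds = black_box.predict(X)

# Fit the interpretable surrogate on the black box's predictions, not the labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box_preds)

fidelity = accuracy_score(black_box_preds, surrogate.predict(X))
print(f"surrogate fidelity vs black box: {fidelity:.2%}")
# Only trust the surrogate's explanations if fidelity is high - and even then,
# it explains the surrogate, not the original model.
```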
  • 84. Full Stack Deep Learning - UC Berkeley Spring 2021 Local surrogate models (LIME) • Pick a data point to explain the prediction for • Perturb the data point and query the model for the prediction • Train a surrogate on the perturbed dataset 84 Explainability Widely used in practice. Works for all data types Hard to define the perturbations. Explanations are unstable and can be manipulated
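A sketch using the lime package on tabular data; the model and dataset are stand-ins, and the exact API can differ between lime versions.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)

# Explain one prediction: LIME perturbs this row and fits a local linear model.
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=4
)
print(explanation.as_list())   # [(feature description, local weight), ...]
```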
  • 85. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction • Understand the contribution of training data points to the prediction 85 Explainability
  • 86. Full Stack Deep Learning - UC Berkeley Spring 2021 Data viz for feature importance 86 Explainability Partial dependence plot Individual conditional expectation
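A sketch of producing both kinds of plot with scikit-learn; the dataset, model, and chosen features are stand-ins.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# kind="both" overlays per-sample ICE curves on the average (PDP) curve.
PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "s5"], kind="both")
plt.show()
```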
  • 87. Full Stack Deep Learning - UC Berkeley Spring 2021 Permutation feature importance • Pick a feature • Randomize its order in the dataset and see how that affects performance 87 Explainability Easy to use. Widely used in practice Doesn’t work for high-dim data. Doesn’t capture feature interdependence
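A sketch with scikit-learn's permutation_importance on a held-out split; the dataset and model are stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on the validation set and measure the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
feature_names = load_breast_cancer().feature_names
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```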
  • 88. Full Stack Deep Learning - UC Berkeley Spring 2021 SHAP (SHapley Additive exPlanations) • Intuition: • How much does the presence of this feature affect the value of the classifier, independent of what other features are present? 88 Explainability Works for a variety of data. Mathematically principled Tricky to implement. Still doesn’t provide explanations
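A sketch using the shap package with a tree-based model; the model and data are stand-ins, and output shapes and API details vary across shap versions.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# SHAP values decompose each prediction into additive per-feature contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])

# Global view: which features push predictions up or down, and by how much.
shap.summary_plot(shap_values, X.iloc[:100])
```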
  • 89. Full Stack Deep Learning - UC Berkeley Spring 2021 Gradient-based saliency maps • Pick an input (e.g., image) • Perform a forward pass • Compute the gradient with respect to the pixels • Visualize the gradients 89 Explainability Easy to use. Many variants that produce more interpretable results. Difficult to know whether the explanation is correct. Fragile. Unreliable.
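A sketch of a vanilla gradient saliency map in PyTorch; the untrained torchvision ResNet and the random image are stand-ins (the weights argument assumes torchvision >= 0.13).

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()        # stand-in; use real weights in practice
image = torch.rand(1, 3, 224, 224, requires_grad=True)

# Forward pass, then backprop the top class score to the input pixels.
scores = model(image)
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()              # d(score) / d(pixels)

# Saliency map: max absolute gradient across color channels, per pixel.
saliency = image.grad.abs().max(dim=1).values.squeeze()  # shape (224, 224)
print(saliency.shape)
# Visualize with e.g. matplotlib: plt.imshow(saliency, cmap="hot")
```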
  • 90. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 90
  • 91. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction • Understand the contribution of training data points to the prediction 91 Explainability
  • 92. Full Stack Deep Learning - UC Berkeley Spring 2021 Prototypes and criticisms 92
  • 93. Full Stack Deep Learning - UC Berkeley Spring 2021 Do you need “explainability”? • Regulators say my model needs to be explainable • What do they mean? • But yeah you should probably do explainability • Users want my model to be explainable • Can a better product experience alleviate the need for explainability? • How often will they be interacting with the model? If frequently, maybe interpretability is enough? • I want to deploy my models to production more confidently • You really want domain predictability • However, interpretable visualizations could be a helpful tool for debugging 93
  • 94. Full Stack Deep Learning - UC Berkeley Spring 2021 Can you explain deep learning models? • Not really • Known explanation methods aren’t faithful to original model performance • Known explanation methods are unreliable • Known explanations don’t fully explain models’ decisions 94 Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (https://arxiv.org/pdf/1811.10154.pdf)
  • 95. Full Stack Deep Learning - UC Berkeley Spring 2021 Conclusion • If you need to explain your model’s predictions, use an interpretable model family • Explanations for deep learning models can produce interesting results but are unreliable • Interpretability methods like SHAP, LIME, etc can help users reach the interpretability threshold faster • Interpretable visualizations can be helpful for debugging 95
  • 96. Full Stack Deep Learning - UC Berkeley Spring 2021 Thank you! 96