This document provides an overview of machine learning application testing. It discusses common data-science mistakes such as cherry-picking and false causality, and describes different types of machine learning tasks, such as supervised classification and unsupervised clustering. It outlines how to test the various parts of a machine learning application, including the data, the model, and the different phases, with examples of testing boundaries, detecting outliers, and using generative adversarial networks. Finally, it discusses the QA engineer's role in gathering data to validate that a system works in non-standard situations and does not cause harm.
20. CLASSIFICATION
TRAIN data set

Main hormone  | Long hair | has_hotdog | Sex
------------- | --------- | ---------- | ------
testosterone  | 0         | 1          | male
estrogen      | 1         | 0          | female
testosterone  | 0         | 1          | male
estrogen      | 1         | 0          | female
testosterone  | 1         | 1          | male
testosterone  | 0         | 1          | male
testosterone  | 0         | 0          | male
testosterone  | 0         | 1          | male
testosterone  | 1         | 1          | male
testosterone  | 0         | 1          | male
21. CLASSIFICATION
TRAIN data set

Main hormone  | Long hair | has_hotdog | Sex
------------- | --------- | ---------- | ------
testosterone  | 0         | 1          | male
estrogen      | 1         | 0          | female
testosterone  | 0         | 1          | male
estrogen      | 1         | 0          | female
testosterone  | 1         | 1          | male
testosterone  | 0         | 1          | male
testosterone  | 0         | 0          | male
testosterone  | 0         | 1          | male
testosterone  | 1         | 1          | male
testosterone  | 0         | 1          | male

Imbalanced data:
20% female
80% male
22. Imbalanced data
Over-sampling the minority class, under-sampling the majority class, or both:

library(ROSE)
undersampling_result <- ovun.sample(Class ~ .,
                                    data = Dataset,
                                    method = "under")  # one of "over", "under", "both"
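The same rebalancing idea can be sketched in base R without any package. The toy data frame below (an 80/20 male/female split mirroring the slide) is an assumption for illustration:

```r
# Hypothetical toy data: 80% "male", 20% "female", mirroring the slide.
set.seed(42)
dataset <- data.frame(
  sex = c(rep("male", 80), rep("female", 20)),
  long_hair = rbinom(100, 1, 0.3)
)

minority <- dataset[dataset$sex == "female", ]
majority <- dataset[dataset$sex == "male", ]

# Over-sampling: draw minority rows with replacement
# until both classes have the same count.
oversampled <- rbind(
  majority,
  minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
)

# Under-sampling: keep only as many majority rows as there are minority rows.
undersampled <- rbind(
  minority,
  majority[sample(nrow(majority), nrow(minority)), ]
)

table(oversampled$sex)   # balanced: 80 / 80
table(undersampled$sex)  # balanced: 20 / 20
```

ROSE's "both" method combines the two moves, over-sampling one class while under-sampling the other.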
23. CLASSIFICATION
Real-life dataset

Main hormone  | Long hair | has_hotdog | Sex
------------- | --------- | ---------- | ---
estrogen      | 1         | 0          | ?
estrogen      | 1         | 0          | ?
testosterone  | 0         | 1          | ?
testosterone  | 0         | 0          | ?
estrogen      | 1         | 1          | ?
testosterone  | 0         | 1          | ?
testosterone  | 0         | 0          | ?
24. CLASSIFICATION
Define object class

Main hormone  | Long hair | has_hotdog | Sex
------------- | --------- | ---------- | ------
estrogen      | 1         | 0          | female
estrogen      | 1         | 0          | female
testosterone  | 0         | 1          | male
testosterone  | 0         | 0          | female
estrogen      | 1         | 1          | male
testosterone  | 0         | 1          | male
testosterone  | 0         | 0          | male
51. Outlier detection

library(dbscan)
furniture_lof <- lof(scale(furniture), k = 5)

Interpreting LOF:
LOF is a ratio of densities.
LOF > 1: more likely to be anomalous
LOF ≤ 1: less likely to be anomalous
Large LOF values indicate more isolated points.
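The density-ratio intuition can be sketched in base R with a simplified score: each point's average k-nearest-neighbour distance divided by that of its neighbours. This is not the full LOF algorithm (which uses reachability distances), just an illustration of why a ratio above 1 flags isolated points; the toy data is an assumption.

```r
# Simplified density-ratio score, illustrating the idea behind LOF.
# NOT the full LOF algorithm -- no reachability distances here.
knn_score <- function(x, k = 5) {
  d <- as.matrix(dist(x))
  # average distance from each point to its k nearest neighbours
  # (sorted position 1 is the point itself, distance 0)
  avg_knn <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))
  sapply(seq_len(nrow(d)), function(i) {
    nb <- order(d[i, ])[2:(k + 1)]          # indices of the k neighbours
    avg_knn[i] / mean(avg_knn[nb])          # > 1: sparser than its neighbours
  })
}

set.seed(1)
pts <- rbind(matrix(rnorm(100), ncol = 2),  # 50 points in a dense cluster
             c(10, 10))                     # one isolated point
scores <- knn_score(pts)
which.max(scores)  # the isolated point (row 51) gets the largest score
```

Cluster points sit in regions about as dense as their neighbours' and score near 1; the isolated point scores far above 1.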
52. Outlier detection

library(h2o)
h2o.init()

# Train a deep autoencoder on "normal" training data
# (no response column y -- the input is its own target)
anomaly_model <- h2o.deeplearning(
  x = names(train_dataset),
  training_frame = train_dataset,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(50, 20, 50),
  sparse = TRUE,
  l1 = 1e-4,
  epochs = 100)

# Compute the per-row reconstruction error (MSE between the
# input and output layers) to flag anomalies
detected_anomalies <- h2o.anomaly(anomaly_model, test_dataset)
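The reconstruction-error idea can be shown without H2O using PCA as a linear stand-in for the autoencoder: compress, reconstruct, and flag the rows that reconstruct worst. The correlated toy data and the single-component choice are assumptions for illustration.

```r
# Linear stand-in for the autoencoder: compress "normal" data with PCA,
# reconstruct, and flag rows with the largest reconstruction MSE.
set.seed(2)
x1 <- rnorm(100)
normal  <- cbind(x1, x1 + 0.1 * rnorm(100))  # strongly correlated "normal" data
anomaly <- c(4, -4)                          # breaks the correlation pattern
test    <- rbind(normal[1:10, ], anomaly)

pca <- prcomp(normal, center = TRUE, scale. = FALSE)
k <- 1                                       # keep one principal component

reconstruct <- function(x, pca, k) {
  rot    <- pca$rotation[, 1:k, drop = FALSE]
  scores <- scale(x, center = pca$center, scale = FALSE) %*% rot
  sweep(scores %*% t(rot), 2, pca$center, `+`)
}

mse <- rowMeans((test - reconstruct(test, pca, k))^2)
which.max(mse)  # row 11, the anomaly, has the largest reconstruction error
```

An autoencoder does the same job with a nonlinear encoder/decoder, which is why it catches anomalies that break nonlinear structure too.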
67. ML testing
OBJECT: ML app, model, data, process
SUBJECT: QA engineer, data analyst, data scientist, ML engineer
GOAL: find unexpected object behavior in order to improve the object
68. What are ML application errors?
Wrong:
decision -- binary or multi-class classification
prediction -- regression, forecasting
answer (generation) -- speech generation, picture generation
Not enough accuracy (precision and recall):
particular situation -- detecting (the edges of) an object (e.g., detecting a target in medicine or on a battlefield)
big amount of data -- ROC-AUC
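ROC-AUC can be computed directly from scores with the rank (Mann-Whitney) formula: it is the probability that a randomly chosen positive is scored higher than a randomly chosen negative. A minimal base-R sketch, with made-up scores:

```r
# ROC-AUC via the rank (Mann-Whitney) formula.
auc <- function(scores, labels) {  # labels: 1 = positive, 0 = negative
  r     <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 1, 0, 0, 0)
auc(c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1), labels)  # 1.0 -- perfect ranking
auc(c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9), labels)  # 0.0 -- perfectly wrong
```

Because it only depends on the ranking, AUC is robust to the class imbalance that plagues accuracy on large datasets.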
69. Changes in testing philosophy

Traditional software            | ML software
------------------------------- | ----------------------------------------
Some FIXED expected results     | Some PROBABLE value
Sorted list for all situations  | Arranged list for a particular situation
one IN, one OUT                 | multiple IN, multiple OUT
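The shift from fixed to probable expectations can be shown in two assertions; the predictions, labels, and the 0.85 threshold below are illustrative assumptions:

```r
# Traditional test: one fixed expected output per input.
stopifnot(identical(sqrt(16), 4))

# ML-style test: assert an aggregate metric over many inputs
# instead of an exact answer for each one.
predictions <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1)  # hypothetical model output
truth       <- c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1)
accuracy <- mean(predictions == truth)
stopifnot(accuracy >= 0.85)  # a probable value, not a fixed one
```

The ML-style test tolerates individual wrong answers as long as the aggregate quality stays above the agreed threshold.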
70. Common data science mistakes
• Cherry-Picking
• Data Dredging
• False Causality
• Cobra Effect
• Survivorship Bias
• Gerrymandering
• Sampling Bias
• Gambler's Fallacy
• Hawthorne Effect
• Regression Fallacy
• Simpson's Paradox
• McNamara Fallacy
• Overfitting
• Publishing Bias
• Relying only on Summary Metrics (Anscombe's quartet)
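The last point is easy to demonstrate: Anscombe's quartet ships with base R as the `anscombe` data frame, four x/y sets with nearly identical summary statistics but completely different shapes.

```r
# Anscombe's quartet: identical summaries, different data.
data(anscombe)
summaries <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor = cor(x, y))
})
summaries
# All four columns agree to ~2 decimals (mean_x = 9, mean_y ~ 7.5,
# cor ~ 0.816), yet plotting each pair reveals four very different
# relationships -- a line, a curve, and two outlier-driven patterns.
```

This is exactly why testing an ML system on summary metrics alone is not enough.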
71. What can be tested
• Data
• Features
• Entities
• Model
• Phases
• Performance
• Workflow
72. Application workflow
UI → Not ML part → ML part → Not ML part → UI

UI: interact with the user
gather data
return data
validate right answers
inform the user about the possibility of errors
validate the user's ability to judge right and wrong answers
73. Application workflow
UI → Not ML part → ML part → Not ML part → UI

Not ML part:
transform data
integration with third-party systems
API actions
form answers
add business rules
filtering and wrangling
error handling
outlier detection
outlier handling
missing data handling
invalidating new rules against ML actions
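One of the "Not ML part" steps, missing data handling, can be sketched with a common baseline: imputing numeric NAs with the column median. The data frame and column names below are illustrative assumptions.

```r
# Baseline missing-data handling before the ML part:
# replace numeric NAs with the column median.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

raw     <- data.frame(price = c(10, NA, 30), weight = c(1.5, 2.0, NA))
cleaned <- impute_median(raw)
cleaned  # NA price becomes 20, NA weight becomes 1.75
```

A QA engineer can test this step in isolation: feed rows with known gaps and assert that the ML part never receives an NA.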
75. Application workflow
UI → Not ML part → ML part → Not ML part → UI

Integrations:
end-to-end (system) testing
reinforcement process
handling new (or absent) data and rules

Requires:
a large amount of data
supervision of different situations
full automation
76. QA engineer tasks:
Interpret cases where the application
does not work
does not work accurately enough
works in a non-standard situation
detect situations where the application can harm others
Gather data for decision-making
interpret negative cases on outliers
False positives
False negatives
prepare special controversial data for validating the system
pictures with specific objects
noise
prepare controversial situations in which the application can generate errors
test and research existing solutions (Kaggle)
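False positives and false negatives, and the metrics derived from them, can be counted directly from a confusion matrix; the predicted and actual labels below are made up for illustration:

```r
# Counting false positives/negatives and deriving precision and recall.
predicted <- c(1, 1, 1, 0, 0, 0, 1, 0)  # hypothetical model output
actual    <- c(1, 0, 1, 0, 1, 0, 1, 0)  # hypothetical ground truth

tp <- sum(predicted == 1 & actual == 1)  # true positives:  3
fp <- sum(predicted == 1 & actual == 0)  # false positives: 1
fn <- sum(predicted == 0 & actual == 1)  # false negatives: 1

precision <- tp / (tp + fp)  # 0.75 -- how many flagged cases were real
recall    <- tp / (tp + fn)  # 0.75 -- how many real cases were flagged
```

Interpreting which kind of error matters more (a missed tumor vs. a false alarm) is exactly the judgment the QA engineer supplies.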
77. ML steps to reproduce:
all entities that were wrongly classified
requires understanding why
understand their cluster
make it possible to detect them separately
wrong measurement metric
accuracy on a large amount of data
validate the system by giving it controversial data
78. BUG (issue) report
Statuses:
• does not work
• works incorrectly
• does not work accurately enough
• does not work accurately
• does not work fast enough
• WORKS ON THE DEV SAMPLE
79. Who wants to know more?
If we collect at least 200 interested requests, we will create a small course (smart talk or meetup series) on this topic.
https://forms.gle/sYM1Rhc5MZXi76Di9