Robustness Metrics for ML Models based on Deep Learning Methods
1. Robustness Metrics for ML Models based on Deep Learning Methods
Davide Posillipo
Alkemy - Data Science Milan Meetup - AI Guild
21/07/2022, Milan
2. Who am I?
Hi! My name is Davide Posillipo.
I studied Statistics.
For 8+ years I've been working on projects that require using data to answer questions.
Freelance Data Scientist and Teacher.
Currently building Alkemy's Deep Learning & Big Data Department.
Happy to connect and work together on
new ideas and projects. Get in touch!
5. MNIST: a simple case
• MNIST: a solved problem
• LeNet: Deep Convolutional Neural Network, 99% accuracy
[Figure: an input image and the predicted digit]
6. What happens if we pass a chicken to LeNet?
The chicken comes from a different distribution.
What should the model do?
7. What happens if we pass a chicken to LeNet?
We expect this output:
?
8. What happens if we pass a chicken to LeNet?
… but we obtain this result:
6, with probability 0.9
9. A less obvious example: Fashion MNIST
63.4% of examples are classified with a probability greater than 0.99
74.3% of examples are classified with a probability greater than 0.95
88.9% of examples are classified with a probability greater than 0.75
LeNet (trained for MNIST) produces "confident" predictions for the Fashion MNIST
10. Can we trust deep learning predictions?
• There are scenarios where it's crucial to avoid meaningless predictions (e.g. healthcare, high-precision manufacturing, autonomous driving…): safety in AI
• Making no prediction is better than random guessing
• We need some robustness metrics
11. Robustness metrics: what should they look like?
A robustness metric should tell us whether the prediction is reliable or just some sort of random guessing. It should provide a measure of confidence in our prediction.
• Computable without labels
• Computable in "real time"
• Easy to plug into a working ML pipeline
• Low false positive rate, but also low false negative rate:
  • High effectiveness in detecting the anomalous points (low false negative rate)
  • Low rate of discarded "good predictions" (low false positive rate)
12. A few words about Robustness
Robustness: the quality of an entity that doesn't break too easily when stressed in some way.
Canonical approaches in Statistics involve some modification of the predictive model/estimator ("If you get into a stressful situation, handle it better"), often with a loss of performance as a side effect.
Tonight we focus on a different approach: we don't modify our models, but we protect them from threats ("Don't get into stressful situations"). We look for the chickens and keep them away from our model.
Fight-or-flight(-or-freeze) response: we choose to fly!
13. Lack of robustness is a risk
There is increasing attention to the risks associated with AI applications.
The EU has proposed an AI regulation that could become applicable to industry players starting from 2024 (https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)
The Commission is proposing the first-ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally.
At some point, robustness will be a mandatory requirement for ML models in production, and lack of it will be considered a risk.
14. Robustness metrics: two approaches
• GAN-based approach
  • Decide if a prediction is worth the risk, checking the "stability" of the classifier on the new input data
• VAE-based approach
  • Decide if a prediction is worth the risk, checking the true "origin" of the new input data
16. GAN in a nutshell
• Generative Adversarial Networks
• Composed of two networks:
  • Generator: generates inputs from random noise
  • Discriminator: decides if the inputs from the Generator are authentic or artificial
• After many iterations, the model learns to create inputs really close to the authentic ones
17. A fancier GAN: WGAN
• Wasserstein Generative Adversarial Networks
• It uses the Wasserstein distance as the GAN loss function
• The Wasserstein distance is a measure of the distance between two probability distributions
• For more info: https://lilianweng.github.io/posts/2017-08-20-gan/
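As an aside, for one-dimensional equal-size empirical samples the Wasserstein-1 distance reduces to matching sorted samples pairwise. A minimal NumPy sketch (the function name `wasserstein_1d` is mine, not from the talk):

```python
import numpy as np

def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For empirical distributions with the same number of points, the
    optimal transport plan simply matches sorted samples pairwise.
    """
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    return float(np.mean(np.abs(u - v)))

# Shifting a distribution by a constant c moves it a Wasserstein
# distance of exactly c, which is what makes this loss well-behaved
# even when the two distributions barely overlap.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)
b = a + 3.0
print(wasserstein_1d(a, b))  # ≈ 3.0
```

Unlike KL divergence, this stays finite and informative for disjoint supports, which is the reason WGAN adopts it.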
18. Adding the Inverter
• Train a Generator (from random noise to the input space)
• Train a Discriminator (it tries to understand if a point is true or generated by the Generator)
  • The Generator learns how to fool the Discriminator
• Train an Inverter (from the input space to the latent representation)
  • The Inverter learns how the Generator maps from the random noise to the input space
20. WGAN + Inverter: robustness metrics
• Given a new data point x, find the closest point to x in the latent space that, once translated back to the original space, is able to confound the classifier f
• Robustness metric: the distance ( delta_z = z* - I(x) ) between these two points, in the latent representation
• If delta_z is "small", the classifier is not confident about its prediction (the classifier "changes its mind" too easily)
• If delta_z is "big", the classifier is confident about the prediction (the classifier "knows what it wants")
Rule: if delta_z < threshold then do not predict
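The search for z* can be sketched with toy stand-ins for the networks. Everything below (the 1-D `G`, `I`, `f` and the random search) is a hypothetical illustration of the idea, not the actual WGAN + Inverter implementation, which would use real trained networks and a more careful latent-space search:

```python
import numpy as np

# Toy stand-ins for the trained networks; in the real setup G is the
# WGAN generator, I the inverter, and f the classifier under protection.
def G(z):            # generator: latent -> input space (here both 1-D)
    return 2.0 * z

def I(x):            # inverter: input space -> latent space
    return x / 2.0

def f(x):            # classifier: predicts class 0 or 1
    return int(x > 1.0)

def delta_z(x, n_samples=10_000, scale=2.0, seed=0):
    """Distance from I(x) to the nearest sampled latent point whose
    decoded image flips the classifier's prediction (random-search sketch)."""
    rng = np.random.default_rng(seed)
    z0, y0 = I(x), f(x)
    candidates = z0 + rng.uniform(-scale, scale, n_samples)
    flipped = [z for z in candidates if f(G(z)) != y0]
    return min(abs(z - z0) for z in flipped)

# An input far from the decision boundary needs a large latent move to
# flip the prediction, so it counts as robust; a borderline input does not.
print(delta_z(2.0) > delta_z(1.05))  # True
```

The rule from the slide then discards the prediction whenever `delta_z(x)` falls below the chosen threshold.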
22. WGAN + Inverter: Training and
distribution of delta_z
Distribution of delta_z computed on 500 test set
input points
23. WGAN + Inverter: Results
Distribution of delta_z computed on 500 test set input
points
Delta_z for the chicken: 1.8e-06 (< 5.3e-02)
Using the 5th percentile of the test delta_z as threshold, we get a 5% false positive rate and discard the meaningless predictions for the chicken picture.
5th percentile: 5.3e-02
24. WGAN + Inverter: Results with Fashion
MNIST
Using the 5th percentile of the test delta_z as threshold, we get a 5% false positive rate but 34.4% of good predictions lost (false negative rate)!
Distribution of delta_z computed on 500 test set input points
5th percentile: 5.3e-02
We need to reduce the false negative rate!
26. AutoEncoder in a nutshell
• VAE: Variational AutoEncoder
• Composed of two neural networks:
  • Encoder: takes the inputs and "compresses" them into a deep latent representation
  • Decoder: takes the deep representation and recreates the original input as well as it can
• After many iterations, the model learns the best "latent representation" (a kind of compression) of the training data
27. From AE to Variational AutoEncoder
Instead of mapping the input into a fixed vector, we map it into a distribution.
In this way we learn the likelihood p(x|z), in this context called the probabilistic decoder, which we can use to generate from the latent space.
The encoding happens likewise: we learn the posterior p(z|x), in this context called the probabilistic encoder.
For more info: https://lilianweng.github.io/posts/2018-08-12-vae/
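For a diagonal Gaussian encoder, the KL term of the VAE loss has a closed form. A minimal sketch of the negative ELBO, assuming a squared-error reconstruction term (the function names and that choice of reconstruction loss are mine, not from the talk):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the
    regularisation term of the VAE loss."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: squared reconstruction error plus the KL term."""
    recon = float(np.sum((x - x_recon) ** 2))
    return recon + kl_to_standard_normal(mu, log_var)

# A latent code already matching the prior (mu=0, sigma=1) contributes
# zero KL, so a perfect reconstruction gives zero loss.
x = np.array([1.0, 2.0])
print(vae_loss(x, x, np.zeros(2), np.zeros(2)))  # 0.0
```

It is this combined loss value that the next slide reuses as a robustness metric.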
28. Variational Autoencoder: robustness
metrics
The loss function of a variational autoencoder is an indirect measure of the probability that an observation comes from the same distribution that generated the training set.
An "unlikely" input will have trouble in the encoding-decoding process = high loss value.
Robustness metric: the loss function
• "Big" loss -> prediction not robust: the new input data doesn't come from the training set's underlying distribution
• "Small" loss -> robust prediction: the new input data comes from the training set's underlying distribution
Notice that with this approach we don't need the classifier f to compute the robustness metric!
Rule: if VAE_loss > threshold then do not predict
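This rule translates directly into a guard around the classifier. A minimal sketch with toy stand-ins for the classifier and the VAE loss (the values 309.4 and 212 are taken from the results slide; everything else is illustrative):

```python
def conditional_predict(x, classifier, vae_loss_fn, threshold):
    """Rule from the slide: if VAE_loss > threshold then do not predict."""
    if vae_loss_fn(x) > threshold:
        return None  # abstain: the input looks out-of-distribution
    return classifier(x)

# Toy stand-ins: a constant classifier and a loss that is high only
# for the "chicken" input.
classifier = lambda x: 6
vae_loss_fn = lambda x: 309.4 if x == "chicken" else 50.0

print(conditional_predict("digit", classifier, vae_loss_fn, 212))    # 6
print(conditional_predict("chicken", classifier, vae_loss_fn, 212))  # None
```

Returning `None` (or raising, or routing to a human) on abstention is a design choice; the point is that the guard never needs the classifier's own output to decide.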
31. Variational Autoencoder: Results
Distribution of VAE losses computed on 10,000 test set input points
Using the maximum test loss as threshold, we get a 0% false positive rate and discard the meaningless predictions for the chicken picture.
VAE loss for the chicken: 309.4 (> 212)
32. Variational Autoencoder: Results with
Fashion MNIST
Distribution of VAE losses computed on 10,000 test set input points
Using the maximum test loss as threshold, we get a 0% false positive rate and 3.55% of good predictions lost (false negative rate)
Using the 95th percentile of the test loss as threshold, we get a 5% false positive rate and only 0.25% of good predictions lost (false negative rate)
33. Variational Autoencoder: robustness
metrics into production
• Train your classifier
• Train a VAE on your training set
• Get the distribution of the VAE losses on your test set
• Define a more or less "conservative" threshold
• Implement a "conditional classifier":
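These steps can be sketched end to end, with simulated test-set losses standing in for a real trained VAE and classifier (the gamma distribution and all names here are illustrative assumptions, not the talk's code):

```python
import numpy as np

# Stand-in for step 3: in practice these would be real VAE losses
# computed on your held-out test set.
rng = np.random.default_rng(42)
test_losses = rng.gamma(shape=5.0, scale=20.0, size=10_000)

# Step 4: a more "conservative" percentile keeps more predictions but
# lets more out-of-distribution inputs through; the 95th percentile
# mirrors the 5% false-positive trade-off from the results slides.
threshold = np.percentile(test_losses, 95)

# Step 5: the conditional classifier abstains above the threshold.
def conditional_classifier(x, classifier, vae_loss_fn):
    return classifier(x) if vae_loss_fn(x) <= threshold else None

# By construction, about 5% of in-distribution test points would be
# discarded by this threshold.
discard_rate = np.mean(test_losses > threshold)
print(round(float(discard_rate), 3))
```

The threshold percentile is the single knob the data scientist tunes, trading lost good predictions against accepted out-of-distribution inputs.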
35. Wrapping up: pros and cons of this
approach
Pros
• No need to modify your predictive models
• The same "monitoring" system can be used for different ML models (for a given dataset)
• Applicable to any kind of data (tabular, images, …)
• "Easy" to explain
• Easy to plug into existing pipelines
Cons
• Arbitrary thresholds must be set by the data scientist
• It introduces a further model that needs to be maintained
36. Thank you for your
attention!
https://www.linkedin.com/in/davide-posillipo/