Why are anomalies important? Because they tell a different story from the norm. An anomaly might signify a patient's failing heart, fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies.
There are many anomaly detection algorithms available. Most have parameters that the user must set, and it is often unclear how to set them. For example, anomaly detection algorithms based on kernel density estimates require the user to choose a bandwidth. Setting the bandwidth for anomaly detection is different from setting the bandwidth for general kernel density estimation, and in high dimensions especially it is not an obvious task.
In this talk, we introduce lookout, a new approach that uses topological data analysis to select the bandwidth for anomaly detection. Using this bandwidth, lookout applies leave-one-out kernel density estimates and extreme value theory to detect anomalies.
We also define the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly increases.
The R package lookout implements this algorithm.
2. Why anomalies?
• They tell a different story
• Fraudulent credit card transactions amongst billions of legitimate transactions
• Computer network intrusions
• Astronomical anomalies – solar flares
• Weather anomalies – tsunamis
• Stock market anomalies – heralding a crash?
• Training a model on known fraud, intrusions or cyber attacks is not optimal, because new types of fraud and attack keep appearing.
• You want to be alerted when weird things happen.
• Anomaly detection is used in these applications.
4. Current challenges
AD methods rank observations in terms of anomalousness
• They don’t identify anomalies
• So, the user needs to define a threshold and identify anomalies
High false positives
• Do not want an “alarm factory” – confidence in the system goes down
Parameters need to be defined by the user
• But expert knowledge is needed
5. Overview
lookout – an anomaly detection method
• Kernel density estimates
• Topological data analysis/persistent homology
• Extreme value theory
Anomaly persistence
• When parameters are changed, do the same anomalies get identified?
6. KDE for anomaly detection
• What do we want?
• Anomalies to have much lower kde values than other points.
• Why?
• Because anomalies are in low density regions.
• The literature on bandwidth selection focusses on representing the data
• Minimize MISE (Mean Integrated Square Error)
• But, this doesn’t work for us.
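For reference, the standard definitions behind this slide can be written as follows. The notation is generic and is an assumption, since the slides do not fix one: n observations x_i in p dimensions, kernel K, bandwidth h, true density f.

```latex
\hat{f}_h(x) = \frac{1}{n h^{p}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
\mathrm{MISE}(h) = \mathbb{E} \int \bigl(\hat{f}_h(x) - f(x)\bigr)^{2} \, dx .
```

MISE rewards a bandwidth that reproduces the whole density well, which is a different goal from separating a few low-density points from the rest.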
11. We are interested in . . .
• The end-point diameter (death diameters) sequences
• We want the maximum gap
• Diameter that starts the maximum gap = 𝑑
• ℎ = 5 𝑑 for Epanechnikov kernel
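The maximum-gap step can be sketched in a few lines of Python. This is a minimal illustration, not the lookout package's code. It assumes the death diameters are those of connected components (0-dimensional persistent homology of a Vietoris-Rips filtration), which coincide with the edge lengths of a Euclidean minimum spanning tree, up to the scaling convention of the filtration. The function names and the toy points are hypothetical.

```python
import math

def mst_edge_lengths(points):
    """Sorted edge lengths of a Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known distance from each point to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(lengths)

def max_gap_diameter(death_diameters):
    """Return the diameter that starts the largest gap in the sorted sequence."""
    d = sorted(death_diameters)
    gaps = [b - a for a, b in zip(d, d[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return d[k]

# Toy data (assumption): a small grid of points plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (10, 0)]
deaths = mst_edge_lengths(points)
d_star = max_gap_diameter(deaths)
print(deaths, d_star)
```

The bandwidth would then be derived from `d_star` as stated on the slide; the sketch deliberately stops at the diameter.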
12. Using this bandwidth
• Compute the kde values
• Anomalies will have very low kde values
• We can rank the points using the kde values
• Low kde – anomalous
• High kde – not anomalous
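As a rough illustration of ranking by kde value, the sketch below evaluates a 2-D kernel density estimate at every data point and sorts the points from lowest to highest density. It assumes a radial Epanechnikov kernel with normalising constant 2/π in two dimensions; the bandwidth `h=2.0`, the function name and the toy points are illustrative choices, not values from the talk.

```python
import math

def epanechnikov_kde_2d(points, h):
    """KDE value at every data point, using a 2-D radial Epanechnikov kernel."""
    n = len(points)
    values = []
    for x in points:
        total = 0.0
        for y in points:
            u2 = (math.dist(x, y) / h) ** 2
            if u2 < 1.0:
                # K(u) = (2 / pi) * (1 - |u|^2) on the unit disk, 0 elsewhere
                total += (2.0 / math.pi) * (1.0 - u2)
        values.append(total / (n * h * h))
    return values

# Toy data (assumption): a small grid plus one isolated point at index 6.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (10, 0)]
kde = epanechnikov_kde_2d(points, h=2.0)

# Indices sorted from lowest kde (most anomalous) to highest (least anomalous).
ranking = sorted(range(len(points)), key=lambda i: kde[i])
print(ranking[0])
```

A ranking alone does not say how many of the low-density points are anomalies.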
13. But, we want to identify anomalies!
Just because the kde is low, is it an anomaly?
14. We want to have a cut off!
For that we use Extreme Value Theory!
15. lookout - EVT – Peaks Over Threshold method (POT)
• Pick a threshold – 90%
• Model the exceedances
• Generalized Pareto distribution
• fit a GPD to –log of kde values
• Cut off threshold
• Leave-one-out kde for outlier detection
• Identify anomalies
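The peaks-over-threshold steps above can be sketched as follows. This is a generic POT illustration under stated assumptions, not the lookout implementation: the GPD is fitted by the method of moments (swapped in here for simplicity; a real implementation would more likely use maximum likelihood), the threshold is an empirical value near the 90% quantile, and the scores are synthetic stand-ins. In the method described on the slide, `fit_scores` would be the –log kde values and `eval_scores` the –log leave-one-out kde values. All function names and the cut-off `alpha` are hypothetical.

```python
import math
import statistics

def fit_gpd_moments(excesses):
    """Method-of-moments estimates (xi, sigma) of a generalized Pareto distribution."""
    m = statistics.mean(excesses)
    v = statistics.variance(excesses)
    if v == 0:
        raise ValueError("excesses must not all be equal")
    ratio = m * m / v
    return 0.5 * (1.0 - ratio), 0.5 * m * (ratio + 1.0)

def gpd_tail_prob(z, xi, sigma):
    """P(Z > z) for a GPD excess Z with shape xi and scale sigma."""
    if z <= 0:
        return 1.0
    if abs(xi) < 1e-9:
        return math.exp(-z / sigma)  # exponential limit as xi -> 0
    base = 1.0 + xi * z / sigma
    if base <= 0:
        return 0.0  # beyond the finite upper end point (xi < 0)
    return base ** (-1.0 / xi)

def pot_tail_probabilities(fit_scores, eval_scores, q=0.90):
    """Peaks-over-threshold estimate of P(S > s) for each s in eval_scores."""
    ordered = sorted(fit_scores)
    u = ordered[int(q * len(ordered))]  # empirical threshold near the q-quantile
    excesses = [s - u for s in ordered if s > u]
    xi, sigma = fit_gpd_moments(excesses)
    rate = len(excesses) / len(ordered)  # fraction of scores above the threshold
    return [rate * gpd_tail_prob(s - u, xi, sigma) for s in eval_scores]

# Synthetic stand-in scores (assumption): deterministic exponential-like values.
n = 200
fit_scores = [-math.log(1 - (i + 0.5) / n) for i in range(n)]
eval_scores = [0.5, 2.0, 25.0]

alpha = 0.05  # illustrative cut-off on the tail probability
probs = pot_tail_probabilities(fit_scores, eval_scores)
flags = [p < alpha for p in probs]
print(flags)
```

Only the score far beyond the bulk of the fitted values is flagged; scores at or below the threshold keep a tail probability no smaller than the exceedance rate.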
16. Anomaly Persistence
• What if a data-point is identified as an anomaly for different bandwidth values?
• Visual representation of anomaly persistence
• Big picture
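The idea of tracking anomalies across bandwidths can be sketched with a bandwidth sweep. The detector used here is a deliberately simple stand-in and is not lookout's rule: it flags a point when no other point lies within distance h. The function names, the bandwidth grid and the toy points are all hypothetical.

```python
import math

def flag_isolated(points, h):
    """Stand-in detector (NOT lookout's rule): flag points with no other point within h."""
    flagged = set()
    for i, x in enumerate(points):
        if all(math.dist(x, y) >= h for j, y in enumerate(points) if j != i):
            flagged.add(i)
    return flagged

def anomaly_persistence(points, bandwidths, detect):
    """Map each flagged point index to the sorted bandwidths at which it was flagged."""
    persistence = {}
    for h in sorted(bandwidths):
        for i in detect(points, h):
            persistence.setdefault(i, []).append(h)
    return persistence

# Toy data (assumption): a small grid plus one isolated point at index 6.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (10, 0)]
bandwidths = [0.5, 1.5, 3.0, 6.0, 9.0]

persistence = anomaly_persistence(points, bandwidths, flag_isolated)
# Fraction of the bandwidth grid over which each point is flagged.
strength = {i: len(hs) / len(bandwidths) for i, hs in persistence.items()}
print(persistence[6], strength[6])
```

In this toy run every point is flagged at the smallest bandwidth, but only the isolated point stays flagged over most of the grid, which is the sense in which it persists.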
17. Example 1
2D normal distribution, with outliers at the far end.
The outlying indices are 501 - 505
The persistence diagram. The outliers get identified for a large range of bandwidth values.
18. Example 2
2D bimodal distribution, with outliers in the trough.
The outliers have indices 1001 - 1005
The persistence diagram. Again, the outliers get identified for a large range of bandwidth values.
19. Example 4
Points in an annulus with anomalies in the middle.
Anomalies have indices 1001 - 1010
The persistence diagram.
20. Practical advantages of lookout
The user does not need to specify a bandwidth parameter
• The user can be anyone – not necessarily a statistician
EVT based methods have low false positive rates
• Attractive for many applications
• Not an alarm factory
21. Summary
• lookout - an EVT-based method to find anomalies (using TDA)
• R package lookout is on CRAN
• Preprint available
• https://bit.ly/lookoutliers