Mathematics of anomalies

Lookout! Persisting anomalies ahead
Sevvandi Kandanaarachchi, Rob Hyndman
Preprint - https://bit.ly/lookoutliers

Why anomalies?
• They tell a different story
• Fraudulent credit card transactions amongst billions of legitimate
transactions
• Computer network intrusions
• Astronomical anomalies – solar flares
• Weather anomalies – tsunamis
• Stock market anomalies – heralding a crash?
• Training a model on certain fraud/intrusions/cyber attacks is not
optimal, because there are new types of fraud/attacks, always!
• You want to be alerted when weird things happen.
• Anomaly detection is used in these applications.

Current challenges
AD methods rank observations in terms of anomalousness
• They don’t identify anomalies
• So, the user needs to define a threshold and identify anomalies
High false positives
• Do not want an “alarm factory” – confidence in the system goes down
Parameters need to be defined by the user
• But expert knowledge is needed

Overview
Anomaly persistence
• When parameters are changed, do the same anomalies get identified?
lookout – an anomaly detection method
• Kernel density estimates
• Topological data analysis/persistent homology
• Extreme value theory

KDE for anomaly detection
• What do we want?
• Anomalies to have much lower kde values than other points.
• Why?
• Because anomalies are in low density regions.
• The literature on bandwidth selection focusses on representing the
data
• Minimize MISE (Mean Integrated Squared Error)
• But, this doesn’t work for us.
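For reference, the two quantities on this slide can be written out in the univariate case. The notation is assumed here and does not appear on the slides: a sample x_1, ..., x_n, a kernel K, a bandwidth h, and the true density f.

```latex
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
\mathrm{MISE}(h) = \mathbb{E} \int \bigl(\hat{f}_h(x) - f(x)\bigr)^2 \, dx
```

MISE measures how well the estimate represents the whole density on average. It does not reward separating a few low-density points from the rest, which is the property wanted for anomaly detection.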

Bandwidth, KDE and anomalies
• Anomalies in the middle
• Indices 1001-1010
• The bandwidth minimising MISE is
0.018
• Want anomalies to have lowest KDE
• Table: indices of the 10 lowest-KDE points (rows) as the KDE bandwidth increases (columns, bandwidth in the top row)
  0.05   0.2  0.35   0.5  0.65   0.8  0.95   1.1  1.25   1.4
   232   232  1010  1010  1006  1006  1006   495   495   495
  1010   446  1001  1001  1009  1009  1009   843   843   843
   424  1010  1008  1008  1005  1005  1005   486   486   486
   359   495  1004  1004  1002  1002  1002  1006   979   166
   963  1001  1003  1002  1004  1004  1004  1009   166   979
   814   975  1002  1003  1007  1007  1007  1005   948   948
    70  1008  1007  1007  1003  1003  1003  1002   964   964
   257   799  1006  1006  1008  1001  1001  1004   832   832
   511   843  1009  1009  1001  1008  1008  1007   110   147
   458   511  1005  1005  1010  1010  1010  1003   147   110
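The effect of bandwidth on the ranking can be reproduced on a small scale. The sketch below is a minimal illustration, not the data behind the table: it assumes a Gaussian kernel, a hypothetical 1-D toy dataset (two evenly spaced clusters with three planted anomalies in the gap between them), and helper names invented here.

```python
import math

def gaussian_kde(points, h):
    """Gaussian KDE value at every point of a 1-D sample, with bandwidth h."""
    n = len(points)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - y) / h) ** 2) for y in points)
            for x in points]

def lowest_kde_indices(points, h, k):
    """1-based indices of the k points with the lowest KDE value."""
    dens = gaussian_kde(points, h)
    order = sorted(range(len(points)), key=lambda i: dens[i])
    return [i + 1 for i in order[:k]]

# Hypothetical toy data: clusters on [-4, -2] and [2, 4] (indices 1-202),
# with three anomalies planted in the middle (indices 203-205).
cluster_a = [-4 + 0.02 * i for i in range(101)]
cluster_b = [2 + 0.02 * i for i in range(101)]
anomalies = [-0.1, 0.0, 0.1]
points = cluster_a + cluster_b + anomalies

for h in (0.3, 3.0):
    print(h, sorted(lowest_kde_indices(points, h, 3)))
```

With the moderate bandwidth the three planted anomalies have the lowest density. With the large bandwidth the two clusters blur into a single bump centred on the gap, so the outermost cluster points take over the bottom of the ranking, which mirrors the right-hand columns of the table.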

How do we select a bandwidth appropriate for anomaly detection?

In comes persistent homology
• Methodology in topological data analysis

With an anomaly
Dimension 0 – connected components

We are interested in . . .
• The sequence of end-point diameters (death diameters)
• We want the maximum gap between consecutive death diameters
• Diameter that starts the maximum gap = 𝑑
• ℎ = 5 𝑑 for Epanechnikov
kernel
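The maximum-gap step can be sketched in a few lines. In dimension 0, the death diameters of a Vietoris-Rips filtration coincide with the edge lengths of the Euclidean minimum spanning tree, so the sketch below computes them with Prim's algorithm instead of a TDA library. The function names and the toy points are assumptions made for illustration, and the conversion from 𝑑 to the bandwidth ℎ is left out.

```python
import math

def death_diameters(points):
    """Sorted dimension-0 death diameters, computed as the edge lengths of
    the Euclidean minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known distance from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:        # the first point starts the tree and adds no edge
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)

def max_gap_diameter(deaths):
    """Death diameter that starts the largest gap between consecutive values."""
    gaps = [b - a for a, b in zip(deaths, deaths[1:])]
    i = max(range(len(gaps)), key=lambda j: gaps[j])
    return deaths[i]

# Hypothetical toy data: a unit square plus one far-away point.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
deaths = death_diameters(pts)
d = max_gap_diameter(deaths)
print(deaths, d)
```

Here the three short edges of the square die at diameter 1, the isolated point only merges at about 5.66, and the largest gap therefore starts at 𝑑 = 1.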

Using this bandwidth
• Compute the kde values
• Anomalies will have very low kde values
• We can rank the points using the kde values
• Low kde – anomalous
• High kde – not anomalous

But, we want to identify anomalies!
Just because the kde is low, is it an anomaly?

We want to have a cut off!
For that we use Extreme Value Theory!

lookout: EVT with the Peaks Over Threshold (POT) method
• Pick a threshold – 90%
• Model the exceedances
• Generalized Pareto distribution (GPD)
• Fit a GPD to the -log of the kde values
• Cut off threshold
• Leave-one-out kde for outlier detection
• Identify anomalies
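The POT steps above can be sketched as follows. This is a simplified stand-in, not the lookout implementation: it fits the GPD by the method of moments rather than by maximum likelihood, skips the leave-one-out step, and flags a score when its conditional exceedance probability under the fitted GPD falls below a hypothetical level `alpha`. The function names, `alpha` and the toy scores are assumptions; in practice the scores would be the -log kde values.

```python
import math
import statistics

def fit_gpd_moments(exceedances):
    """Method-of-moments GPD fit; returns (shape, scale). Assumes shape < 1/2."""
    m = statistics.fmean(exceedances)
    v = statistics.variance(exceedances)
    ratio = m * m / v
    return 0.5 * (1.0 - ratio), 0.5 * m * (ratio + 1.0)

def gpd_survival(y, shape, scale):
    """P(Y > y) for a GPD with the given shape and scale, y >= 0."""
    if abs(shape) < 1e-12:
        return math.exp(-y / scale)
    base = 1.0 + shape * y / scale
    if base <= 0.0:          # beyond the upper end point when shape < 0
        return 0.0
    return base ** (-1.0 / shape)

def pot_flags(scores, quantile=0.90, alpha=0.05):
    """Flag scores whose exceedance over the threshold is improbable under the GPD."""
    ordered = sorted(scores)
    u = ordered[int(quantile * (len(ordered) - 1))]   # empirical threshold
    exceed = [s - u for s in scores if s > u]
    shape, scale = fit_gpd_moments(exceed)
    return [s > u and gpd_survival(s - u, shape, scale) < alpha for s in scores]

# Hypothetical toy scores: 99 unremarkable values and one extreme value.
scores = [0.01 * i for i in range(99)] + [5.0]
flags = pot_flags(scores)
print([i for i, f in enumerate(flags) if f])
```

Only the extreme score is flagged in this toy example; the other exceedances sit well inside the fitted tail, which is how an EVT cut-off avoids flagging every low-density point.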

Anomaly Persistence
• What if a data-point is identified
as an anomaly for different
bandwidth values?
• Visual representation of
anomaly persistence
• Big picture
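One simple way to quantify this idea is to run the detector over a grid of bandwidths and record, for each point, the share of bandwidths at which it is flagged. The sketch below assumes this reading of persistence; the function name and the flagged sets are hypothetical, and the plots in the talk may encode more than this (for example anomaly strength).

```python
def anomaly_persistence(flags_by_bandwidth):
    """Map each flagged index to the fraction of bandwidths at which it was flagged."""
    n_bw = len(flags_by_bandwidth)
    counts = {}
    for flagged in flags_by_bandwidth.values():
        for idx in flagged:
            counts[idx] = counts.get(idx, 0) + 1
    return {idx: c / n_bw for idx, c in sorted(counts.items())}

# Hypothetical detector output: bandwidth -> set of flagged indices.
flags_by_bandwidth = {
    0.2: {17, 501, 502},
    0.4: {501, 502},
    0.6: {501, 502},
    0.8: {501},
}
persistence = anomaly_persistence(flags_by_bandwidth)
print(persistence)
```

A point that is flagged at every bandwidth scores 1.0, while a point flagged at a single bandwidth scores low and is more likely an artefact of that particular parameter choice.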

Example 1
2D normal distribution, with outliers at the far end.
The outlying indices are 501 - 505
The persistence diagram. The outliers get identified
for a large range of bandwidth values.

Example 2
2D bimodal distribution, with outliers in the trough.
The outliers have indices 1001 - 1005
The persistence diagram. Again, the outliers
get identified for a large range of bandwidth values.

Example 4
Points in an annulus with anomalies in the middle.
Anomalies have indices 1001 - 1010
The persistence diagram.

Practical advantages of lookout
The user does not need to specify a bandwidth parameter
• The user can be anyone – not necessarily a statistician
EVT based methods have low false positive rates
• Attractive for many applications
• Not an alarm factory

Summary
• lookout - an EVT-based method to find anomalies (using TDA)
• R package lookout is on CRAN
• Preprint available
• https://bit.ly/lookoutliers
