SlideShare a Scribd company logo
1 of 31
ANOMALY DETECTION ALGORITHMS
AND TECHNIQUES FOR REAL-WORLD
SYSTEMS
Manojit Nandi
STEALTHbits Technologies
@mnandi92
OUTLINE
Overview of anomaly detection
Data Streaming
Density-based Anomaly Detection
Time Series based Methods
Risk Scores and Testing
WHAT ARE ANOMALIES?
Hard to define; “You’ll know it when you see it”.
Generally, anomalies are “anything that noticeably different” from the
expected.
One important thing to keep in mind is that what is considered anomalous
now may not be considered anomalous in the future.
Different approaches to anomaly detection
We can develop a statistical model of
normal behavior. Then we can test how
likely an observation is under the model.
We can take a machine learning approach
and use a classifier to classify data
points as “normal” or “anomalous”.
In this talk, I’m going to cover algorithms
specially designed to detect anomalies. Photo Credit: Microsoft Azure
Anomalies in Data Streams
Problem: We have data streaming in continuously, and we want to identify
anomalies in real-time.
Constraint: We can only examine the last 100 events in our sliding window.
In data streaming problems, we are “restricted” to quick-and-dirty methods
due to the limited memory and need for rapid action.
Heuristic: Z-Scores
Z-scores are often used as a test statistic to
measure the “extremeness” of an
observation in statistical hypothesis
testing.
How many standard deviations away from
the mean is a particular observation?
Moving Averages and Moving St. Dev.
As the data comes in, we keep track of the average and the standard
deviation of the last n data points.
For each new data point, update the average and the standard deviation.
Using the new average and standard deviation, compute the z-score for this
data point.
If the Z-score exceeds some threshold, flag the data point as anomalous.
Standard Deviation not robust
Standard deviation (and mean, to a lesser
extend) is highly sensitive to extreme
values.
One extreme value can drastically increase
the standard deviation.
As a result, the Z-scores for other data points
dramatically decreases as well.
Mathematics of Robustness
The arithmetic mean s is the number which solves the following
minimization problem:
The median m is the number which solves the following minimization
problem:
Visualizing
robustness
Mean vs. Outlier St. Dev vs. Outlier
Median Absolute Deviation
The Median Absolute Deviation provides a robust way to measure the
“spread” of the data.
Essentially, the median of the deviations from the “center” (the median).
Provides a more robust measure of “spread” compared to standard
deviation.
Median Absolute Deviation - Code
Modified Z-Scores
Now, we can use the median and the
MAD to compute the modified Z-
score for each data point.
We then use the modified z-score to
perform statistical hypothesis testing,
in the same manner as the standard
z-score.
Density-Based Anomaly Detection
We have a bunch of points in
some n-dimensional space.
Which ones are “noticeably”
different from the others.
Probabilistic Density: A Primer
In statistics, we assume all data is
generated according to some
probability distribution.
The goal of density-based methods is to
estimate the underlying probability
density function, based on the data.
Many density-based methods, such as
DBSCAN and Level Set Tree Clustering.
Local Outlier
Factor
Goal: Quantify the relative density
about a particular data point.
Intuition: The anomalies should be
more isolated compared to
“normal” data points.
LOF estimates density by looking at
a small neighborhood about each
point.
K-Distance
For each datapoint, compute the
distance to the Kth-nearest
neighbor.
Meta-heuristic: The K-distance gives
us a notion of “volume”.
The more isolated a point is, the
larger its k-distance.
Reachability Distance
Reachability-Distance(A,B) =
Max(K-Dist(B), Dist(A,B))
“Do your close neighbors see you
as one of their close
neighbors”.
Local Reachability
Density
Now, we are going to estimate the local density about each point.
For each data point, compute the average reachability-distance to its K-
nearest neighbors.
The Local Reachability Density (LRD) of a data point A is defined as the
inverse of this average reachability-distance.
Local Outlier
Factor
LOF score is the ratio of point
A’s density to the average
density of its neighbors.
Outliers come from less dense
areas, so the ratio is higher for
outliers.
Interpreting LOF scores
Normal points have LOF scores between 1 and 1.5.
Anomalous points have much higher LOF scores.
If a point has a LOF score of 3, then this means the average density of this
point’s neighbors is about 3x more dense than its local density.
Time Series based Anomalies
Given activity indexed by time, can we identify extreme spikes and troughs in
the time series.
Want to find global anomalies and local anomalies.
Source: Twitter Engineering Blog
Seasonal Hybrid - ESD
Algorithm invented at Twitter in 2015.
Two components:
Seasonal Decomposition: Remove seasonal patterns from the time series
ESD: Iteratively test for outliers in the time series.
Remove periodic patterns from the time series, then identify anomalies with
the remaining “core” of the time series.
Seasonal Decomposition
Time Series Decomposition breaks a time series down into three
components:
a. Trend Component
b. Seasonal Component
c. Residual (or Random) Component
The trend component contains the “meat” of the time series that we are
interested in.
Photo Credit: Penn State University
Extreme Studentized Deviate
ESD is a statistical procedure to iteratively test for outliers in a dataset.
Specify the alpha level and the maximum number of anomalies to identify.
ESD naturally applies a statistical correction to compensate for multiple
hypothesis testing.
Extreme Studentized Deviate
1. For each datapoint, compute a G-Score (absolute value of Z-Score)
2. Take the point with the highest G-Score.
3. Using the pre-specified alpha value, compute a critical value.
4. If the G-Score of the test point is greater than the critical value, flag the
point as anomalous.
5. Remove this point from the data.
Seasonal Hybrid - ESD: Example
Photo Credit: Twitter Engineering Blog
Risk Scores
When doing anomaly detection in practice, you want to treat it more as a
regression problem rather than a classification problem.
Best practices recommend calculating the likelihood that a data point or
event is anomalous and converting the likelihood into a risk score.
Risk scores should be from 0-100 because people more intuitively
understand this scale.
Testing your algorithms
Be sure to test algorithms in different environments. Just because it
performs well in one environment does not mean it will generalize well to
other environments.
Since anomalies are rare, create synthetic datasets with built-in anomalies. If
you can’t identify the built-in anomalies, then you have a problem.
You should be constantly testing and fine-tuning your algorithms, so I
recommend building a test harness to automate testing.
Questions?

More Related Content

What's hot

Say "Hi!" to Your New Boss
Say "Hi!" to Your New BossSay "Hi!" to Your New Boss
Say "Hi!" to Your New BossAndreas Dewes
 
Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use casesSridhar Ratakonda
 
Data Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics MethodologyData Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics MethodologyRupak Roy
 
Chapter 8 review
Chapter 8 reviewChapter 8 review
Chapter 8 reviewdrahkos1
 
Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)Abhimanyu Dwivedi
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Sri Ambati
 
Point and Interval Estimation
Point and Interval EstimationPoint and Interval Estimation
Point and Interval EstimationShubham Mehta
 
Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!
Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!
Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!Sri Ambati
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithmsShalitha Suranga
 
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Edureka!
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised LearningSara Hooker
 
Predicting Likely Donors and Donation Amounts
Predicting Likely Donors and Donation AmountsPredicting Likely Donors and Donation Amounts
Predicting Likely Donors and Donation AmountsMichele Vincent
 

What's hot (18)

Machine learning session2
Machine learning   session2Machine learning   session2
Machine learning session2
 
Say "Hi!" to Your New Boss
Say "Hi!" to Your New BossSay "Hi!" to Your New Boss
Say "Hi!" to Your New Boss
 
Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use cases
 
Data Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics MethodologyData Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics Methodology
 
Chapter 8 review
Chapter 8 reviewChapter 8 review
Chapter 8 review
 
Estimation Theory
Estimation TheoryEstimation Theory
Estimation Theory
 
Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Chapter 8
Chapter 8Chapter 8
Chapter 8
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
 
Point and Interval Estimation
Point and Interval EstimationPoint and Interval Estimation
Point and Interval Estimation
 
Housing price prediction
Housing price predictionHousing price prediction
Housing price prediction
 
Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!
Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!
Get hands-on with Explainable AI at Machine Learning Interpretability(MLI) Gym!
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithms
 
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised Learning
 
Predicting Likely Donors and Donation Amounts
Predicting Likely Donors and Donation AmountsPredicting Likely Donors and Donation Amounts
Predicting Likely Donors and Donation Amounts
 

Viewers also liked

Anomaly detection Meetup Slides
Anomaly detection Meetup SlidesAnomaly detection Meetup Slides
Anomaly detection Meetup SlidesQuantUniversity
 
기업 분석 과제(디지털콘텐츠학과 임승일)
기업 분석 과제(디지털콘텐츠학과 임승일)기업 분석 과제(디지털콘텐츠학과 임승일)
기업 분석 과제(디지털콘텐츠학과 임승일)SeungIl Lim
 
Cancer Outlier Pro file Analysis using Apache Spark
Cancer Outlier Profile Analysis using Apache SparkCancer Outlier Profile Analysis using Apache Spark
Cancer Outlier Pro file Analysis using Apache SparkMahmoud Parsian
 
Statistics tables grubb's test
Statistics tables grubb's testStatistics tables grubb's test
Statistics tables grubb's testcamiloaquintero
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection철 김
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for SecurityCody Rioux
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attackQrator Labs
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisManojit Nandi
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environmentsAlois Reitbauer
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alertsAlois Reitbauer
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing testAlois Reitbauer
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production AlertingAlois Reitbauer
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. Alois Reitbauer
 
SSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's InceptionSSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's Inceptionroyrapoport
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metricsroyrapoport
 
Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014royrapoport
 

Viewers also liked (20)

Anomaly detection Meetup Slides
Anomaly detection Meetup SlidesAnomaly detection Meetup Slides
Anomaly detection Meetup Slides
 
기업 분석 과제(디지털콘텐츠학과 임승일)
기업 분석 과제(디지털콘텐츠학과 임승일)기업 분석 과제(디지털콘텐츠학과 임승일)
기업 분석 과제(디지털콘텐츠학과 임승일)
 
Cancer Outlier Pro file Analysis using Apache Spark
Cancer Outlier Profile Analysis using Apache SparkCancer Outlier Profile Analysis using Apache Spark
Cancer Outlier Pro file Analysis using Apache Spark
 
Statistics tables grubb's test
Statistics tables grubb's testStatistics tables grubb's test
Statistics tables grubb's test
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Anomaly Detection for Security
Anomaly Detection for SecurityAnomaly Detection for Security
Anomaly Detection for Security
 
Traffic anomaly detection and attack
Traffic anomaly detection and attackTraffic anomaly detection and attack
Traffic anomaly detection and attack
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysis
 
Monitoring large scale Docker production environments
Monitoring large scale Docker production environmentsMonitoring large scale Docker production environments
Monitoring large scale Docker production environments
 
Monitoring without alerts
Monitoring without alertsMonitoring without alerts
Monitoring without alerts
 
Can a monitoring tool pass the turing test
Can a monitoring tool pass the turing testCan a monitoring tool pass the turing test
Can a monitoring tool pass the turing test
 
The Dark Art of Production Alerting
The Dark Art of Production AlertingThe Dark Art of Production Alerting
The Dark Art of Production Alerting
 
The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection. The definition of normal - An introduction and guide to anomaly detection.
The definition of normal - An introduction and guide to anomaly detection.
 
SSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's InceptionSSL Certificate Expiration and Howler Monkey's Inception
SSL Certificate Expiration and Howler Monkey's Inception
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metrics
 
Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014Python Through the Back Door: Netflix Presentation at CodeMash 2014
Python Through the Back Door: Netflix Presentation at CodeMash 2014
 

Similar to PyGotham 2016

Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data AnalysisSaad Chahine
 
Lect 2 basic ppt
Lect 2 basic pptLect 2 basic ppt
Lect 2 basic pptTao Hong
 
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxExploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxMayura shelke
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffRaman Kannan
 
1608 probability and statistics in engineering
1608 probability and statistics in engineering1608 probability and statistics in engineering
1608 probability and statistics in engineeringDr Fereidoun Dejahang
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detectionguest0edcaf
 
Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922stone55
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..butest
 
Deep learning MindMap
Deep learning MindMapDeep learning MindMap
Deep learning MindMapAshish Patel
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfDatacademy.ai
 
Theory of estimation
Theory of estimationTheory of estimation
Theory of estimationTech_MX
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmPalin analytics
 
Statistics for deep learning
Statistics for deep learningStatistics for deep learning
Statistics for deep learningSung Yub Kim
 
m2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxm2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxMesfinMelese4
 

Similar to PyGotham 2016 (20)

Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data Analysis
 
Lect 2 basic ppt
Lect 2 basic pptLect 2 basic ppt
Lect 2 basic ppt
 
Exploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptxExploratory Data Analysis Unit 1 ppt presentation.pptx
Exploratory Data Analysis Unit 1 ppt presentation.pptx
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
 
1608 probability and statistics in engineering
1608 probability and statistics in engineering1608 probability and statistics in engineering
1608 probability and statistics in engineering
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922Summer 07-mfin7011-tang1922
Summer 07-mfin7011-tang1922
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..
 
statistics
statisticsstatistics
statistics
 
Deep learning MindMap
Deep learning MindMapDeep learning MindMap
Deep learning MindMap
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Theory of estimation
Theory of estimationTheory of estimation
Theory of estimation
 
Data science
Data scienceData science
Data science
 
Decision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning AlgorithmDecision Trees for Classification: A Machine Learning Algorithm
Decision Trees for Classification: A Machine Learning Algorithm
 
Statistics for deep learning
Statistics for deep learningStatistics for deep learning
Statistics for deep learning
 
Cerdit card
Cerdit cardCerdit card
Cerdit card
 
m2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxm2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptx
 
Data science
Data scienceData science
Data science
 

Recently uploaded

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 

Recently uploaded (20)

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 

PyGotham 2016

  • 1. ANOMALY DETECTION ALGORITHMS AND TECHNIQUES FOR REAL-WORLD SYSTEMS Manojit Nandi STEALTHbits Technologies @mnandi92
  • 2. OUTLINE Overview of anomaly detection Data Streaming Density-based Anomaly Detection Time Series based Methods Risk Scores and Testing
  • 3. WHAT ARE ANOMALIES? Hard to define; “You’ll know it when you see it”. Generally, anomalies are “anything that noticeably different” from the expected. One important thing to keep in mind is that what is considered anomalous now may not be considered anomalous in the future.
  • 4. Different approaches to anomaly detection We can develop a statistical model of normal behavior. Then we can test how likely an observation is under the model. We can take a machine learning approach and use a classifier to classify data points as “normal” or “anomalous”. In this talk, I’m going to cover algorithms specially designed to detect anomalies. Photo Credit: Microsoft Azure
  • 5. Anomalies in Data Streams Problem: We have data streaming in continuously, and we want to identify anomalies in real-time. Constraint: We can only examine the last 100 events in our sliding window. In data streaming problems, we are “restricted” to quick-and-dirty methods due to the limited memory and need for rapid action.
  • 6. Heuristic: Z-Scores Z-scores are often used as a test statistic to measure the “extremeness” of an observation in statistical hypothesis testing. How many standard deviations away from the mean is a particular observation?
  • 7. Moving Averages and Moving St. Dev. As the data comes in, we keep track of the average and the standard deviation of the last n data points. For each new data point, update the average and the standard deviation. Using the new average and standard deviation, compute the z-score for this data point. If the Z-score exceeds some threshold, flag the data point as anomalous.
  • 8. Standard Deviation not robust Standard deviation (and mean, to a lesser extend) is highly sensitive to extreme values. One extreme value can drastically increase the standard deviation. As a result, the Z-scores for other data points dramatically decreases as well.
  • 9. Mathematics of Robustness The arithmetic mean s is the number which solves the following minimization problem: The median m is the number which solves the following minimization problem:
  • 11. Median Absolute Deviation The Median Absolute Deviation provides a robust way to measure the “spread” of the data. Essentially, the median of the deviations from the “center” (the median). Provides a more robust measure of “spread” compared to standard deviation.
  • 13. Modified Z-Scores Now, we can use the median and the MAD to compute the modified Z- score for each data point. We then use the modified z-score to perform statistical hypothesis testing, in the same manner as the standard z-score.
  • 14. Density-Based Anomaly Detection We have a bunch of points in some n-dimensional space. Which ones are “noticeably” different from the others.
  • 15. Probabilistic Density: A Primer In statistics, we assume all data is generated according to some probability distribution. The goal of density-based methods is to estimate the underlying probability density function, based on the data. Many density-based methods, such as DBSCAN and Level Set Tree Clustering.
  • 16. Local Outlier Factor Goal: Quantify the relative density about a particular data point. Intuition: The anomalies should be more isolated compared to “normal” data points. LOF estimates density by looking at a small neighborhood about each point.
  • 17. K-Distance For each datapoint, compute the distance to the Kth-nearest neighbor. Meta-heuristic: The K-distance gives us a notion of “volume”. The more isolated a point is, the larger its k-distance.
  • 18. Reachability Distance Reachability-Distance(A,B) = Max(K-Dist(B), Dist(A,B)) “Do your close neighbors see you as one of their close neighbors”.
  • 19. Local Reachability Density Now, we are going to estimate the local density about each point. For each data point, compute the average reachability-distance to its K- nearest neighbors. The Local Reachability Density (LRD) of a data point A is defined as the inverse of this average reachability-distance.
  • 20. Local Outlier Factor LOF score is the ratio of point A’s density to the average density of its neighbors. Outliers come from less dense areas, so the ratio is higher for outliers.
  • 21. Interpreting LOF scores Normal points have LOF scores between 1 and 1.5. Anomalous points have much higher LOF scores. If a point has a LOF score of 3, then this means the average density of this point’s neighbors is about 3x more dense than its local density.
  • 22. Time Series based Anomalies Given activity indexed by time, can we identify extreme spikes and troughs in the time series. Want to find global anomalies and local anomalies. Source: Twitter Engineering Blog
  • 23. Seasonal Hybrid - ESD Algorithm invented at Twitter in 2015. Two components: Seasonal Decomposition: Remove seasonal patterns from the time series ESD: Iteratively test for outliers in the time series. Remove periodic patterns from the time series, then identify anomalies with the remaining “core” of the time series.
  • 24. Seasonal Decomposition Time Series Decomposition breaks a time series down into three components: a. Trend Component b. Seasonal Component c. Residual (or Random) Component The trend component contains the “meat” of the time series that we are interested in.
  • 25. Photo Credit: Penn State University
  • 26. Extreme Studentized Deviate ESD is a statistical procedure to iteratively test for outliers in a dataset. Specify the alpha level and the maximum number of anomalies to identify. ESD naturally applies a statistical correction to compensate for multiple hypothesis testing.
  • 27. Extreme Studentized Deviate 1. For each datapoint, compute a G-Score (absolute value of Z-Score) 2. Take the point with the highest G-Score. 3. Using the pre-specified alpha value, compute a critical value. 4. If the G-Score of the test point is greater than the critical value, flag the point as anomalous. 5. Remove this point from the data.
  • 28. Seasonal Hybrid - ESD: Example Photo Credit: Twitter Engineering Blog
  • 29. Risk Scores When doing anomaly detection in practice, you want to treat it more as a regression problem rather than a classification problem. Best practices recommend calculating the likelihood that a data point or event is anomalous and converting the likelihood into a risk score. Risk scores should be from 0-100 because people more intuitively understand this scale.
  • 30. Testing your algorithms Be sure to test algorithms in different environments. Just because it performs well in one environment does not mean it will generalize well to other environments. Since anomalies are rare, create synthetic datasets with built-in anomalies. If you can’t identify the built-in anomalies, then you have a problem. You should be constantly testing and fine-tuning your algorithms, so I recommend building a test harness to automate testing.