SlideShare a Scribd company logo
1 of 56
Download to read offline
© MapR Technologies, confidential
®
© MapR Technologies, confidential
®
Anomaly Detection
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Agenda
•  What is anomaly detection?
•  Some examples
•  Some generalization
•  More interesting examples
•  Sample implementation methods
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Who I am
•  Ted Dunning, Chief Application Architect, MapR
tdunning@mapr.com
tdunning@apache.org
@ted_dunning
•  Committer, mentor, champion, PMC member on
several Apache projects
•  Mahout, Drill, Zookeeper others
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Who we are
•  MapR makes the technology leading distribution
including Hadoop
•  MapR integrates real-time data semantics
directly into a system that also runs Hadoop
programs seamlessly
•  The biggest and best choose MapR
–  Google, Amazon
–  Largest credit card, retailer, health insurance, telco
–  Ping me for info
© MapR Technologies, confidential
®
© MapR Technologies, confidential
What is Anomaly Detection?
•  What just happened that shouldn’t?
–  but I don’t know what failure looks like (yet)
•  Find the problem before other people see it
–  especially customers and CEO’s
•  But don’t wake me up if it isn’t really broken
© MapR Technologies, confidential
®
© MapR Technologies, confidential
®
Looks pretty
anomalous
to me
© MapR Technologies, confidential
®
Will the real anomaly
please stand up?
© MapR Technologies, confidential
®
© MapR Technologies, confidential
What Are We Really Doing
•  We want action when something breaks
(dies/falls over/otherwise gets in trouble)
•  But action is expensive
•  So we don’t want false alarms
•  And we don’t want false negatives
•  We need to trade off costs
© MapR Technologies, confidential
®
A Second Look
© MapR Technologies, confidential
®
A Second Look
99.9%-ile
© MapR Technologies, confidential
®
With Spikes
© MapR Technologies, confidential
®
With Spikes
99.9%-ile including spikes
© MapR Technologies, confidential
®
How Hard Can it Be?
Online
Summarizer 99.9%-ile
t
x > t ? Alarm !
x
© MapR Technologies, confidential
®
© MapR Technologies, confidential
On-line Percentile Estimates
•  Apache Mahout has on-line percentile estimator
–  very high accuracy for extreme tails
–  new in version 0.9 !!
•  What’s the big deal with anomaly detection?
•  This looks like a solved problem
© MapR Technologies, confidential
®
Spot the Anomaly
Anomaly?
© MapR Technologies, confidential
®
Maybe not!
© MapR Technologies, confidential
®
Where’s Waldo?
This is the real
anomaly
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Normal Isn’t Just Normal
•  What we want is a model of what is normal
•  What doesn’t fit the model is the anomaly
•  For simple signals, the model can be simple …
•  The real world is rarely so accommodating
x ~ N(0,ε)
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
We Do Windows
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Windows on the World
•  The set of windowed signals is a nice model of
our original signal
•  Clustering can find the prototypes
–  Fancier techniques available using sparse coding
•  The result is a dictionary of shapes
•  New signals can be encoded by shifting, scaling
and adding shapes from the dictionary
© MapR Technologies, confidential
®
Most Common Shapes (for EKG)
© MapR Technologies, confidential
®
Reconstructed signal
Original
signal
Reconstructed
signal
Reconstruction
error
© MapR Technologies, confidential
®
An Anomaly
Original technique for
finding 1-d anomaly works
against reconstruction error
© MapR Technologies, confidential
®
Close-up of anomaly
Not what you want your
heart to do.
And not what the model
expects it to do.
© MapR Technologies, confidential
®
A Different Kind of Anomaly
© MapR Technologies, confidential
®
Model Delta Anomaly Detection
Online
Summarizer
δ > t ?
99.9%-ile
t
Alarm !
Model
-
+ δ
© MapR Technologies, confidential
®
© MapR Technologies, confidential
The Real Inside Scoop
•  The model-delta anomaly detector is really just a
mixture distribution
–  the model we know about already
–  and a normally distributed error
•  The output (delta) is (roughly) the log probability
of the mixture distribution
•  Thinking about probability distributions is good
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Example: Event Stream (timing)
•  Events of various types arrive at irregular
intervals
–  we can assume Poisson distribution
•  The key question is whether frequency has
changed relative to expected values
•  Want alert as soon as possible
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Poisson Distribution
•  Time between events is exponentially distributed
•  This means that long delays are exponentially
rare
•  If we know λ we can select a good threshold
–  or we can pick a threshold empirically
Δt ~ λe−λt
P(Δt > T) = e−λT
−logP(Δt > T) = λT
© MapR Technologies, confidential
®
Converting Event Times to Anomaly
99.9%-ile
99.99%-ile
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Example: Event Stream (sequence)
•  The scenario:
–  Web visitors are being subjected to phishing attacks
–  The hook directs the visitors to a mocked login
–  When they enter their credentials, the attacker uses
them to log in
–  We assume captcha’s or similar are part of the
authentication so the attacker has to show the images
on the login page to the visitor
•  The problem:
–  We don’t really know how the attack works
© MapR Technologies, confidential
®
Normal Event Flow
Image-1
Image-2
Image-3
Image-4
login
© MapR Technologies, confidential
®
Phishing Flow
Image-
1
Image-
1
Image-
1
Image-
1
login
Image-
1
Image-
1
Image-
1
Image-
1
login
Fake login
Real login
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Key Observations
•  Regardless of exact details, there are patterns
•  Event stream per user shows these patterns
•  Phishing will have different patterns at much
lower rate
•  Measuring statistical surprise gives a good
anomaly (fraud or malfunction) indicator
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Recap (out of order)
•  Anomaly detection is best done with a probability
model
•  Deep learning is a neat way to build this model
–  converting to symbolic dynamics simplifies life
•  -log p is a good way to convert to anomaly
measure
•  Adaptive quantile estimation works for auto-
setting thresholds
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Recap
•  Different systems require different models
•  Continuous time-series
–  sparse coding or deep learning to build signal model
•  Events in time
–  rate model base on variable rate Poisson
–  segregated rate model
•  Events with labels
–  language modeling
–  hidden Markov models
© MapR Technologies, confidential
®
© MapR Technologies, confidential
How Do I Build Such a System
•  The key is to combine real-time and long-time
–  real-time evaluates data stream against model
–  long-time is how we build the model
•  Extended Lambda architecture is my favorite
•  See my other talks on slideshare.net for info
•  Ping me directly
© MapR Technologies, confidential
®
t
now
Hadoop is Not Very Real-time
Unprocessed
Data
Fully
processed
Latest
full
period
Hadoop
job takes
this long
for this
data
© MapR Technologies, confidential
®
t
now
Hadoop works
great back here
Storm
works
here
Real-time and Long-time together
Blended
view
Blended
view
Blended
View
© MapR Technologies, confidential
®
© MapR Technologies, confidential
Who I am
•  Ted Dunning, Chief Application Architect, MapR
tdunning@mapr.com
tdunning@apache.org
@ted_dunning
•  Committer, mentor, champion, PMC member on
several Apache projects
•  Mahout, Drill, Zookeeper others
© MapR Technologies, confidential
®
© MapR Technologies, confidential ®

More Related Content

What's hot

Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...Edge AI and Vision Alliance
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceTed Dunning
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-dataTed Dunning
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitRafael Arana
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of ComputingIntel Nervana
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...Edge AI and Vision Alliance
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...Edge AI and Vision Alliance
 
First-ever scalable, distributed deep learning architecture using Spark & Tac...
First-ever scalable, distributed deep learning architecture using Spark & Tac...First-ever scalable, distributed deep learning architecture using Spark & Tac...
First-ever scalable, distributed deep learning architecture using Spark & Tac...Arimo, Inc.
 
Nervana Systems
Nervana SystemsNervana Systems
Nervana SystemsNand Dalal
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series PolarSeven Pty Ltd
 
An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)Julien SIMON
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Rakuten Group, Inc.
 
Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)Julien SIMON
 
Uncertainties in large scale power systems
Uncertainties in large scale power systemsUncertainties in large scale power systems
Uncertainties in large scale power systemsOlivier Teytaud
 

What's hot (19)

Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-data
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks Summit
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 
First-ever scalable, distributed deep learning architecture using Spark & Tac...
First-ever scalable, distributed deep learning architecture using Spark & Tac...First-ever scalable, distributed deep learning architecture using Spark & Tac...
First-ever scalable, distributed deep learning architecture using Spark & Tac...
 
Nervana Systems
Nervana SystemsNervana Systems
Nervana Systems
 
presentation
presentationpresentation
presentation
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series
 
An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)
 
London hug
London hugLondon hug
London hug
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)
 
Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)
 
Uncertainties in large scale power systems
Uncertainties in large scale power systemsUncertainties in large scale power systems
Uncertainties in large scale power systems
 

Similar to Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01

Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15MLconf
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...
Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...
Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...Professor Lili Saghafi
 
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
Observability -  The good, the bad and the ugly Xp Days 2019 Kiev Ukraine Observability -  The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine Aleksandr Tavgen
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationImpetus Technologies
 
InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...
InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...
InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...Moshe Zioni
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningMapR Technologies
 
Anomalies and events keep us on our toes
Anomalies and events keep us on our toesAnomalies and events keep us on our toes
Anomalies and events keep us on our toesCSIRO
 
Using Time Series for Full Observability of a SaaS Platform
Using Time Series for Full Observability of a SaaS PlatformUsing Time Series for Full Observability of a SaaS Platform
Using Time Series for Full Observability of a SaaS PlatformDevOps.com
 
2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data AnalyticsAllen Day, PhD
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveTed Dunning
 
Machine Learning Introduction introducing basics of Machine Learning
Machine Learning Introduction introducing basics of Machine LearningMachine Learning Introduction introducing basics of Machine Learning
Machine Learning Introduction introducing basics of Machine Learningvipulkondekar
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Dieter Plaetinck
 

Similar to Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01 (20)

Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...
Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...
Quantum Computers new Generation of Computers part 7 by prof lili saghafi Qua...
 
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
Observability -  The good, the bad and the ugly Xp Days 2019 Kiev Ukraine Observability -  The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
 
InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...
InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...
InfoSecurity Europe 2017 - On The Hunt for Advanced Attacks? C&C Channels are...
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap Learning
 
Anomalies and events keep us on our toes
Anomalies and events keep us on our toesAnomalies and events keep us on our toes
Anomalies and events keep us on our toes
 
Using Time Series for Full Observability of a SaaS Platform
Using Time Series for Full Observability of a SaaS PlatformUsing Time Series for Full Observability of a SaaS Platform
Using Time Series for Full Observability of a SaaS Platform
 
2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
 
Machine Learning Introduction introducing basics of Machine Learning
Machine Learning Introduction introducing basics of Machine LearningMachine Learning Introduction introducing basics of Machine Learning
Machine Learning Introduction introducing basics of Machine Learning
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 

Recently uploaded (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 

Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01

  • 1. © MapR Technologies, confidential ® © MapR Technologies, confidential ® Anomaly Detection
  • 2. © MapR Technologies, confidential ® © MapR Technologies, confidential Agenda •  What is anomaly detection? •  Some examples •  Some generalization •  More interesting examples •  Sample implementation methods
  • 3. © MapR Technologies, confidential ® © MapR Technologies, confidential Who I am •  Ted Dunning, Chief Application Architect, MapR tdunning@mapr.com tdunning@apache.org @ted_dunning •  Committer, mentor, champion, PMC member on several Apache projects •  Mahout, Drill, Zookeeper others
  • 4. © MapR Technologies, confidential ® © MapR Technologies, confidential Who we are •  MapR makes the technology leading distribution including Hadoop •  MapR integrates real-time data semantics directly into a system that also runs Hadoop programs seamlessly •  The biggest and best choose MapR –  Google, Amazon –  Largest credit card, retailer, health insurance, telco –  Ping me for info
  • 5. © MapR Technologies, confidential ® © MapR Technologies, confidential What is Anomaly Detection? •  What just happened that shouldn’t? –  but I don’t know what failure looks like (yet) •  Find the problem before other people see it –  especially customers and CEO’s •  But don’t wake me up if it isn’t really broken
  • 6. © MapR Technologies, confidential ®
  • 7. © MapR Technologies, confidential ® Looks pretty anomalous to me
  • 8. © MapR Technologies, confidential ® Will the real anomaly please stand up?
  • 9. © MapR Technologies, confidential ® © MapR Technologies, confidential What Are We Really Doing •  We want action when something breaks (dies/falls over/otherwise gets in trouble) •  But action is expensive •  So we don’t want false alarms •  And we don’t want false negatives •  We need to trade off costs
  • 10. © MapR Technologies, confidential ® A Second Look
  • 11. © MapR Technologies, confidential ® A Second Look 99.9%-ile
  • 12. © MapR Technologies, confidential ® With Spikes
  • 13. © MapR Technologies, confidential ® With Spikes 99.9%-ile including spikes
  • 14. © MapR Technologies, confidential ® How Hard Can it Be? Online Summarizer 99.9%-ile t x > t ? Alarm ! x
  • 15. © MapR Technologies, confidential ® © MapR Technologies, confidential On-line Percentile Estimates •  Apache Mahout has on-line percentile estimator –  very high accuracy for extreme tails –  new in version 0.9 !! •  What’s the big deal with anomaly detection? •  This looks like a solved problem
  • 16. © MapR Technologies, confidential ® Spot the Anomaly Anomaly?
  • 17. © MapR Technologies, confidential ® Maybe not!
  • 18. © MapR Technologies, confidential ® Where’s Waldo? This is the real anomaly
  • 19. © MapR Technologies, confidential ® © MapR Technologies, confidential Normal Isn’t Just Normal •  What we want is a model of what is normal •  What doesn’t fit the model is the anomaly •  For simple signals, the model can be simple … •  The real world is rarely so accommodating x ~ N(0,ε)
  • 20. © MapR Technologies, confidential ® We Do Windows
  • 21. © MapR Technologies, confidential ® We Do Windows
  • 22. © MapR Technologies, confidential ® We Do Windows
  • 23. © MapR Technologies, confidential ® We Do Windows
  • 24. © MapR Technologies, confidential ® We Do Windows
  • 25. © MapR Technologies, confidential ® We Do Windows
  • 26. © MapR Technologies, confidential ® We Do Windows
  • 27. © MapR Technologies, confidential ® We Do Windows
  • 28. © MapR Technologies, confidential ® We Do Windows
  • 29. © MapR Technologies, confidential ® We Do Windows
  • 30. © MapR Technologies, confidential ® We Do Windows
  • 31. © MapR Technologies, confidential ® We Do Windows
  • 32. © MapR Technologies, confidential ® We Do Windows
  • 33. © MapR Technologies, confidential ® We Do Windows
  • 34. © MapR Technologies, confidential ® We Do Windows
  • 35. © MapR Technologies, confidential ® © MapR Technologies, confidential Windows on the World •  The set of windowed signals is a nice model of our original signal •  Clustering can find the prototypes –  Fancier techniques available using sparse coding •  The result is a dictionary of shapes •  New signals can be encoded by shifting, scaling and adding shapes from the dictionary
  • 36. © MapR Technologies, confidential ® Most Common Shapes (for EKG)
  • 37. © MapR Technologies, confidential ® Reconstructed signal Original signal Reconstructed signal Reconstruction error
  • 38. © MapR Technologies, confidential ® An Anomaly Original technique for finding 1-d anomaly works against reconstruction error
  • 39. © MapR Technologies, confidential ® Close-up of anomaly Not what you want your heart to do. And not what the model expects it to do.
  • 40. © MapR Technologies, confidential ® A Different Kind of Anomaly
  • 41. © MapR Technologies, confidential ® Model Delta Anomaly Detection Online Summarizer δ > t ? 99.9%-ile t Alarm ! Model - + δ
  • 42. © MapR Technologies, confidential ® © MapR Technologies, confidential The Real Inside Scoop •  The model-delta anomaly detector is really just a mixture distribution –  the model we know about already –  and a normally distributed error •  The output (delta) is (roughly) the log probability of the mixture distribution •  Thinking about probability distributions is good
  • 43. © MapR Technologies, confidential ® © MapR Technologies, confidential Example: Event Stream (timing) •  Events of various types arrive at irregular intervals –  we can assume Poisson distribution •  The key question is whether frequency has changed relative to expected values •  Want alert as soon as possible
  • 44. © MapR Technologies, confidential ® © MapR Technologies, confidential Poisson Distribution •  Time between events is exponentially distributed •  This means that long delays are exponentially rare •  If we know λ we can select a good threshold –  or we can pick a threshold empirically Δt ~ λe−λt P(Δt > T) = e−λT −logP(Δt > T) = λT
  • 45. © MapR Technologies, confidential ® Converting Event Times to Anomaly 99.9%-ile 99.99%-ile
  • 46. © MapR Technologies, confidential ® © MapR Technologies, confidential Example: Event Stream (sequence) •  The scenario: –  Web visitors are being subjected to phishing attacks –  The hook directs the visitors to a mocked login –  When they enter their credentials, the attacker uses them to log in –  We assume captcha’s or similar are part of the authentication so the attacker has to show the images on the login page to the visitor •  The problem: –  We don’t really know how the attack works
  • 47. © MapR Technologies, confidential ® Normal Event Flow Image-1 Image-2 Image-3 Image-4 login
  • 48. © MapR Technologies, confidential ® Phishing Flow Image- 1 Image- 1 Image- 1 Image- 1 login Image- 1 Image- 1 Image- 1 Image- 1 login Fake login Real login
  • 49. © MapR Technologies, confidential ® © MapR Technologies, confidential Key Observations •  Regardless of exact details, there are patterns •  Event stream per user shows these patterns •  Phishing will have different patterns at much lower rate •  Measuring statistical surprise gives a good anomaly (fraud or malfunction) indicator
  • 50. © MapR Technologies, confidential ® © MapR Technologies, confidential Recap (out of order) •  Anomaly detection is best done with a probability model •  Deep learning is a neat way to build this model –  converting to symbolic dynamics simplifies life •  -log p is a good way to convert to anomaly measure •  Adaptive quantile estimation works for auto- setting thresholds
  • 51. © MapR Technologies, confidential ® © MapR Technologies, confidential Recap •  Different systems require different models •  Continuous time-series –  sparse coding or deep learning to build signal model •  Events in time –  rate model base on variable rate Poisson –  segregated rate model •  Events with labels –  language modeling –  hidden Markov models
  • 52. © MapR Technologies, confidential ® © MapR Technologies, confidential How Do I Build Such a System •  The key is to combine real-time and long-time –  real-time evaluates data stream against model –  long-time is how we build the model •  Extended Lambda architecture is my favorite •  See my other talks on slideshare.net for info •  Ping me directly
  • 53. © MapR Technologies, confidential ® t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  • 54. © MapR Technologies, confidential ® t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  • 55. © MapR Technologies, confidential ® © MapR Technologies, confidential Who I am •  Ted Dunning, Chief Application Architect, MapR tdunning@mapr.com tdunning@apache.org @ted_dunning •  Committer, mentor, champion, PMC member on several Apache projects •  Mahout, Drill, Zookeeper others
  • 56. © MapR Technologies, confidential ® © MapR Technologies, confidential ®