© 2014 MapR Technologies
Ted Dunning
Steps in Anomaly Detection
•  Build a model: Collect and process data for training a model
•  Use the machine learning model to determine what the normal
pattern is
•  Decide how far from this normal pattern a point must be to
count as anomalous
•  Use the AD model to detect anomalies in new data
–  Methods such as clustering for discovery can be helpful
How hard is it to set an alert for anomalies?
Grey data is from normal events; x’s are anomalies.
Where would you set the threshold?
Basic idea:

Set adaptive thresholds
What Are We Really Doing
•  We want action when something breaks
(dies/falls over/otherwise gets in trouble)
•  But action is expensive
•  So we don’t want too many false alarms
•  And we don’t want too many false negatives
•  What’s the right threshold to set for alerts?
–  We need to trade off costs
A Second Look
99.9%-ile
Cool algorithm: t-digest
How Hard Can it Be?
[Diagram: each new value x is fed to an online summarizer that tracks the 99.9%-ile as threshold t; if x > t, raise an alarm.]
Using t-Digest
•  The t-digest is an on-line percentile estimator
–  very high accuracy for extreme tails
•  t-digest also available everywhere
–  in ElasticSearch, in Solr
–  in streamlib (open source library on github)
–  in Mahout Math (open source library on github)
–  standalone (github and Maven Central)
•  Very handy for general distributions, few assumptions
•  For latency, exponential binning may be useful
–  See, for instance, HdrHistogram
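The alarm rule from the "How Hard Can it Be?" slide (alarm when x exceeds the running 99.9%-ile) can be sketched in a few lines. This is a minimal sketch and not t-digest itself: it keeps every observation in a sorted list, which is exact but uses memory proportional to the stream length, where a real deployment would presumably use a t-digest instead. The class name `QuantileAlarm`, its parameters, and the warm-up rule are illustrative assumptions.

```python
import bisect
import random


class QuantileAlarm:
    """Adaptive threshold: alarm when x exceeds the running q-quantile.

    Stand-in for a t-digest: stores all samples in a sorted list, so it is
    exact but uses O(n) memory. Names and defaults are illustrative.
    """

    def __init__(self, q=0.999, warmup=1000):
        self.q = q
        self.warmup = warmup      # samples to see before alarming at all
        self.samples = []         # kept sorted at all times

    def threshold(self):
        """Current estimate of the q-quantile (None until data arrives)."""
        if not self.samples:
            return None
        idx = min(len(self.samples) - 1, int(self.q * len(self.samples)))
        return self.samples[idx]

    def update(self, x):
        """Check x against the current threshold, then absorb it."""
        t = self.threshold()
        alarm = (t is not None
                 and len(self.samples) >= self.warmup
                 and x > t)
        bisect.insort(self.samples, x)
        return alarm


if __name__ == "__main__":
    rng = random.Random(42)
    detector = QuantileAlarm()
    for _ in range(5000):
        detector.update(rng.gauss(0.0, 1.0))
    print("threshold:", round(detector.threshold(), 2))
    print("alarm on 10.0:", detector.update(10.0))
```

Because the threshold is re-estimated from the data, it follows the distribution instead of being fixed by hand, which is the point of the "set adaptive thresholds" slide.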
So are we all done?
What About This?
[Figure: signal built from offset + noise + pulse1 + pulse2, plotted over 0 to 15 with values from −2 to 10; two points are labeled A and B.]
Model Delta Anomaly Detection
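[Diagram: the model's output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm.]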
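The model-delta idea can be sketched as follows: subtract what the model predicts from what was observed, then threshold the size of the difference. This is a minimal sketch under stated assumptions: the sinusoidal `model`, the noise level, and the injected pulse are all hypothetical, the threshold is computed once from training data with an exact quantile where an online summarizer would normally be used, and the squared delta is used as the score.

```python
import math
import random


def model(t):
    """Hypothetical model of normal behaviour: a known periodic baseline."""
    return math.sin(2 * math.pi * t / 50.0)


def deltas(signal):
    """delta = observed - predicted, for each time step."""
    return [x - model(t) for t, x in enumerate(signal)]


def quantile(values, q):
    """Exact empirical quantile (an online summarizer would estimate this)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def detect(signal, threshold):
    """Indices where the squared delta exceeds the threshold."""
    return [t for t, d in enumerate(deltas(signal)) if d * d > threshold]


rng = random.Random(0)
normal = [model(t) + rng.gauss(0.0, 0.1) for t in range(2000)]
threshold = quantile([d * d for d in deltas(normal)], 0.999)

test = [model(t) + rng.gauss(0.0, 0.1) for t in range(500)]
test[123] += 1.0   # injected pulse, no larger than the signal's own swing
print(detect(test, threshold))
```

The injected pulse stays within the range of the raw signal, so a threshold on x alone would not isolate it; it only stands out once the model's prediction has been subtracted.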
Spot the Anomaly
Anomaly?
Maybe not!
Where’s Waldo?
This is the real
anomaly
Normal Isn’t Just Normal
•  What we want is a model of what is normal
•  What doesn’t fit the model is the anomaly
•  For simple signals, the model can be simple …
•  The real world is rarely so accommodating
x ~ m(t)+ N(0,ε)
We Do Windows
Windows on the World
•  The set of windowed signals is a nice model of our original signal
•  Clustering can find the prototypes
–  Fancier techniques available using sparse coding
•  The result is a dictionary of shapes
•  New signals can be encoded by shifting, scaling and adding
shapes from the dictionary
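The windowing and clustering steps above can be sketched in plain Python. This is a minimal sketch, not the code from the talk: the synthetic sine signal, the window width and step, and the value of k are assumptions chosen for illustration, and the clustering is plain Lloyd's k-means with no windowing function or normalization applied.

```python
import math
import random


def windows(signal, width, step):
    """Slice a 1-d signal into overlapping windows of fixed width."""
    return [signal[i:i + width]
            for i in range(0, len(signal) - width + 1, step)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids (the 'dictionary')."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


# Synthetic periodic signal standing in for real training data.
signal = [math.sin(2 * math.pi * t / 32.0) for t in range(2000)]
dictionary = kmeans(windows(signal, width=32, step=8), k=4)
print(len(dictionary), "shapes of length", len(dictionary[0]))
```

Each centroid is one prototype shape; a new window is then approximated by its best-matching centroid, and the part that cannot be matched is the reconstruction error.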
Most Common Shapes (for EKG)
Reconstructed signal
[Figure: panels show the original signal, the reconstructed signal, and the reconstruction error; the error is < 1 bit / sample.]
An Anomaly
The original technique for finding 1-d anomalies works on the reconstruction error
Close-up of anomaly
Not what you want your
heart to do.
And not what the model
expects it to do.
A Different Kind of Anomaly
Model Delta Anomaly Detection
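[Diagram: the model's output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm.]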
The Real Inside Scoop
•  The model-delta anomaly detector is really just a sum of random
variables
–  the model we know about already
–  and a normally distributed error
•  The output (delta) is (roughly) the log probability of the sum
distribution (really δ²)
•  Thinking about probability distributions is good
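A short derivation makes the log-probability remark concrete. It assumes the error term is Gaussian with standard deviation σ; the symbol σ is an assumption introduced here (the earlier slide writes the noise as N(0, ε)).

```latex
x = m(t) + e, \qquad e \sim \mathcal{N}(0, \sigma^2), \qquad \delta = x - m(t)
\\[4pt]
-\log p\bigl(x \mid m(t)\bigr)
  = \frac{\delta^2}{2\sigma^2} + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
```

Under that assumption the negative log probability is δ² up to a scale and an offset, so a threshold on δ² is equivalent to a threshold on how improbable the observation is under the model.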
Some k-means Caveats
•  But Eamonn Keogh says that k-means can’t work on time-series
•  That is silly … and kind of correct, k-means does have limits
–  Other kinds of auto-encoders are much more powerful
•  More fun and code demos at
–  https://github.com/tdunning/k-means-auto-encoder
http://www.cs.ucr.edu/~eamonn/meaningless.pdf
Eamonn Keogh and Jessica Lin, "Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research", Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521, {eamonn, jessica}@cs.ucr.edu
Abstract: Given the recent explosion of interest in streaming data and online algorithms, clustering of time series subsequences, extracted via a sliding window, has received much attention. In this work we make a surprising claim. Clustering of time series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative …
The Limits of Clustering as Auto-encoder
•  Clustering is like trying to tile your sample distribution
•  Can be used to approximate a signal
•  Filling a d-dimensional region with k clusters should give
ε ≈ 1 / k^{1/d}
•  If d is large, this is no good
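A little arithmetic shows why large d is a problem. The sketch below reads the rule of thumb as ε ≈ 1 / k^(1/d), the d-th root of k, which matches the cube-root plot later in the deck; the function names and the sample values of k, d, and target error are illustrative assumptions, and constant factors are ignored.

```python
def approx_error(k, d):
    """Rule-of-thumb reconstruction error for k clusters in d dimensions."""
    return 1.0 / k ** (1.0 / d)


def clusters_needed(target_error, d):
    """Invert the rule of thumb: k = (1 / error) ** d."""
    return (1.0 / target_error) ** d


for d in (1, 3, 10, 30):
    error = round(approx_error(1000, d), 3)
    needed = f"{clusters_needed(0.1, d):.0e}"
    print(f"d={d}: error with 1000 clusters ~ {error}, "
          f"clusters for error 0.1 ~ {needed}")
```

Halving the error multiplies the required number of clusters by 2^d, so the cost of accuracy grows exponentially with the dimension.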
[Figure: time series training data (first 2000 samples) plotted against time; further panels show test data, reconstruction, and error.]
[Figure: reconstruction error for time-series data. x-axis: centroids (0 to 2000); y-axis: MAV error (0.00 to 0.15); series: training data and held-out data.]
Another Example
•  Take points randomly in , project non-linearly into
•  Approximation using clustering should give
[Figure: reconstruction error for random points. x-axis: centroids (0 to 2000); y-axis: error (0.0 to 2.0); series: training data and held-out data.]
Error is approximately cube root of k
[Figure: error (0.0 to 2.0) against k (0 to 2000); series: actual and cube root model.]
Moral For Auto-encoders
•  The simplest auto-encoders can be good models
•  For more complex spaces/signals, more elaborate models may
be required
–  Winner take (absolutely) all may be problematic
–  In particular, models that allow sparse linear combination may be better
•  Consider deep learning, recurrent networks, denoising
How Does Clustering Do Reconstruction?
[Diagram: input x1, x2, ..., xn-1, xn feeds a hidden layer (clusters), which produces the reconstruction x'1, x'2, ..., x'n-1, x'n.]
•  For normalized cluster centroids, dot-product and distance are equivalent
•  Winner takes all with k-means
•  Dot-product scales centroid to reconstruct
© 2014 MapR Technologies 52
AKA - Neural Network
[Diagram: input x1, x2, ..., xn-1, xn feeds a hidden layer (clusters), which produces the reconstruction x'1, x'2, ..., x'n-1, x'n.]
What If … We Had More Layers?
[Diagram: a network with several layers; nodes are labeled A, B, and A'.]
Other Thoughts
•  What if we allow more than one cluster to be active?
–  k-sparse learning!
•  Well, almost
Summary
•  Start with philosophy
–  Anomaly detection is finding normal, then finding discrepancy
•  Model the world with probabilities
–  Realistic probabilistic models and statistical inference are optimal
•  Very simple techniques can extend easily to very fancy ones
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Thank you for coming today!



Find my slides & other related materials to this talk here:
bit.ly/big-data-science-june-2016
or search:
…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course
discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache
Hadoop and Spark professionals
community.mapr.com

Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Mathematical bridges From Old to New

  • 1. © 2014 MapR Technologies, Ted Dunning
  • 2. Steps in Anomaly Detection
      •  Build a model: collect and process data for training a model
      •  Use the machine learning model to determine what the normal pattern is
      •  Decide how far from this normal pattern a value must be to count as anomalous
      •  Use the anomaly detection (AD) model to detect anomalies in new data
          –  Methods such as clustering for discovery can be helpful
  • 3. How hard is it to set an alert for anomalies? Grey data is from normal events; x’s are anomalies. Where would you set the threshold?
  • 4. Basic idea: set adaptive thresholds
  • 5. What Are We Really Doing
      •  We want action when something breaks (dies/falls over/otherwise gets in trouble)
      •  But action is expensive
      •  So we don’t want too many false alarms
      •  And we don’t want too many false negatives
      •  What’s the right threshold to set for alerts?
          –  We need to trade off costs
  • 6. A Second Look
  • 7. A Second Look: the 99.9%-ile
  • 8. Cool algorithm: t-digest
  • 9. How Hard Can it Be? [Diagram: each value x feeds an online summarizer that tracks the 99.9%-ile as threshold t; if x > t, raise an alarm.]
  • 10. Using t-Digest
      •  The t-digest is an on-line percentile estimator
          –  very high accuracy for extreme tails
      •  t-digest is also widely available
          –  in ElasticSearch, in Solr
          –  in streamlib (open source library on github)
          –  in Mahout Math (open source library on github)
          –  standalone (github and Maven Central)
      •  Very handy for general distributions, few assumptions
      •  For latency, exponential binning may be useful
          –  See, for instance, HdrHistogram
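
The alarm loop on slides 9 and 10 can be sketched as follows. This is a minimal illustration, not the t-digest itself: `ExactQuantileSummary` is a hypothetical stand-in that keeps every sample in a sorted list, whereas a real t-digest keeps a small bounded sketch. The names `detect`, `warmup` and the synthetic data are assumptions made for the example.

```python
import bisect
import random


class ExactQuantileSummary:
    """Hypothetical stand-in for an online summarizer such as t-digest.

    It stores every sample in sorted order, so memory grows without bound;
    a real t-digest answers the same question from a compact sketch.
    """

    def __init__(self):
        self._sorted = []

    def add(self, x):
        bisect.insort(self._sorted, x)

    def quantile(self, q):
        if not self._sorted:
            raise ValueError("no samples yet")
        idx = min(len(self._sorted) - 1, int(q * len(self._sorted)))
        return self._sorted[idx]


def detect(stream, q=0.999, warmup=1000):
    """Yield (index, x, t) whenever x exceeds the running q-quantile t."""
    summary = ExactQuantileSummary()
    for i, x in enumerate(stream):
        if i >= warmup:  # wait until the tail estimate is meaningful
            t = summary.quantile(q)
            if x > t:
                yield i, x, t
        summary.add(x)  # the threshold adapts as data arrives


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
data[4000] = 25.0  # injected anomaly
alarms = list(detect(data))
alarm_indices = [i for i, _, _ in alarms]
```

Because the threshold is the 99.9%-ile of the data itself, roughly one normal point in a thousand still raises an alarm; the percentile is the knob that trades false alarms against missed anomalies. Note that this sketch also feeds alarmed values back into the summary, which is a design choice rather than something the slides specify.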
  • 11. So are we all done?
  • 12. What About This? [Plot: offset + noise + pulse1 + pulse2, with two points marked A and B]
  • 13. Model Delta Anomaly Detection [Diagram: the model’s output is subtracted from the input to give δ; an online summarizer tracks the 99.9%-ile of δ as threshold t; if δ > t, raise an alarm.]
  • 14. Spot the Anomaly. Anomaly?
  • 15. Maybe not!
  • 16. Where’s Waldo? This is the real anomaly
  • 17. Normal Isn’t Just Normal
      •  What we want is a model of what is normal
      •  What doesn’t fit the model is the anomaly
      •  For simple signals, the model can be simple: x ~ m(t) + N(0, ε)
      •  The real world is rarely so accommodating
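
A minimal sketch of the model-delta idea, assuming the simple signal model x ~ m(t) + N(0, ε) from this slide. The sine-wave `model`, the noise level and the injected pulse are all hypothetical choices for illustration; the threshold is the 99.9%-ile of the residual |δ| on anomaly-free training data.

```python
import math
import random


def model(t):
    """Hypothetical model of normal behaviour m(t): a clean periodic signal."""
    return 5.0 * math.sin(2.0 * math.pi * t / 50.0)


def quantile(values, q):
    """Empirical quantile from a full sort (fine for a small example)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


rng = random.Random(7)
noise_sd = 0.2  # assumed noise level, playing the role of epsilon

# Training data: x ~ m(t) + N(0, eps), with no anomalies.
train = [model(t) + rng.gauss(0.0, noise_sd) for t in range(2000)]
train_delta = [abs(x - model(t)) for t, x in enumerate(train)]
threshold = quantile(train_delta, 0.999)

# New data from the same process, plus one small pulse where m(t) is near zero.
test = [model(t) + rng.gauss(0.0, noise_sd) for t in range(2000, 2200)]
test[100] += 2.0  # raw value stays well inside the signal's normal range

# Alarm on the residual delta = x - m(t), not on the raw value.
flagged = [i for i, x in enumerate(test) if abs(x - model(2000 + i)) > threshold]
```

The pulse is invisible to a threshold on the raw value, because the signal routinely swings much further than 2.0, but it stands out clearly once the model's prediction has been subtracted.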
  • 18–32. We Do Windows
  • 33. Windows on the World
      •  The set of windowed signals is a nice model of our original signal
      •  Clustering can find the prototypes
          –  Fancier techniques available using sparse coding
      •  The result is a dictionary of shapes
      •  New signals can be encoded by shifting, scaling and adding shapes from the dictionary
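
The windowing and clustering steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code from the talk: it uses non-overlapping rectangular windows, plain Lloyd's k-means with a farthest-point initialization (chosen here only to make the example deterministic), and nearest-shape reconstruction without the shifting, scaling and adding described on the slide. All function names are hypothetical.

```python
import math
import random


def windows(signal, width):
    """Cut the signal into consecutive, non-overlapping windows."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, width)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_farthest(points, k):
    """Deterministic start: repeatedly add the point farthest from all picks."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the k centroids form the dictionary of shapes."""
    centroids = init_farthest(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def reconstruct(signal, centroids, width):
    """Replace each window by its nearest dictionary shape."""
    out = []
    for w in windows(signal, width):
        out.extend(min(centroids, key=lambda c: sq_dist(w, c)))
    return out


rng = random.Random(1)
width = 16
signal = [math.sin(2 * math.pi * t / 64) + rng.gauss(0, 0.05) for t in range(64 * 40)]
dictionary = kmeans(windows(signal, width), k=4)
approx = reconstruct(signal, dictionary, width)
rms_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(signal, approx)) / len(approx))
```

On this synthetic sine wave the four learned shapes are the four quarter-periods, and the reconstruction error is close to the injected noise level. A window the dictionary cannot reproduce shows up as a large reconstruction error, which is the quantity the percentile threshold is then applied to.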
  • 34. Most Common Shapes (for EKG)
  • 35. Reconstructed signal [Plot: original signal and reconstructed signal; reconstruction error < 1 bit / sample]
  • 36. An Anomaly. The original technique for finding 1-d anomalies works against the reconstruction error
  • 37. Close-up of anomaly. Not what you want your heart to do. And not what the model expects it to do.
  • 38. A Different Kind of Anomaly
  • 39. Model Delta Anomaly Detection [Diagram repeated from slide 13]
  • 40. The Real Inside Scoop
      •  The model-delta anomaly detector is really just a sum of random variables
          –  the model we know about already
          –  and a normally distributed error
      •  The output (delta) is (roughly) the log probability of the sum distribution (really δ²)
      •  Thinking about probability distributions is good
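
The link between δ² and log probability can be made explicit with a short worked step. This assumes the signal model from slide 17 with ε read as the standard deviation of the Gaussian error, which the slide does not state outright.

```latex
% Assumption: x = m(t) + e with e ~ N(0, \varepsilon^2), and \delta = x - m(t).
\begin{align*}
p(x \mid t) &= \frac{1}{\varepsilon\sqrt{2\pi}}
               \exp\!\left(-\frac{(x - m(t))^2}{2\varepsilon^2}\right) \\
-\log p(x \mid t) &= \frac{\delta^2}{2\varepsilon^2} + \log\!\left(\varepsilon\sqrt{2\pi}\right)
\end{align*}
```

Since the second term is a constant, a threshold on δ² (or equivalently on |δ|) is the same as a threshold on the log probability of the observation under the model.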
  • 41. Some k-means Caveats
      •  But Eamonn Keogh says that k-means can’t work on time series
      •  That is silly … and kind of correct: k-means does have limits
          –  Other kinds of auto-encoders are much more powerful
      •  More fun and code demos at
          –  https://github.com/tdunning/k-means-auto-encoder
      [Paper shown on slide: Eamonn Keogh and Jessica Lin, “Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research”, Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521, {eamonn, jessica}@cs.ucr.edu, http://www.cs.ucr.edu/~eamonn/meaningless.pdf. Abstract as shown, cut off on the slide: “Given the recent explosion of interest in streaming data and online algorithms, clustering of time series subsequences, extracted via a sliding window, has received much attention. In this work we make a surprising claim. Clustering of time series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative”]
  • 42. The Limits of Clustering as Auto-encoder
      •  Clustering is like trying to tile your sample distribution
      •  Can be used to approximate a signal
      •  Filling a d-dimensional region with k clusters should give ε ≈ 1 / k^{1/d}
      •  If d is large, this is no good
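
A short worked example shows why large d is a problem. It reads the slide's formula as ε ≈ k^(-1/d), which is an assumption: the extracted text is garbled, but this reading agrees with the cube-root model on slide 47.

```latex
% Assumption: \varepsilon \approx k^{-1/d}, so the number of clusters needed
% for a target error \varepsilon is k \approx \varepsilon^{-d}.
\begin{align*}
k &\approx \varepsilon^{-d} \\
\varepsilon = 0.1,\ d = 2  &\;\Rightarrow\; k \approx 10^{2}  \\
\varepsilon = 0.1,\ d = 3  &\;\Rightarrow\; k \approx 10^{3}  \\
\varepsilon = 0.1,\ d = 10 &\;\Rightarrow\; k \approx 10^{10}
\end{align*}
```

Each added dimension multiplies the number of centroids needed for the same error by 1/ε, so pure tiling quickly becomes impractical.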
  • 43. [Plot: Time series training data (first 2000 samples); x-axis: Time; legend: Test data, Reconstruction, Error]
  • 44. [Plot: Reconstruction error for time-series data; x-axis: Centroids; y-axis: MAV Error; legend: Training data, Held-out data]
  • 45. Another Example
      •  Take points randomly in , project non-linearly into
      •  Approximation using clustering should give
  • 46. [Plot: Reconstruction error for random points; x-axis: Centroids; y-axis: Error; legend: Training data, Held-out data]
  • 47. [Plot: Error is approximately cube root of k; x-axis: k; y-axis: Error; legend: Actual, Cube root model]
  • 48. Moral For Auto-encoders
      •  The simplest auto-encoders can be good models
      •  For more complex spaces/signals, more elaborate models may be required
          –  Winner take (absolutely) all may be problematic
          –  In particular, models that allow sparse linear combination may be better
      •  Consider deep learning, recurrent networks, denoising
  • 49. How Does Clustering Do Reconstruction? [Diagram: input x1, x2, ..., xn-1, xn] For normalized cluster centroids, dot-product and distance are equivalent
  • 50. How Does Clustering Do Reconstruction? [Diagram: input x1, x2, ..., xn-1, xn] Winner takes all with k-means
  • 51. How Does Clustering Do Reconstruction? [Diagram: input x1 ... xn, hidden layer (clusters), reconstruction x'1 ... x'n] Dot-product scales centroid to reconstruct
  • 52. AKA - Neural Network [Diagram: input x1 ... xn, hidden layer (clusters), reconstruction x'1 ... x'n]
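
The three steps on slides 49 to 51 can be written as a tiny encode/decode pair. This is a sketch with hypothetical names and hand-picked centroids, not code from the talk. For unit-length centroids c, ||x - c||² = ||x||² - 2 x·c + 1, so the largest dot product and the smallest distance pick the same centroid.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(x, centroids):
    """Hidden layer: one dot-product activation per centroid, winner takes all."""
    activations = [dot(x, c) for c in centroids]
    winner = max(range(len(centroids)), key=lambda j: activations[j])
    return winner, activations[winner]


def decode(winner, activation, centroids):
    """Output layer: the winning unit-length centroid scaled by its activation."""
    return [activation * c for c in centroids[winner]]


# Hand-picked centroids for illustration; in practice they come from clustering.
centroids = [normalize(c) for c in ([1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0])]
x = [0.1, 2.0, 1.9]
winner, activation = encode(x, centroids)
x_hat = decode(winner, activation, centroids)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
```

Seen this way, the centroids are the weights of a single hidden layer, the argmax is the non-linearity, and the same weights are reused to produce the reconstruction, which is why the slide calls the arrangement a neural network.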
  • 53. What If … We Had More Layers? [Diagram: multi-layer network with layers labelled A, B, A']
  • 54–57. Other Thoughts
      •  What if we allow more than one cluster to be active?
          –  k-sparse learning!
      •  Well, almost
  • 58. Summary
      •  Start with philosophy
          –  Anomaly detection is finding normal, then finding discrepancy
      •  Model the world with probabilities
          –  Realistic probabilistic models and statistical inference are optimal
      •  Very simple techniques can extend easily to very fancy ones
  • 59. e-book available courtesy of MapR: A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman, © June 2014 (published by O’Reilly). http://bit.ly/1jQ9QuL
  • 60. 6 Free ebooks. Read online: mapr.com/6ebooks-read. Download pdfs: mapr.com/6ebooks-pdf. [Cover text: Streaming Architecture, Ted Dunning & Ellen Friedman, and MapR Streams]
  • 61. Thank you for coming today!
  • 62. Find my slides & other materials related to this talk here: bit.ly/big-data-science-june-2016 or search:
  • 63. …helping you put data technology to work. Connect with fellow Apache Hadoop and Spark professionals at community.mapr.com
      ●  Find answers
      ●  Ask technical questions
      ●  Join on-demand training course discussions
      ●  Follow release announcements
      ●  Share and vote on product ideas
      ●  Find Meetup and event listings