Splunk live! Customer Presentation – Prelert

Anomaly Detection using Machine Learning
Predictive Analytics
the anomaly detection company

Terminology
• Machine-learning
 Autonomous self-learning without the assistance of humans
(unsupervised learning)
• Predictive Analytics
 Probabilistic prediction of behavior based upon observed past
behavior
• Anomaly Detection
 What’s “different” or “weird” versus what’s “good” or “bad”

Q: What’s Interesting Here?

A: Only What’s Behaving Abnormally

Anomaly Detection - an Analogy
• How could I accurately predict how much postal mail will be delivered to your home tomorrow?
• And how would I know if the amount you received was “abnormal”?

A practical methodology would involve…
• First, determine what’s normal before I can declare what’s abnormal
• Watch your mail delivery volume for a while…
 1 day?
 1 week?
 1 month?
• Notice that you intuitively expect your predictions to become more accurate the more data you see.
• Ideally, use those observations to create a…

Probability Distribution Function
[Figure: probability distribution. x-axis: pieces of mail per day; y-axis: % likelihood (probability). Slide labels: “Best for my house”, “College Student?”, “My Mom”.]
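As a rough illustration of turning observations into a probability distribution, the sketch below estimates an empirical distribution from a short list of daily mail counts. The function name `empirical_pmf` and the sample counts are made up for illustration; they are not from the presentation.

```python
from collections import Counter

def empirical_pmf(daily_counts):
    """Estimate {count: probability} from observed daily counts."""
    total = len(daily_counts)
    tally = Counter(daily_counts)
    return {k: tally[k] / total for k in sorted(tally)}

# Hypothetical observations: pieces of mail delivered on 10 days.
observed = [3, 4, 2, 3, 5, 3, 4, 3, 2, 4]
pmf = empirical_pmf(observed)

for pieces, probability in pmf.items():
    print(f"{pieces} pieces: {probability:.0%}")
```

With only ten days of data the estimate is coarse; watching for longer fills in the shape of the curve, which is the intuition behind "more data, better predictions".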

Finding “what’s unexpected”…
Your job is often looking for unexpected change in
your environment, either proactively through
monitoring or reactively through
diagnostics/troubleshooting

Using the PDF to Find What Is Unexpected
[Figure: the distribution annotated at two points: zero pieces of mail? fifteen pieces of mail?]
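One way to put a number on "how unexpected" is to ask how likely a count at least that extreme would be under the fitted model. The sketch below assumes, purely for illustration, that daily mail follows a Poisson distribution with a hypothetical average of 3.3 pieces per day; the presentation does not say which model applies to the mail example.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def tail_probability(k, lam):
    """Probability of a count at least as extreme as k:
    lower tail P(X <= k) when k is below the mean, else upper tail P(X >= k)."""
    lower = sum(poisson_pmf(i, lam) for i in range(k + 1))
    if k < lam:
        return lower
    return 1.0 - lower + poisson_pmf(k, lam)

AVERAGE_PER_DAY = 3.3  # hypothetical average, not from the slides

for pieces in (0, 3, 15):
    p = tail_probability(pieces, AVERAGE_PER_DAY)
    print(f"{pieces:>2} pieces of mail: tail probability {p:.6f}")
```

Under these assumptions, zero pieces is unusual but plausible, while fifteen pieces is far out in the tail. Both are "unexpected", to very different degrees.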

Relate back to IT and Security data
• # Pieces of mail = # events of a certain type
 Number of failed logins
 Number of errors of different types
 Number of events with certain status codes
 Etc.
• Or, performance metrics
 Response time
 Utilization %
=> Every kind of data will need its own unique “model” (probability
distribution function)

Do You Know How to Accurately Model?
• Which one(s) models your data best?
• You will want to get it right
• avg +/- 2 stdev assumes a Gaussian (Normal) distribution!
Figure source: “Doing Data Science”, O’Neil & Schutt

Gaussian (“Normal”) Distribution

Non-Gaussian Data
status=503
status=404
CPU load
Memory Utilization
Revenue Transactions

Standard Deviations – Not so Good
33,000+ performance metrics analyzed using +/- 2.5σ
[Figure: total number of alerts per hour (y-axis 0 to 7000), 28 Feb 00:00 to 03 Mar 12:00]
• Never less than 900 alerts per hour
• Real outage (circled) overshadowed by ~6000 extraneous alerts
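To see why a mean-and-standard-deviation rule struggles on non-Gaussian data, the sketch below applies an "avg +/- kσ" band to synthetic, right-skewed (log-normal) values. The data is generated for illustration only and is not the 33,000-metric data set from the slide; `sigma_alerts` is a hypothetical helper name.

```python
import random
import statistics

def sigma_alerts(values, k=2.0):
    """Count values outside mean +/- k standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    low, high = mean - k * stdev, mean + k * stdev
    n_low = sum(v < low for v in values)
    n_high = sum(v > high for v in values)
    return low, high, n_low, n_high

# Synthetic, strictly positive, right-skewed data (e.g. response times).
rng = random.Random(0)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

low, high, n_low, n_high = sigma_alerts(skewed)
print(f"band: [{low:.2f}, {high:.2f}]")
print(f"alerts below band: {n_low}, alerts above band: {n_high}")
```

On data like this the lower bound falls below zero, so the rule can never flag an unusually low value, while the long right tail produces a steady stream of "high" alerts from perfectly ordinary behavior.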

Don’t worry, we have you covered
• Prelert uses sophisticated machine-learning techniques to best-fit the right statistical model for your data.
• Better models = better outlier detection = fewer false alarms

Kinds of Anomalies Detected
Deviations in event count vs. time
Deviations in values vs. time
Rare occurrences of things
Population/Peer outliers

#1) Deviations in Event Counts/Rates
• Use Case: Online Commerce Site
 Cyclical online ordering volume (credit cards, etc.)
 Service outage on May 10th: orders were not being processed, causing a dip in afternoon volume

Hard to automatically detect because…
• Tricky to catch with thresholds because overall count didn’t dip below low watermark
• Output of Splunk “predict”:

Prelert finds the anomaly perfectly
• No extraneous false alarms
• Despite the inherent challenges of the periodic nature of the data
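The sketch below shows one simple way to handle periodic data: compare each hourly count with the typical value for the same hour of day, instead of with a single global threshold. It uses a median and median absolute deviation (MAD) per hour slot. This is a stand-in technique for illustration, not a description of Prelert's algorithm, and the counts are synthetic.

```python
import statistics
from collections import defaultdict

def seasonal_anomalies(counts, period=24, threshold=5.0):
    """Flag indices whose value is far from the median of the same
    position in the cycle, measured in units of MAD."""
    by_slot = defaultdict(list)
    for i, c in enumerate(counts):
        by_slot[i % period].append(c)
    flagged = []
    for i, c in enumerate(counts):
        slot = by_slot[i % period]
        med = statistics.median(slot)
        # Fall back to 1.0 if the slot has no spread at all.
        mad = statistics.median(abs(x - med) for x in slot) or 1.0
        if abs(c - med) / mad > threshold:
            flagged.append(i)
    return flagged

# Synthetic week of hourly order counts: quiet nights, busy days.
profile = [50] * 8 + [300] * 10 + [120] * 6
counts = [profile[h] + (d % 3) - 1 for d in range(7) for h in range(24)]

# Inject an afternoon dip on the last day (hour 14).
dip_index = 6 * 24 + 14
counts[dip_index] = 150

print(seasonal_anomalies(counts))
print("dip value:", counts[dip_index], "overall minimum:", min(counts))
```

The dipped value is still well above the overnight minimum, so a static low watermark would miss it, yet it stands out clearly against other afternoons.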

#2) Deviations in Performance Metrics
• Use Case: Online travel portal
• Makes web services calls to airlines for fare quotes
• Each airline responds to fare request with its own typical response
time (20 airlines):

• Tricky to construct unique thresholds for each airline individually
• Cannot do “avg +/- 2σ” because it is too noisy for this kind of data
• Splunk’s “predict” doesn’t support splitting results out with a by clause (“by airline”)

• Only 1 of the many airlines is having an issue
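The idea of a separate baseline per airline can be sketched as follows: each group gets its own median and MAD, so a response time that is normal for one airline can still be flagged for another. The airline names, response times, and the median/MAD rule are illustrative assumptions, not the method used in the presentation.

```python
import statistics

def per_group_outliers(samples, threshold=5.0):
    """samples: {group: [values]}. Returns {group: [outlying values]},
    judging each value only against its own group's median and MAD."""
    result = {}
    for group, values in samples.items():
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1e-9
        outliers = [v for v in values if abs(v - med) / mad > threshold]
        if outliers:
            result[group] = outliers
    return result

# Hypothetical fare-quote response times in milliseconds.
response_ms = {
    "airline_a": [120, 118, 125, 122, 119, 121],
    "airline_b": [900, 880, 910, 905, 895, 890],
    "airline_c": [300, 310, 295, 305, 1500, 302],
}

print(per_group_outliers(response_ms))
```

Here 900 ms is routine for airline_b and is not flagged, whereas 1500 ms is far outside airline_c's usual range.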

#3) Rare Items as Anomalies
• Use Case: Security team @ services company
• Wanted to profile typical processes on each host using netstat
• Goal was to identify rare processes that “start up and communicate”
for each host, individually

• Each host has its own separate “set” of typical processes that are potentially unique
• e.g., FTP may run routinely on server A, but never on server B
• Maintaining a running list of “typical processes” across hundreds of servers is not practical
• Splunk’s “rare” command is not truly a rarity measurement, just “least occurring”

• Finds FTP process running for 3 hours on system that doesn’t normally run FTP
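A minimal sketch of "rare for this host" versus "rare overall": count how often each process appears on each host and flag processes that make up only a small fraction of that host's own observations. The host names, event counts, and the 5% cutoff are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def rare_per_host(events, max_fraction=0.05):
    """events: iterable of (host, process) observations.
    Returns sorted (host, process) pairs that are rare for that host."""
    per_host = defaultdict(Counter)
    for host, process in events:
        per_host[host][process] += 1
    rare = []
    for host, procs in per_host.items():
        total = sum(procs.values())
        for process, n in procs.items():
            if n / total <= max_fraction:
                rare.append((host, process))
    return sorted(rare)

# Hypothetical netstat-style observations.
events = (
    [("server_a", "ftp")] * 40 + [("server_a", "sshd")] * 60
    + [("server_b", "sshd")] * 97 + [("server_b", "ftp")] * 3
)

print(rare_per_host(events))
```

FTP is common on server_a, so a global "least occurring" ranking would not single it out; judged per host, the few FTP observations on server_b are the ones that stand out.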

#4) Population / Peer Outliers
• Use Case: Proxy log data
 Need to determine which users/systems are sending
out requests/data much differently than the others

• Peer analysis is impossible without Prelert

• One particular host sending many requests (20,000/hr) to an IIS webserver
• This is an attempt to hack the webserver
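Peer analysis can be approximated very simply by comparing each member of a population with what is typical for the group. The sketch below flags hosts whose hourly request rate exceeds a multiple of the population median. The host names, rates, and the factor of 10 are illustrative assumptions; only the 20,000 requests/hour figure comes from the slide.

```python
import statistics

def peer_outliers(rate_by_entity, factor=10.0):
    """Return entities whose rate exceeds factor x the population median."""
    typical = statistics.median(rate_by_entity.values())
    return sorted(
        name for name, rate in rate_by_entity.items()
        if rate > factor * typical
    )

# Hypothetical requests per hour, taken from proxy logs.
requests_per_hour = {
    "host01": 180, "host02": 220, "host03": 150, "host04": 300,
    "host05": 210, "host06": 190, "host07": 20000,
}

suspects = peer_outliers(requests_per_hour)
print(suspects)
```

The median is used because a single extreme host barely moves it, whereas it would drag a mean upward and partly hide itself.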

Anomaly Detective App
• Free to download and try – 100% native Splunk app
• Easy to use – “push button anomaly detection”
• More powerful anomaly detection than Splunk on its own
• Scalable for big data sets
http://goo.gl/KJY9B

Bonus – Anomaly Cross-Correlation
• Use Case: Retail company with flaky POS application (gift card
redemption)
 App occasionally disconnects from DB
 Team suspects either a DB or a network problem, but hard to find cause
• Prelert configured to run anomaly detection across 3 data types
simultaneously
 App logs (unstructured) – count by dynamic message type
 SQL Server performance metrics
 Network performance metrics

Result: Instant Answers
• Symptom: Sudden influx of DB errors in log
• Symptom: Drop in SQL Server client connections
• Cause: Network spike and TCP discards
