2. Terminology
• Machine-learning
Autonomous self-learning without the assistance of humans
(unsupervised learning)
• Predictive Analytics
Probabilistic prediction of behavior based upon observed past
behavior
• Anomaly Detection
What’s “different” or “weird” versus what’s “good” or “bad”
5. Anomaly Detection - an Analogy
• How could I accurately predict how much postal mail is likely to be delivered to your home tomorrow?
• And, how would I know if the amount you received was “abnormal”?
6. A practical methodology would involve…
• First, determine what’s normal before I can declare what’s abnormal
• Watch your mail delivery volume for a while…
1 day?
1 week?
1 month?
• Notice that you intuitively expect your predictions to become more accurate the more data you see.
• Ideally, use those observations to create a…
11. Finding “what’s unexpected”…
Your job is often looking for unexpected change in your environment, either proactively through monitoring or reactively through diagnostics/troubleshooting
12. Using the PDF to Find What is Unexpected
[Chart: probability distribution with pieces of mail per day on the x-axis and % likelihood (probability) on the y-axis]
Zero pieces of mail?
Fifteen pieces of mail?
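The idea behind this chart can be sketched in a few lines of Python. The daily counts are made up for illustration, and using raw observed frequencies as the model is an assumption; the slides do not say which model is fitted.

```python
from collections import Counter

# Hypothetical observations: pieces of mail received on each of 20 days.
observed = [3, 4, 2, 5, 3, 4, 3, 6, 4, 3, 2, 4, 5, 3, 4, 3, 4, 5, 3, 4]

def empirical_pmf(samples):
    """Return {value: fraction of days on which that value was observed}."""
    counts = Counter(samples)
    total = len(samples)
    return {value: n / total for value, n in counts.items()}

def likelihood(pmf, value):
    """Probability of seeing `value`; 0.0 if it was never observed."""
    return pmf.get(value, 0.0)

pmf = empirical_pmf(observed)
for pieces in (0, 4, 15):
    print(f"{pieces:>2} pieces of mail: {likelihood(pmf, pieces):.0%} likely")
```

Both zero and fifteen pieces score as unlikely here because neither was ever observed. A raw frequency table cannot tell "rare" from "never seen yet", which is one reason a fitted distribution is usually preferred.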
13. Relate back to IT and Security data
• # Pieces of mail = # events of a certain type
Number of failed logins
Number of errors of different types
Number of events with certain status codes
Etc.
• Or, performance metrics
Response time
Utilization %
=> Every kind of data will need its own unique “model” (probability
distribution function)
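As a rough sketch of the "pieces of mail = events" mapping, raw events can be bucketed into counts per hour and per event type, giving one series to model per type. The event records and the `hourly_counts` helper are hypothetical, not taken from any product.

```python
from collections import Counter
from datetime import datetime

# Hypothetical events: (ISO timestamp, event type).
events = [
    ("2014-03-01T09:05:00", "failed_login"),
    ("2014-03-01T09:40:00", "failed_login"),
    ("2014-03-01T09:59:00", "error"),
    ("2014-03-01T10:10:00", "failed_login"),
]

def hourly_counts(events):
    """Count events per (hour, event type) bucket."""
    counts = Counter()
    for timestamp, event_type in events:
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour, event_type)] += 1
    return counts

counts = hourly_counts(events)
for (hour, event_type), n in sorted(counts.items()):
    print(hour.isoformat(), event_type, n)
```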
14. Do You Know How to Accurately Model?
• Which one(s) model your data best?
• You will want to get it right
source: “Doing Data Science”, O’Neil & Schutt
avg +/- 2 stdev assumes Gaussian (Normal) Distribution!
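A small sketch of why this assumption matters. The response times below are invented to be non-negative and right-skewed, which is an assumption about what typical performance data looks like.

```python
import statistics

# Hypothetical response times in ms: non-negative and right-skewed.
response_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 900, 14, 12]

mean = statistics.mean(response_ms)
stdev = statistics.pstdev(response_ms)  # population standard deviation
lower, upper = mean - 2 * stdev, mean + 2 * stdev

# Values that the "avg +/- 2 stdev" rule would flag.
flagged = [x for x in response_ms if x < lower or x > upper]

print(f"band: {lower:.1f} ms to {upper:.1f} ms")
print("flagged:", flagged)
```

The lower bound comes out negative, so a response time can never breach it. The two slow responses also inflate the standard deviation so much that the 250 ms value, roughly twenty times the typical one, is not flagged.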
17. Standard Deviations – Not so Good
33,000+ performance metrics analyzed using +/- 2.5σ
[Chart: Total # Alerts (y-axis, 0 to 7000) over time (28 Feb 00:00 to 03 Mar 12:00)]
• Never fewer than 900 alerts per hour
• Real outage (circled) overshadowed by ~6000 extraneous alerts
18. Don’t worry, we have you covered
• Prelert uses sophisticated machine-learning techniques to best-fit the right statistical model for your data.
• Better models = better outlier detection = fewer false alarms
20. Kinds of Anomalies Detected
Deviations in event count vs. time
Deviations in values vs. time
Rare occurrences of things
Population/Peer outliers
21. #1) Deviations in Event Counts/Rates
• Use Case: Online Commerce Site
Cyclical online ordering volume (credit cards, etc.)
Service outage on May 10th: orders not being processed, causing a dip in afternoon volume
22. Hard to automatically detect because…
• Tricky to catch with thresholds because overall count didn’t dip below low watermark
• Output of Splunk “predict”:
23. Prelert finds the anomaly perfectly
• No extraneous false alarms
• Despite the inherent challenges of the periodic nature of the data
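To illustrate why periodic data defeats a static threshold, here is a toy comparison of a fixed low watermark against a per-hour-of-day baseline. This is not Prelert's algorithm, which these slides do not describe. The counts, the watermark, and the 50% tolerance are all invented for illustration.

```python
import statistics

# Hypothetical hourly order counts for the same hour on five prior days.
history = {
    3: [40, 42, 38, 41, 39],        # 03:00 is normally quiet
    15: [400, 410, 395, 405, 390],  # 15:00 is normally busy
}
LOW_WATERMARK = 30  # hypothetical static threshold, set below the nightly low

def is_anomalous(hour, count, tolerance=0.5):
    """Flag a count that differs from this hour's median by more than
    `tolerance` (as a fraction of the median)."""
    typical = statistics.median(history[hour])
    return abs(count - typical) > tolerance * typical

outage_count = 150  # hypothetical afternoon volume during the outage
static_alert = outage_count < LOW_WATERMARK
seasonal_alert = is_anomalous(15, outage_count)

print("static threshold alert:", static_alert)
print("hour-of-day baseline alert:", seasonal_alert)
```

The outage volume is still well above the nightly low, so the watermark stays silent. Comparing against what is typical for 15:00 catches the dip.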
24. #2) Deviations in Performance Metrics
• Use Case: Online travel portal
• Makes web services calls to airlines for fare quotes
• Each airline responds to a fare request with its own typical response time (20 airlines):
25. Hard to automatically detect because…
• Tricky to construct unique thresholds for each airline individually
• Cannot do “avg +/- 2σ” because it is too noisy for this kind of data
• Splunk’s “predict” doesn’t support splitting out results via a by clause (“by airline”)
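A minimal sketch of the "one baseline per airline" idea, using median and median absolute deviation (MAD) as a robust stand-in for avg and σ. The airline names, response times, and the multiplier `k` are hypothetical, and this is not presented as the method any product uses.

```python
import statistics

# Hypothetical recent response times (seconds) per airline.
history = {
    "airline_a": [1.0, 1.1, 0.9, 1.0, 1.2],
    "airline_b": [4.0, 4.2, 3.9, 4.1, 4.0],
}

def baseline(values):
    """Return (median, median absolute deviation) for one airline."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med, mad

# One baseline per airline, learned from that airline's own history.
baselines = {airline: baseline(values) for airline, values in history.items()}

def is_slow(airline, value, k=5.0):
    """Flag a response time well above this airline's own typical value."""
    med, mad = baselines[airline]
    return value > med + k * mad

print(is_slow("airline_a", 4.0))  # compared against airline_a's baseline
print(is_slow("airline_b", 4.0))  # compared against airline_b's baseline
```

The same 4.0 s response is slow for one airline and normal for the other, which is why a single shared threshold does not work. A production version would also need to handle a MAD of zero.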
26. Prelert finds the anomaly perfectly
• Only 1 of the many airlines is having an issue
27. #3) Rare Items as Anomalies
• Use Case: Security team @ services company
• Wanted to profile typical processes on each host using netstat
• Goal was to identify rare processes that “start up and communicate”
for each host, individually
28. Hard to automatically detect because…
• Each host has its own separate “set” of typical processes that are potentially unique
• e.g. FTP may routinely run on server A, but never runs on server B
• Maintaining a running list of “typical processes” across hundreds of servers is not practical
• Splunk “rare” command is not truly a rarity measurement, just “least occurring”
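The difference between "least occurring" overall and "rare for this host" can be sketched as follows. The observations, host names, and the 5% cutoff are hypothetical, and this is a simplified illustration rather than any product's rarity measure.

```python
from collections import Counter, defaultdict

# Hypothetical netstat observations: one (host, process) pair per polling interval.
observations = (
    [("server_a", "ftp")] * 50 + [("server_a", "sshd")] * 50
    + [("server_b", "sshd")] * 99 + [("server_b", "ftp")] * 1
)

def rarity_by_host(observations):
    """Return {host: {process: fraction of that host's observations}}."""
    per_host = defaultdict(Counter)
    for host, process in observations:
        per_host[host][process] += 1
    return {
        host: {proc: n / sum(counts.values()) for proc, n in counts.items()}
        for host, counts in per_host.items()
    }

def rare_processes(observations, max_fraction=0.05):
    """List (host, process) pairs that are rare relative to that host's own history."""
    return sorted(
        (host, proc)
        for host, fractions in rarity_by_host(observations).items()
        for proc, frac in fractions.items()
        if frac <= max_fraction
    )

print(rare_processes(observations))
```

Counted across all hosts, FTP is simply the least frequent process. Profiled per host, it is routine on server_a and rare only on server_b.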
29. Prelert finds the anomaly perfectly
• Finds FTP process running for 3 hours on a system that doesn’t normally run FTP
30. #4) Population / Peer Outliers
• Use Case: Proxy log data
Need to determine which users/systems are sending out requests/data much differently than the others
31. Hard to automatically detect because…
• Peer analysis is impossible without Prelert
32. Prelert finds the anomaly perfectly
• One particular host sending many requests (20,000/hr) to an IIS webserver
• This is an attempt to hack the webserver
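To make the notion of a peer outlier concrete, here is a toy example that compares each host against the rest of the population. Host names and request rates are invented, apart from the 20,000/hr figure above. The median/MAD rule and the multiplier `k` are assumptions for illustration, not a description of how Prelert does it.

```python
import statistics

# Hypothetical requests per hour, by source host, from proxy logs.
requests_per_hour = {
    "host01": 180, "host02": 220, "host03": 150,
    "host04": 205, "host05": 20000, "host06": 190,
}

def peer_outliers(values_by_entity, k=10.0):
    """Return entities whose value is far from the population median,
    measured in units of the median absolute deviation (MAD)."""
    values = list(values_by_entity.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return sorted(
        entity
        for entity, value in values_by_entity.items()
        if abs(value - med) > k * mad
    )

print(peer_outliers(requests_per_hour))
```

No per-host history is needed here: a host is judged only against what its peers are doing in the same period.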
33. Anomaly Detective App
• Free to download and try – 100% native Splunk app
• Easy to use – “push button anomaly detection”
• More powerful anomaly detection than Splunk on its own
• Scalable for big data sets
http://goo.gl/KJY9B
34. Bonus – Anomaly Cross-Correlation
• Use Case: Retail company with flaky POS application (gift card redemption)
App occasionally disconnects from DB
Team suspects either a DB or a network problem, but hard to find cause
• Prelert configured to run anomaly detection across 3 data types simultaneously
App logs (unstructured) – count by dynamic message type
SQL Server performance metrics
Network performance metrics
35. Result: Instant Answers
Symptom: Sudden influx of DB errors in log
Symptom: Drop in SQL Server client connections
Cause: Network spike and TCP discards
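A rough sketch of how anomalies from separate data types might be lined up in time. The timestamps, source names, and five-minute window are hypothetical, and the grouping rule is a simplification, not the product's correlation logic.

```python
from datetime import datetime, timedelta

# Hypothetical anomalies reported independently for each data type:
# (source, time detected, description).
anomalies = [
    ("network", datetime(2014, 3, 1, 14, 2), "spike and TCP discards"),
    ("app_logs", datetime(2014, 3, 1, 14, 3), "influx of DB errors"),
    ("sql_server", datetime(2014, 3, 1, 14, 4), "drop in client connections"),
    ("network", datetime(2014, 3, 1, 20, 30), "brief latency blip"),
]

def correlate(anomalies, window=timedelta(minutes=5)):
    """Group anomalies that start within `window` of a group's first anomaly,
    keeping only groups that span more than one data source."""
    groups = []
    for item in sorted(anomalies, key=lambda a: a[1]):
        if groups and item[1] - groups[-1][0][1] <= window:
            groups[-1].append(item)
        else:
            groups.append([item])
    return [g for g in groups if len({source for source, _, _ in g}) > 1]

correlated = correlate(anomalies)
for group in correlated:
    for source, when, description in group:
        print(when.isoformat(), source, description)
```

Ordering each group by time puts the earliest anomaly first, which is a useful hint about cause versus symptom but not proof of it.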
Editor’s Notes
Probability distributions of data come in all shapes and sizes – rarely do they fit a nice bell curve