Anomaly Detection using Spark MLlib and Spark Streaming
1. Anomaly Detection
Offline Training using Spark MLlib
Online Testing using Spark Streaming
Details: https://github.com/keiraqz/anomaly-detection
Keira Zhou, December 2015
2. The Model
The model is trained using the K-means approach (Spark MLlib KMeans)
It is trained on the "normal" dataset only
After training, the centroid of the "normal" dataset is returned, along with a threshold
During the validation stage, any data point that is further than the threshold from the centroid is considered an "anomaly" (a sketch of this rule follows)
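A minimal Scala sketch of that decision rule (assuming Euclidean distance, the metric K-means minimizes; the helper name isAnomaly is illustrative):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // A point is an anomaly if its Euclidean distance to the single
    // cluster centroid exceeds the learned threshold.
    def isAnomaly(point: Vector, centroid: Vector, threshold: Double): Boolean =
      math.sqrt(Vectors.sqdist(point, centroid)) > threshold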
3. Dataset
The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection [1]
Training Set: the training set is separated from the whole dataset, keeping only the data points labeled "normal" (a preparation sketch follows the reference below)
Validation Set: the validation set uses the whole dataset; all data points that are NOT labeled "normal" are considered "anomalies"
[1] KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
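A minimal data-preparation sketch under those definitions (the file path, helper name, and the choice to simply drop the three categorical columns are assumptions; in the KDD format the last field of each record is the label, e.g. "normal."):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Each KDD record is a comma-separated line whose last field is the
    // label ("normal." or an attack name). Columns 1-3 (protocol, service,
    // flag) are categorical and are dropped here for illustration.
    def prepare(sc: SparkContext, path: String): (RDD[Vector], RDD[(String, Vector)]) = {
      val labeled = sc.textFile(path).map(_.split(',')).map { fields =>
        val numeric = fields.init.zipWithIndex
          .collect { case (f, i) if i != 1 && i != 2 && i != 3 => f.toDouble }
        (fields.last, Vectors.dense(numeric))
      }
      val training = labeled.filter(_._1 == "normal.").map(_._2)  // "normal" only
      (training, labeled)  // training vectors, full labeled set for validation
    }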
4. Offline Training
The training code mainly follows the tutorial from Sean Owen, Cloudera:
Video: https://www.youtube.com/watch?v=TC5cKYBZAeI
Slides-1: http://www.slideshare.net/CIGTR/anomaly-detection-with-apache-spark
Slides-2: http://www.slideshare.net/cloudera/anomaly-detection-with-apache-spark-2
A couple of modifications have been made to fit personal interest (see the sketch after this list):
Instead of training multiple clusters, the code trains on the "normal" data points only
Only one cluster center is recorded, and the threshold is set to the distance of the last of the 2000 furthest data points
During the later validation stage, all points that are further than the threshold are labeled as "anomalies"
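A minimal sketch of that training step (the iteration count and function name are illustrative assumptions; the threshold is taken as the 2000th-largest distance to the centroid, matching the description above):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Train a single-cluster K-means model on the "normal" points only,
    // then take the distance of the 2000th-furthest point as the threshold.
    def train(normalData: RDD[Vector]): (Vector, Double) = {
      val model = KMeans.train(normalData, k = 1, maxIterations = 20)
      val centroid = model.clusterCenters(0)
      val distances = normalData.map(p => math.sqrt(Vectors.sqdist(p, centroid)))
      val threshold = distances.top(2000).last  // last of the 2000 furthest points
      (centroid, threshold)
    }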
5. Online Testing
Validation is run as a streaming job using Spark Streaming
Currently the application reads the input data from a local file
In an ideal situation, the program would read the data from an ingestion tool such as Kafka (see the sketch below)
The trained model (centroid and threshold) is also saved in a local file
In production, this information should be saved into a database
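A hedged sketch of what the Kafka ingestion could look like, using the Spark Streaming direct API for Kafka 0.8 that was current at the time (spark-streaming-kafka); the broker address and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("anomaly-detection-streaming")
    val ssc = new StreamingContext(conf, Seconds(3))  // 3-second batches

    // Placeholder broker and topic; the current code reads a local file instead.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("kdd-events")

    // Each message value is one comma-separated KDD record.
    val lines = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      .map(_._2)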
6. Spark Streaming Context: Process Every 3 Seconds
Load the trained model: the centroid and threshold are loaded from a local file, and the input is fed in through a queueStream
The streaming task: calculate the distance between each data point and the centroid, then compare it to the threshold (as sketched below)
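Putting this together, a minimal sketch of the streaming job (assuming the input is simulated with a queueStream of RDDs and that centroid and threshold come from the training sketch above; printing stands in for real output):

    import scala.collection.mutable
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext

    // Run validation over the 3-second micro-batches; each RDD pushed onto
    // `rddQueue` is consumed as one batch by the queueStream.
    def run(ssc: StreamingContext,
            rddQueue: mutable.Queue[RDD[Vector]],
            centroid: Vector,
            threshold: Double): Unit = {
      val points = ssc.queueStream(rddQueue)
      points.foreachRDD { rdd =>
        // Flag every point further than the threshold from the centroid.
        val anomalies = rdd.filter(p => math.sqrt(Vectors.sqdist(p, centroid)) > threshold)
        anomalies.foreach(println)  // in production, write to a database instead
      }
      ssc.start()
      ssc.awaitTermination()
    }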
7. Notes
As noted above, the application currently reads its input from a local file; ideally it would read from an ingestion tool such as Kafka
Likewise, the trained model (centroid and threshold) is saved in a local file; in production it should be stored in a database
The output of the testing can also be saved into a database for visualization