Anomaly Detection using Spark MLlib and Spark Streaming
1. Anomaly Detection
Offline Training using Spark MLlib
Online Testing using Spark Streaming
Details: https://github.com/keiraqz/anomaly-detection
Keira Zhou, December 2015
2. The Model
The model is trained using the K-means approach (Spark MLlib KMeans)
It is trained on the "normal" dataset only
After training, the centroid of the "normal" dataset is returned, along with a threshold
During the validation stage, any data point that is further than the threshold from the centroid is considered an "anomaly" (a sketch of this rule follows)
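A minimal Scala sketch of that decision rule (assuming Euclidean distance, the metric K-means minimizes; the helper name isAnomaly is illustrative):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // A point is an anomaly if its Euclidean distance to the single
    // cluster centroid exceeds the learned threshold.
    def isAnomaly(point: Vector, centroid: Vector, threshold: Double): Boolean =
      math.sqrt(Vectors.sqdist(point, centroid)) > threshold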
3. Dataset
The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection [1]
Training Set: the training set is separated from the whole dataset, keeping only the data points labeled "normal" (a preparation sketch follows the reference below)
Validation Set: the validation set uses the whole dataset; all data points that are NOT labeled "normal" are considered "anomalies"
[1] KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
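A minimal data-preparation sketch under those definitions (the file path, helper name, and the choice to simply drop the three categorical columns are assumptions; in the KDD format the last field of each record is the label, e.g. "normal."):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Each KDD record is a comma-separated line whose last field is the
    // label ("normal." or an attack name). Columns 1-3 (protocol, service,
    // flag) are categorical and are dropped here for illustration.
    def prepare(sc: SparkContext, path: String): (RDD[Vector], RDD[(String, Vector)]) = {
      val labeled = sc.textFile(path).map(_.split(',')).map { fields =>
        val numeric = fields.init.zipWithIndex
          .collect { case (f, i) if i != 1 && i != 2 && i != 3 => f.toDouble }
        (fields.last, Vectors.dense(numeric))
      }
      val training = labeled.filter(_._1 == "normal.").map(_._2)  // "normal" only
      (training, labeled)  // training vectors, full labeled set for validation
    }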
4. Offline Training
The training code mainly follows the tutorial from Sean Owen, Cloudera:
Video: https://www.youtube.com/watch?v=TC5cKYBZAeI
Slides-1: http://www.slideshare.net/CIGTR/anomaly-detection-with-apache-spark
Slides-2: http://www.slideshare.net/cloudera/anomaly-detection-with-apache-spark-2
A couple of modifications have been made to fit personal interest (see the sketch after this list):
Instead of training multiple clusters, the code trains on the "normal" data points only
Only one cluster center is recorded, and the threshold is set to the distance of the last of the 2000 furthest data points
During the later validation stage, all points that are further than the threshold are labeled as "anomalies"
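A minimal sketch of that training step (the iteration count and function name are illustrative assumptions; the threshold is taken as the 2000th-largest distance to the centroid, matching the description above):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Train a single-cluster K-means model on the "normal" points only,
    // then take the distance of the 2000th-furthest point as the threshold.
    def train(normalData: RDD[Vector]): (Vector, Double) = {
      val model = KMeans.train(normalData, k = 1, maxIterations = 20)
      val centroid = model.clusterCenters(0)
      val distances = normalData.map(p => math.sqrt(Vectors.sqdist(p, centroid)))
      val threshold = distances.top(2000).last  // last of the 2000 furthest points
      (centroid, threshold)
    }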
5. Online Testing
Validation is run as a streaming job using Spark Streaming
Currently the application reads the input data from a local file
In an ideal situation, the program would read the data from an ingestion tool such as Kafka (see the sketch below)
The trained model (centroid and threshold) is also saved in a local file
In production, this information should be saved into a database
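A hedged sketch of what the Kafka ingestion could look like, using the Spark Streaming direct API for Kafka 0.8 that was current at the time (spark-streaming-kafka); the broker address and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("anomaly-detection-streaming")
    val ssc = new StreamingContext(conf, Seconds(3))  // 3-second batches

    // Placeholder broker and topic; the current code reads a local file instead.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("kdd-events")

    // Each message value is one comma-separated KDD record.
    val lines = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      .map(_._2)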
6. Spark Streaming Context: Process Every 3 Seconds
Load the trained model: the centroid and threshold are loaded from a local file, and the input is fed in through a queueStream
The streaming task: calculate the distance between each data point and the centroid, then compare it to the threshold (as sketched below)
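Putting this together, a minimal sketch of the streaming job (assuming the input is simulated with a queueStream of RDDs and that centroid and threshold come from the training sketch above; printing stands in for real output):

    import scala.collection.mutable
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext

    // Run validation over the 3-second micro-batches; each RDD pushed onto
    // `rddQueue` is consumed as one batch by the queueStream.
    def run(ssc: StreamingContext,
            rddQueue: mutable.Queue[RDD[Vector]],
            centroid: Vector,
            threshold: Double): Unit = {
      val points = ssc.queueStream(rddQueue)
      points.foreachRDD { rdd =>
        // Flag every point further than the threshold from the centroid.
        val anomalies = rdd.filter(p => math.sqrt(Vectors.sqdist(p, centroid)) > threshold)
        anomalies.foreach(println)  // in production, write to a database instead
      }
      ssc.start()
      ssc.awaitTermination()
    }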
7. Notes
As noted above, the application currently reads its input from a local file; ideally it would read from an ingestion tool such as Kafka
Likewise, the trained model (centroid and threshold) is saved in a local file; in production it should be stored in a database
The output of the testing can also be saved into a database for visualization