SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016)

SAMOA: A Platform for
Mining Big Data Streams
Nicolas Kourtellis
Associate Researcher
Telefonica I+D, Barcelona
@kourtellis
@ApacheSAMOA

What is Big Data?
Search queries
Facebook posts
Emails
Tweets
Photo shares
Clicks on ads
…

How BIG is your data?
Volume (+ Variety)
Too large for RAM of single commodity server
Velocity
Too fast for CPU of single commodity server

What is the Streaming Paradigm?
High amount of data, high speed of arrival
Models updated in “real” time
Potentially infinite sequence of data
Change over time (concept drift)

Approximation algorithms:
Single pass, one data item at a time
Sub-linear space and time per data item
Small error with high probability
A platform solution:
Support different algorithms & processing engines
Distributed
Scalable
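As a concrete illustration of the three properties listed for approximation algorithms, the following is a minimal sketch of a Count-Min sketch, a standard streaming summary that is not taken from the deck. It processes one item at a time, uses a fixed amount of memory, and never underestimates a count. The class names and the simple hash mixing are assumptions made for illustration; the textbook error guarantee (overestimate bounded with high probability) assumes pairwise-independent hash functions, which this stand-in hash only approximates.

```java
import java.util.Random;

/** Minimal Count-Min sketch: approximate per-item counts in one pass with fixed memory. */
class CountMinSketch {
    private final long[][] table; // depth rows, each with `width` counters
    private final int[] seeds;    // one hash seed per row
    private final int width;

    CountMinSketch(int depth, int width, long seed) {
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        this.width = width;
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    /** Simple seeded hash (illustrative stand-in for a pairwise-independent family). */
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Process one stream item; the item itself is never stored. */
    void add(String item) {
        for (int row = 0; row < table.length; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    /** Collisions can only inflate counters, so the minimum is never below the true count. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}

class CountMinDemo {
    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1024, 42L);
        String[] stream = {"storm", "flink", "storm", "samza", "storm"};
        for (String s : stream) {
            cms.add(s);
        }
        System.out.println("storm ~ " + cms.estimate("storm"));
    }
}
```

Memory is depth × width counters regardless of stream length, and each item costs depth hash evaluations.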

What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines

Taxonomy
Machine Learning
 Distributed
  Batch: Hadoop, Mahout
  Stream: S4, Storm, SAMOA
 Non-distributed
  Batch: R, WEKA, …
  Stream: MOA

SAMOA Architecture
SAMOA
Machine Learning Algorithms
Distributed Stream Processing Engines
Flink

Why is SAMOA important?
Program once, run everywhere
Reuse existing infrastructure
Avoid deploy cycles
No system downtime
No complex backup/update process
No need to select update frequency

ML Developer API
Processing Item
Processor
Stream

ML Developer API
TopologyBuilder builder;  // assumed to be supplied by the platform
// First source and its output stream
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
// Second source and its output stream
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
// Join: shuffle grouping on the first stream, key grouping on the second
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);

Deployment
SAMOA-API.jar: API. Algorithm developer depends only on this
SAMOA-S4.jar: S4 bindings
SAMOA-Storm.jar: Storm bindings
samoa-s4-deployable.s4r: to S4 cluster
samoa-storm-deployable.jar: to Storm cluster

Easy to test!
bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar
"PrequentialEvaluation
-d /tmp/dump.csv
-i 1000000 -f 100000
-l (classifiers.trees.VerticalHoeffdingTree -p 4 -k)
-s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
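The PrequentialEvaluation task named in the command follows the interleaved test-then-train scheme: each instance is first used to test the current model and only then used to train it. The following is a minimal sketch of that loop, not SAMOA's implementation. The toy majority-class learner, the class names, and the `reportEvery` parameter are hypothetical and only illustrate the order of operations.

```java
import java.util.Random;

/** Toy online learner (hypothetical): predicts the class seen most often so far. */
class MajorityClassLearner {
    private final long[] counts;

    MajorityClassLearner(int numClasses) {
        counts = new long[numClasses];
    }

    int predict() {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    void train(int label) {
        counts[label]++;
    }
}

class PrequentialSketch {
    /** Test-then-train over a stream of labels; returns accuracy over the whole stream. */
    static double run(int[] labels, int numClasses, int reportEvery) {
        MajorityClassLearner learner = new MajorityClassLearner(numClasses);
        long correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (learner.predict() == labels[i]) {
                correct++;                 // 1. test on the unseen instance
            }
            learner.train(labels[i]);      // 2. then train on the same instance
            if ((i + 1) % reportEvery == 0) {
                System.out.printf("%d instances, accuracy so far %.3f%n",
                        i + 1, (double) correct / (i + 1));
            }
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] labels = new int[1000];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = rnd.nextInt(10) < 7 ? 1 : 0; // skewed two-class stream
        }
        System.out.println("final accuracy: " + run(labels, 2, 100));
    }
}
```

Because every instance is scored before the model sees its label, no separate hold-out set is needed, which suits potentially unbounded streams.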

Case study: Decision Trees
VHT: Vertical Hoeffding Tree*
Task parallelism
*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.

Case study: VHT
Horizontal Parallelism
[Figure: horizontal parallelism. Labels: Stream, Instances, Stats (×3), Histograms, Model, Model Updates]

Case study: VHT
Vertical Parallelism
[Figure: vertical parallelism. Labels: Stream, Attributes, Stats (×3), Model, Splits]

Benefits of Vertical Parallelism
High number of attributes:
high level of parallelism (e.g., documents)
vs. task parallelism:
obvious parallelism observed
vs. horizontal parallelism:
reduced memory usage (no model replication)
parallelized split computation
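To make the idea of vertical parallelism concrete, the following sketch splits a single instance by attribute and routes each attribute to a statistics worker by key, so that each worker only ever holds statistics for its own subset of attributes. This is an illustrative assumption, not the VHT implementation: the `AttributeSlice` message shape, the `workerFor` routing rule, and all names are hypothetical, and the real algorithm may key on more than the attribute index.

```java
import java.util.ArrayList;
import java.util.List;

/** One message per attribute (hypothetical shape): index, value, and the instance's class. */
class AttributeSlice {
    final int attributeIndex;
    final double value;
    final int classLabel;

    AttributeSlice(int attributeIndex, double value, int classLabel) {
        this.attributeIndex = attributeIndex;
        this.value = value;
        this.classLabel = classLabel;
    }
}

class VerticalRoutingSketch {
    /** Key grouping: the same attribute index always maps to the same worker. */
    static int workerFor(int attributeIndex, int parallelism) {
        return Math.floorMod(Integer.hashCode(attributeIndex), parallelism);
    }

    /** Split one instance by attribute and bucket the slices per statistics worker. */
    static List<List<AttributeSlice>> route(double[] attributes, int classLabel, int parallelism) {
        List<List<AttributeSlice>> perWorker = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) {
            perWorker.add(new ArrayList<>());
        }
        for (int a = 0; a < attributes.length; a++) {
            perWorker.get(workerFor(a, parallelism))
                     .add(new AttributeSlice(a, attributes[a], classLabel));
        }
        return perWorker;
    }

    public static void main(String[] args) {
        double[] instance = new double[10]; // 10 attributes, values irrelevant here
        List<List<AttributeSlice>> routed = route(instance, 1, 4);
        for (int w = 0; w < routed.size(); w++) {
            System.out.println("worker " + w + " receives " + routed.get(w).size() + " attributes");
        }
    }
}
```

With many attributes, each worker handles roughly attributes/parallelism of them, which is where the reduced memory use and parallel split computation come from.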

Vertical Hoeffding Tree
[Figure: VHT topology. Components: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Control, Split, Result. Legend: Shuffle Grouping, Key Grouping, All Grouping]

Preliminary results: Dense instances
Random decision tree
Mixed categorical and numerical attributes
10-10, 100-100, 1k-1k, 10k-10k
Instances: 1,000,000
2 balanced classes
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT

Results: Accuracy
[Figure: % accuracy (axis 80 to 100) for dense attributes, by nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Series: local, moa]

Results: Accuracy
[Figure: % accuracy (axis 0 to 100) by nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Panels: parallelism = 2, parallelism = 4. Series: sharding, wok, wk(0), wk(1k), wk(10k), local]

Results: Accuracy Evolution

Preliminary results: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative
 Gaussian random variable
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
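The word-sampling part of this setup can be sketched as follows: each artificial tweet draws about 15 words from a vocabulary whose word frequencies follow a Zipf distribution with skew 1.5, and is stored as a bag-of-words vector. This is an assumed reconstruction for illustration, not the generator used in the experiments. The class and method names are hypothetical, and the class-label assignment (the Gaussian random variable mentioned above) is deliberately left out because the slide does not specify it.

```java
import java.util.Random;

/** Hypothetical generator of sparse bag-of-words "tweets" with Zipf-distributed words. */
class ZipfTweetSketch {
    private final double[] cdf; // cumulative probability of word ranks 1..vocabularySize
    private final Random rnd;

    ZipfTweetSketch(int vocabularySize, double skew, long seed) {
        cdf = new double[vocabularySize];
        double norm = 0.0;
        for (int rank = 1; rank <= vocabularySize; rank++) {
            norm += 1.0 / Math.pow(rank, skew);
        }
        double acc = 0.0;
        for (int rank = 1; rank <= vocabularySize; rank++) {
            acc += (1.0 / Math.pow(rank, skew)) / norm;
            cdf[rank - 1] = acc;
        }
        cdf[vocabularySize - 1] = 1.0; // guard against floating-point rounding
        rnd = new Random(seed);
    }

    /** Draw one word id in [0, vocabularySize) by inverting the CDF with binary search. */
    int nextWord() {
        double u = rnd.nextDouble();
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Bag-of-words vector: how often each word id occurs in a tweet of the given length. */
    int[] nextTweet(int words) {
        int[] bag = new int[cdf.length];
        for (int i = 0; i < words; i++) {
            bag[nextWord()]++;
        }
        return bag;
    }

    public static void main(String[] args) {
        ZipfTweetSketch gen = new ZipfTweetSketch(1000, 1.5, 1L);
        int[] tweet = gen.nextTweet(15);
        int nonZero = 0;
        for (int c : tweet) {
            if (c > 0) {
                nonZero++;
            }
        }
        System.out.println("distinct words in tweet: " + nonZero + " of " + tweet.length);
    }
}
```

Most entries of each vector are zero, which is what makes these instances sparse in contrast to the dense random-tree data.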

Results: Accuracy Evolution

Is SAMOA for you?
Are you dealing with:
Big fast data?
Possibly endless streams of data?
Evolving data?
Do you need models updated in real time?
Do you want to test an algorithm on different distributed stream processing engines (DSPEs)?

SAMOA Team
Albert Bifet
Gianmarco De Francisci Morales
Nicolas Kourtellis
Matthieu Morel
Arinto Murdopo
Olivier Van Laere

Status
 Apache Incubator
 Released version 0.3.0 in July
 Input:
  Local FS
  HDFS
  Avro
  Kafka [pending]
 Parallel algorithms:
  Vertical Hoeffding Tree (classification)
  CluStream (clustering)
  Adaptive Model Rules (regression)
  PARMA (frequent pattern mining) [pending]
 Execution engines:
  Heron?
  Apache Beam?

Algorithms in SAMOA
Existing:
 Vertical Hoeffding Tree (classification)
 CluStream (clustering)
 Adaptive Model Rules (regression)
Pending:
 Distributed Naïve Bayes
 Stochastic Gradient Descent
 Adaptive + Boosting VHT
 Parallelized Gradient Boosted Decision Tree
 PARMA (frequent pattern mining)
 …
Check the SAMOA roadmap for more
Looking for contributors!

SAMOA: A Platform for Mining Big Data Streams
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com
