A general overview of Apache SAMOA, a platform for mining big data streams with machine learning algorithms that run on distributed stream processing engines such as Apache Storm, Apache Flink, Apache Samza and Apache Apex.
Results are shown from experiments with VHT, the Vertical Hoeffding Tree proposed in "VHT: Vertical Hoeffding Tree", N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo, IEEE BigData 2016.
Presented at Apache Big Data North America 2016.
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016)
1. SAMOA: A Platform for Mining Big Data Streams
Nicolas Kourtellis
Associate Researcher
Telefonica I+D, Barcelona
@kourtellis
@ApacheSAMOA
2. What is Big Data?
Search queries
Facebook posts
Emails
Tweets
Photo shares
Clicks on ads
…
3. How BIG is your data?
Volume (+ Variety)
Too large for RAM of single commodity server
Velocity
Too fast for CPU of single commodity server
4. What is the Streaming Paradigm?
High amount of data, high speed of arrival
Models updated in (near) real time
Potentially infinite sequence of data
Change over time (concept drift)
5. Mining Big Data Streams
Approximation algorithms:
Single pass, one data item at a time
Sub-linear space and time per data item
Small error with high probability
A platform solution:
Support different algorithms & processing engines
Distributed
Scalable
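As one illustration of the three properties listed under "Approximation algorithms", here is a minimal Count-Min sketch in Python. It is not part of SAMOA and is not taken from the talk; the class and method names are hypothetical. It reads each item once, uses memory that depends only on the error parameters, and overestimates a count by at most epsilon * N with probability at least 1 - delta, assuming well-behaved hash functions (salted MD5 is used here as a practical stand-in).

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter: one pass, fixed memory (illustrative)."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Standard Count-Min sizing: width = ceil(e / epsilon),
        # depth = ceil(ln(1 / delta)). Memory does not grow with the stream.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # N, the number of items seen so far

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        # Single pass: each item touches `depth` counters and is then dropped.
        self.total += 1
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Collisions only add to counters, so the minimum over rows
        # never underestimates the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["storm", "flink", "storm", "samza", "storm"]:
        sketch.add(word)
    print(sketch.estimate("storm"), "of", sketch.total, "items")
```

The same trade-off (bounded memory and per-item work in exchange for a small, probabilistically bounded error) is what makes this family of algorithms usable on unbounded streams.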
6. What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines
9. Why is SAMOA important?
Program once, run everywhere
Reuse existing infrastructure
Avoid deploy cycles
No system downtime
No complex backup/update process
No need to select update frequency
17. Case study: Decision Trees
VHT: Vertical Hoeffding Tree*
Task parallelism
*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
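The slides do not spell out the split rule, but Hoeffding trees in general decide when to split a leaf using the Hoeffding bound. The sketch below shows that standard test in Python; it is an illustration under that assumption, not VHT's actual code, and the function names are hypothetical. After n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the observed
    mean with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon.

    Assumes information gain as the split criterion, whose range is
    log2(number of classes). Tie-breaking thresholds are left out.
    """
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # With 2 classes and delta = 1e-7, 1000 instances give epsilon of about 0.09.
    print(round(hoeffding_bound(1.0, 1e-7, 1000), 4))
    print(should_split(0.30, 0.15, n_classes=2, delta=1e-7, n=1000))
```

Because epsilon shrinks as n grows, a leaf that cannot yet separate two close attributes simply waits for more instances, which is what lets the tree learn in a single pass.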
18. Case study: VHT
[Figure: Horizontal Parallelism. Labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates]
19. Case study: VHT
[Figure: Vertical Parallelism. Labels: Stream, Attributes, Stats (x3), Model, Splits]
20. Benefits of Vertical Parallelism
High number of attributes:
high level of parallelism (e.g., documents)
vs. task parallelism:
obvious parallelism observed
vs. horizontal parallelism:
reduced memory usage (no model replication)
parallelized split computation
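The two benefits over horizontal parallelism can be sketched in a few lines of Python. This is a single-process simulation with hypothetical names, assuming categorical attributes and information gain as the split criterion; it is not SAMOA code. Each worker keeps statistics only for the attributes routed to it, so no worker holds a copy of the model, and each worker scores its own attributes so the split computation is spread across workers.

```python
import math
from collections import defaultdict


def entropy(class_counts):
    """Shannon entropy (bits) of a {label: count} mapping."""
    total = sum(class_counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in class_counts.values() if c
    )


class StatsWorker:
    """Holds sufficient statistics only for the attributes routed to it."""

    def __init__(self):
        # counts[attribute][value][label] -> number of instances seen
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, attribute, value, label):
        self.counts[attribute][value][label] += 1

    def best_local_split(self):
        """Return (attribute, information gain) for the best local attribute."""
        best = (None, -1.0)
        for attribute, by_value in self.counts.items():
            labels = defaultdict(int)
            for per_label in by_value.values():
                for label, count in per_label.items():
                    labels[label] += count
            total = sum(labels.values())
            remainder = sum(
                sum(per_label.values()) / total * entropy(per_label)
                for per_label in by_value.values()
            )
            gain = entropy(labels) - remainder
            if gain > best[1]:
                best = (attribute, gain)
        return best


def train_vertically(instances, n_workers):
    """Route attribute i to worker i % n_workers, then pick the global best."""
    workers = [StatsWorker() for _ in range(n_workers)]
    for attributes, label in instances:
        for index, value in enumerate(attributes):
            workers[index % n_workers].update(index, value, label)
    # The model only compares one local winner per worker.
    return max((w.best_local_split() for w in workers), key=lambda s: s[1])


if __name__ == "__main__":
    data = [((0, 1), 1), ((1, 1), 1), ((0, 0), 0), ((1, 0), 0)]
    print(train_vertically(data, n_workers=2))
```

In this toy data set attribute 1 determines the label and attribute 0 is noise, so the worker holding attribute 1 reports the winning split.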
21. Vertical Hoeffding Tree
[Figure: Vertical Hoeffding Tree topology. Components: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Control, Split, Result. Legend: Stream, Shuffle Grouping, Key Grouping, All Grouping]
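The three groupings named in the topology decide which downstream worker receives each message. The Python sketch below shows their usual semantics in stream processing engines. The function names are hypothetical and this is not SAMOA's API; which stream uses which grouping is not recoverable from the slide text, so the pairing suggested in the note after the code is an assumption.

```python
import itertools
import zlib


def make_shuffle_grouping(n_workers):
    """Spread messages evenly over workers (round-robin in this sketch)."""
    counter = itertools.cycle(range(n_workers))
    return lambda message_key: [next(counter)]


def make_key_grouping(n_workers):
    """Send every message with the same key to the same worker."""
    return lambda message_key: [
        zlib.crc32(str(message_key).encode("utf-8")) % n_workers
    ]


def make_all_grouping(n_workers):
    """Broadcast each message to every worker."""
    return lambda message_key: list(range(n_workers))


if __name__ == "__main__":
    shuffle = make_shuffle_grouping(3)
    by_key = make_key_grouping(3)
    broadcast = make_all_grouping(3)
    print([shuffle(None) for _ in range(4)])       # load-balanced
    print(by_key("attr-7") == by_key("attr-7"))    # stable per key
    print(broadcast("compute-split"))              # every worker
```

A plausible reading, stated here as an assumption: key grouping suits attribute messages, so that the statistics for one attribute always accumulate on the same Stats worker, while all grouping suits control messages that every Stats worker must see.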
22. Preliminary results: Dense instances
Random decision tree
Mixed categorical and numerical attributes
Attributes (categorical-numerical): 10-10, 100-100, 1k-1k, 10k-10k
Instances: 1,000,000
2 balanced classes
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
28. Preliminary results: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative
Gaussian random variable
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
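A stream with the listed parameters could be generated along the following lines. This is a hedged Python sketch, not the generator used in the experiments: the function name is hypothetical, the tweet length is fixed at 15 rather than approximately 15, and the way the Gaussian random variable determines the class is an assumption (each word gets a Gaussian sentiment score and the sign of the sum gives the label).

```python
import random


def make_tweet_generator(vocabulary_size, skew=1.5, tweet_length=15, seed=0):
    """Return a function that yields (bag_of_words, label) pairs."""
    rng = random.Random(seed)
    # Zipf-like weights: the word of rank r is drawn with weight 1 / r**skew.
    weights = [1.0 / (rank ** skew) for rank in range(1, vocabulary_size + 1)]
    # ASSUMPTION: one Gaussian sentiment score per word; the original
    # labelling rule is not described in the slides.
    sentiment = [rng.gauss(0.0, 1.0) for _ in range(vocabulary_size)]

    def next_tweet():
        words = rng.choices(range(vocabulary_size), weights=weights, k=tweet_length)
        bag = {}
        for word in words:
            bag[word] = bag.get(word, 0) + 1
        score = sum(sentiment[word] for word in words)
        label = "positive" if score > 0 else "negative"
        return bag, label

    return next_tweet


if __name__ == "__main__":
    next_tweet = make_tweet_generator(vocabulary_size=1000, seed=42)
    for _ in range(3):
        bag, label = next_tweet()
        print(label, sorted(bag.items())[:5])
```

With a skew of 1.5 most of the probability mass sits on the few lowest-ranked words, so each instance is sparse: only a handful of the 100, 1000 or 10000 attributes are non-zero.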
34. Is SAMOA for you?
Are you dealing with:
Big fast data?
Possibly endless streams of data?
Evolving data?
Do you need updated models in real time?
Do you want to test an algorithm on different distributed stream processing engines (DSPEs)?
36. Status
Apache Incubator
Released version 0.3.0 in July
Execution engines
[Figure: execution engine logos]
Heron?
Apache Beam?
Input:
Local FS
HDFS
Avro
Kafka [pending]
Parallel algorithms
Vertical Hoeffding Tree (classification)
CluStream (clustering)
Adaptive Model Rules (regression)
PARMA (frequent pattern mining) [pending]
37. Algorithms in SAMOA
Existing:
Vertical Hoeffding Tree (classification)
CluStream (clustering)
Adaptive Model Rules (regression)
Pending:
Distributed Naïve Bayes
Stochastic Gradient Descent
Adaptive + Boosting VHT
Parallelized Gradient Boosted Decision Tree
PARMA (frequent pattern mining)
…
Check the SAMOA Roadmap for more
Looking for contributors!
38. SAMOA: A Platform for Mining Big Data Streams
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com