SAMOA: A Platform for
Mining Big Data Streams
Nicolas Kourtellis
Associate Researcher
Telefonica I+D, Barcelona
@kourtellis
@ApacheSAMOA
1
What is Big Data?
Search queries
Facebook posts
Emails
Tweets
Photo shares
Clicks on ads
…
2
How BIG is your data?
Volume (+ Variety)
Too large for RAM of single commodity server
Velocity
Too fast for CPU of single commodity server
3
What is the Streaming Paradigm?
High volume of data, high speed of arrival
Models updated in “real” time
Potentially infinite sequence of data
Change over time (concept drift)
4
Mining Big Data Streams
Approximation algorithms:
Single pass, one data item at a time
Sub-linear space and time per data item
Small error with high probability
A platform solution:
Support different algorithms & processing engines
Distributed
Scalable
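The "small error with high probability" property can be illustrated with the Hoeffding bound, the inequality that gives the Hoeffding tree in the case study its name. The following is a minimal sketch, not SAMOA code; the class and method names are hypothetical, and the range and delta values in `main` are assumptions chosen only for illustration.

```java
/** Illustrative only: names are hypothetical and not part of the SAMOA API. */
final class HoeffdingBound {
    /**
     * One-sided Hoeffding bound: after n independent observations of a variable
     * whose values span `range`, the true mean is at least the observed mean
     * minus epsilon with probability at least 1 - delta, where
     * epsilon = sqrt(range^2 * ln(1/delta) / (2n)).
     */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // The error shrinks as more stream items are seen, with no need to store them.
        for (long n : new long[] {100, 10_000, 1_000_000}) {
            System.out.printf("n=%d epsilon=%.5f%n", n, epsilon(1.0, 1e-7, n));
        }
    }
}
```

The bound depends only on the number of items seen, so it can be evaluated in a single pass with constant memory per tracked statistic.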
5
What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines
6
Taxonomy
Machine Learning
  Distributed
    Batch: Hadoop, Mahout
    Stream: S4, Storm, SAMOA
  Non-distributed
    Batch: R, WEKA, …
    Stream: MOA
7
SAMOA Architecture
SAMOA
Machine Learning Algorithms
Distributed Stream Processing Engines
Flink
8
Why is SAMOA important?
Program once, run everywhere
Reuse existing infrastructure
Avoid deploy cycles
No system downtime
No complex backup/update process
No need to select update frequency
9
ML Developer API
Processing Item
Processor
Stream
10
ML Developer API
// Build a topology: two sources feeding one join processor.
// builder is assumed to be initialized elsewhere.
TopologyBuilder builder;
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);

Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);

// The join consumes streamOne with shuffle grouping and streamTwo with key grouping.
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);
11
Deployment
SAMOA-API.jar: API. Algorithm developer depends only on this
SAMOA-S4.jar: S4 bindings; packaged as samoa-s4-deployable.s4r, deployed to S4 cluster
SAMOA-Storm.jar: Storm bindings; packaged as samoa-storm-deployable.jar, deployed to Storm cluster
12
Easy to get!
13
Easy to test!
bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar
"PrequentialEvaluation
-d /tmp/dump.csv
-i 1000000 -f 100000
-l (classifiers.trees.VerticalHoeffdingTree -p 4 -k)
-s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
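The command above runs a PrequentialEvaluation task. Prequential (test-then-train) evaluation uses each arriving instance first to test the current model and then to train it, so no separate test set is needed. The following toy sketch shows the loop only; it is not SAMOA code. The majority-class learner, the label stream and all names are hypothetical, and the reading of `-i` as the instance limit and `-f` as the reporting frequency is an assumption based on the values used here.

```java
import java.util.function.IntUnaryOperator;

/** Toy sketch of prequential (test-then-train) evaluation; not SAMOA code. */
final class PrequentialSketch {
    /** Hypothetical learner that predicts the class seen most often so far. */
    static final class MajorityClass {
        private final long[] counts = new long[2];
        int predict() { return counts[1] > counts[0] ? 1 : 0; }
        void train(int label) { counts[label]++; }
    }

    /** Returns the accuracy over `limit` labels produced by `labelAt` (index -> 0 or 1). */
    static double run(IntUnaryOperator labelAt, int limit, int reportEvery) {
        MajorityClass model = new MajorityClass();
        long correct = 0;
        for (int i = 0; i < limit; i++) {
            int label = labelAt.applyAsInt(i);
            if (model.predict() == label) correct++; // test first ...
            model.train(label);                      // ... then train on the same instance
            if ((i + 1) % reportEvery == 0) {
                System.out.printf("%d instances, accuracy %.3f%n",
                        i + 1, (double) correct / (i + 1));
            }
        }
        return (double) correct / limit;
    }

    public static void main(String[] args) {
        // Assumed toy stream: three of every four instances belong to class 1.
        run(i -> i % 4 == 0 ? 0 : 1, 1_000, 100);
    }
}
```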
16
Case study: Decision Trees
VHT: Vertical Hoeffding Tree*
17
Case study: VHT
Task Parallelism
*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
18
Case study: VHT
Horizontal Parallelism
[Diagram labels: Stream, Instances, Model, Stats (x3), Histograms, Model Updates]
19
Vertical Parallelism
[Diagram labels: Stream, Attributes, Model, Stats (x3), Splits]
Benefits of Vertical Parallelism
High number of attributes:
  high level of parallelism (e.g., documents)
vs. task parallelism:
  obvious parallelism observed
vs. horizontal parallelism:
  reduced memory usage (no model replication)
  parallelized split computation
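The idea behind vertical parallelism can be sketched as follows: each instance is sliced by attribute, and every attribute is always routed to the same statistics worker, so no worker needs a copy of the full model. This is a toy illustration, not the VHT implementation; the modulo assignment rule and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of vertical (attribute-wise) partitioning; names are hypothetical. */
final class VerticalPartitionSketch {
    /** Worker responsible for an attribute (assumed rule: index modulo parallelism). */
    static int workerFor(int attributeIndex, int parallelism) {
        return Math.floorMod(attributeIndex, parallelism);
    }

    /** Splits one instance into per-worker lists of {attributeIndex, value} pairs. */
    static List<List<int[]>> slice(int[] instance, int parallelism) {
        List<List<int[]>> slices = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) {
            slices.add(new ArrayList<>());
        }
        for (int a = 0; a < instance.length; a++) {
            slices.get(workerFor(a, parallelism)).add(new int[] {a, instance[a]});
        }
        return slices;
    }

    public static void main(String[] args) {
        int[] instance = {1, 0, 1, 1, 0, 0, 1, 0, 1, 1}; // 10 binary attributes
        List<List<int[]>> slices = slice(instance, 4);
        for (int w = 0; w < slices.size(); w++) {
            System.out.println("worker " + w + " receives "
                    + slices.get(w).size() + " attributes");
        }
    }
}
```

Because each worker only sees its own attributes, it can keep the statistics for those attributes and evaluate candidate splits on them independently of the other workers.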
20
Vertical Hoeffding Tree
21
Vertical Hoeffding Tree
[Diagram labels: Source (n), Model (n), Stats (n), Evaluator (1)]
Streams: Instance, Control, Split, Result
Groupings: Shuffle Grouping, Key Grouping, All Grouping
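The three groupings named in the topology (Shuffle Grouping, Key Grouping, All Grouping) decide which parallel instances of a processor receive each event. The following is a minimal sketch of the routing rules, assuming round-robin as a stand-in for shuffling and a hash of the key for key grouping; it is not SAMOA or Storm code and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Toy routing rules for the three groupings; names are hypothetical. */
final class GroupingSketch {
    private int next = 0; // round-robin cursor used to approximate shuffling

    /** Shuffle grouping: spread events evenly over the instances. */
    int[] shuffle(int parallelism) {
        int target = next;
        next = (next + 1) % parallelism;
        return new int[] {target};
    }

    /** Key grouping: events with equal keys always reach the same instance. */
    static int[] key(Object key, int parallelism) {
        return new int[] {Math.floorMod(key.hashCode(), parallelism)};
    }

    /** All grouping: every instance receives a copy of the event. */
    static int[] all(int parallelism) {
        return IntStream.range(0, parallelism).toArray();
    }

    public static void main(String[] args) {
        GroupingSketch router = new GroupingSketch();
        System.out.println("shuffle: " + Arrays.toString(router.shuffle(4)));
        System.out.println("key:     " + Arrays.toString(key("attribute-7", 4)));
        System.out.println("all:     " + Arrays.toString(all(4)));
    }
}
```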
Preliminary results: Dense instances
Random decision tree
Mixed categorical and numerical attributes
10-10, 100-100, 1k-1k, 10k-10k
Instances: 1,000,000
2 balanced classes
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
22
Results: Accuracy
23
[Figure: Dense attributes. % accuracy (80 to 100) vs. nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Series: local, moa.]
Results: Accuracy
[Figure: % accuracy (0 to 100) vs. nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Panels: parallelism = 2, parallelism = 4. Series: sharding, wok, wk(0), wk(1k), wk(10k), local.]
24
Results: Accuracy Evolution
25
Results: Speedup
26
Results: Speedup
27
Preliminary results: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative
  Gaussian random variable
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
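A generator along the lines of this setup can be sketched as follows: words are drawn from a bag of words with Zipf-distributed frequencies (skew 1.5) and grouped into tweets of about 15 words. This is an illustrative assumption of how such data could be produced, not the generator used for the results. All names are hypothetical, and the class label is left out because the slide does not say how the Gaussian variable determines it.

```java
import java.util.Random;

/** Toy Zipf-skewed bag-of-words generator; names are hypothetical. */
final class ZipfTweetSketch {
    private final double[] cdf; // cumulative probability of word ranks 1..V
    private final Random random;

    ZipfTweetSketch(int vocabularySize, double skew, long seed) {
        cdf = new double[vocabularySize];
        double total = 0.0;
        for (int rank = 1; rank <= vocabularySize; rank++) {
            total += 1.0 / Math.pow(rank, skew); // Zipf weight of this rank
            cdf[rank - 1] = total;
        }
        for (int i = 0; i < vocabularySize; i++) {
            cdf[i] /= total; // normalize so the last entry is 1.0
        }
        random = new Random(seed);
    }

    /** Draws one word index; index 0 is the most frequent word. */
    int nextWord() {
        double u = random.nextDouble();
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) { // binary search for the first cdf entry >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Draws a tweet as an array of word indices. */
    int[] nextTweet(int words) {
        int[] tweet = new int[words];
        for (int i = 0; i < words; i++) {
            tweet[i] = nextWord();
        }
        return tweet;
    }

    public static void main(String[] args) {
        ZipfTweetSketch generator = new ZipfTweetSketch(100, 1.5, 42L);
        System.out.println(java.util.Arrays.toString(generator.nextTweet(15)));
    }
}
```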
28
Results: Accuracy
29
Results: Accuracy
30
Results: Accuracy Evolution
31
Results: Speedup
32
Results: Speedup
33
Is SAMOA for you?
Are you dealing with:
Big fast data?
Possibly endless streams of data?
Evolving data?
Do you need updated models in real time?
Do you want to test an algorithm on different DSPEs?
34
SAMOA Team
Albert Bifet
Gianmarco De Francisci Morales
Nicolas Kourtellis
Matthieu Morel
Arinto Murdopo
Olivier Van Laere
35
Status
Apache Incubator
Released version 0.3.0 in July
Input:
  Local FS
  HDFS
  Avro
  Kafka [pending]
Parallel algorithms
  Vertical Hoeffding Tree (classification)
  CluStream (clustering)
  Adaptive Model Rules (regression)
  PARMA (frequent pattern mining) [pending]
Execution engines
  Heron?
  Apache Beam?
36
Algorithms in SAMOA
Existing:
  Vertical Hoeffding Tree (classification)
  CluStream (clustering)
  Adaptive Model Rules (regression)
Pending:
  Distributed Naïve Bayes
  Stochastic Gradient Descent
  Adaptive + Boosting VHT
  Parallelized Gradient Boosted Decision Tree
  PARMA (frequent pattern mining)
  …
Check Samoa Roadmap for more
Looking for contributors!
37
SAMOA: A Platform for
Mining Big Data Streams
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com
38

Mais conteúdo relacionado

Mais procurados

Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Stratio
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache DruidImply
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!DataWorks Summit/Hadoop Summit
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 
Multiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesMultiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesStratio
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseRidwan Fadjar
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Big Data Spain
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerApache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerStratio
 

Mais procurados (20)

Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache Druid
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
Multiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesMultiplaform Solution for Graph Datasources
Multiplaform Solution for Graph Datasources
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
Scaling HDFS at Xiaomi
Scaling HDFS at XiaomiScaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerApache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
 

Destaque

Lianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi LyuLianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi Lyu毅 吕
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Denodo
 
Callcenter HPE IDOL overview
Callcenter HPE IDOL overviewCallcenter HPE IDOL overview
Callcenter HPE IDOL overviewTania Akinina
 
ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016Dinh Le Dat (Kevin D.)
 
クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析aiichiro
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Oxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigDataOxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigDataLudovic Piot
 
BigData HUB Workshop
BigData HUB WorkshopBigData HUB Workshop
BigData HUB WorkshopAhmed Salman
 
GCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionGCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionSimon Su
 
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북BOAZ Bigdata
 
BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016Thiago Santiago
 
Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway Freek van Gool
 
Big Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical SystemsBig Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical SystemsAlexandre Prozoroff
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaDiana Maynard
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics ArchitectureArvind Sathi
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA DATASCIENCE
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thAlton Alexander
 

Destaque (20)

Lianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi LyuLianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi Lyu
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
 
Callcenter HPE IDOL overview
Callcenter HPE IDOL overviewCallcenter HPE IDOL overview
Callcenter HPE IDOL overview
 
ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016
 
クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Oxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigDataOxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigData
 
BigData HUB Workshop
BigData HUB WorkshopBigData HUB Workshop
BigData HUB Workshop
 
GCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionGCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow Introduction
 
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
 
BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016
 
Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway
 
Big Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical SystemsBig Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical Systems
 
NLP from scratch
NLP from scratch NLP from scratch
NLP from scratch
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social media
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Big Data for Big Results in Chinese Social Media
Big Data for Big Results in Chinese Social MediaBig Data for Big Results in Chinese Social Media
Big Data for Big Results in Chinese Social Media
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
 

Semelhante a SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016)

Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Robert Metzger
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrTimothy Spann
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
 
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...Codemotion
 
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera, Inc.
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin MeetupMárton Balassi
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basFlorent Ramiere
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-PipelinesTimothy Spann
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsSrinath Perera
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
 

Semelhante a SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016) (20)

Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache Flink
 
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
 
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Cisco OpenSOC
Cisco OpenSOCCisco OpenSOC
Cisco OpenSOC
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Devoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en basDevoxx university - Kafka de haut en bas
Devoxx university - Kafka de haut en bas
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming Applications
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 





SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016)

  • 1. SAMOA: A Platform for Mining Big Data Streams. Nicolas Kourtellis, Associate Researcher, Telefonica I+D, Barcelona. @kourtellis, @ApacheSAMOA
  • 2. What is Big Data? Search queries; Facebook posts; emails; tweets; photo shares; clicks on ads; …
  • 3. How BIG is your data? Volume (+ Variety): too large for the RAM of a single commodity server. Velocity: too fast for the CPU of a single commodity server.
  • 4. What is the Streaming Paradigm? High amount of data, high speed of arrival; models updated in “real” time; potentially infinite sequence of data; change over time (concept drift).
  • 5. Mining Big Data Streams. Approximation algorithms: single pass, one data item at a time; sub-linear space and time per data item; small error with high probability. A platform solution: support for different algorithms and processing engines; distributed; scalable.
  • 6. What is SAMOA? Scalable Advanced Massive Online Analysis: a platform for mining big data streams; a framework for developing new distributed stream mining algorithms; a framework for deploying algorithms on new distributed stream processing engines.
  • 9. Why is SAMOA important? Program once, run everywhere; reuse existing infrastructure; avoid deploy cycles; no system downtime; no complex backup/update process; no need to select update frequency.
  • 10. ML Developer API: Processing Item, Processor, Stream.
  • 11. ML Developer API
        TopologyBuilder builder;
        // First source and the stream it emits
        Processor sourceOne = new SourceProcessor();
        builder.addProcessor(sourceOne);
        Stream streamOne = builder.createStream(sourceOne);
        // Second source and the stream it emits
        Processor sourceTwo = new SourceProcessor();
        builder.addProcessor(sourceTwo);
        Stream streamTwo = builder.createStream(sourceTwo);
        // Join processor: streamOne arrives shuffled, streamTwo arrives grouped by key
        Processor join = new JoinProcessor();
        builder.addProcessor(join)
            .connectInputShuffle(streamOne)
            .connectInputKey(streamTwo);
  • 16. Easy to test! bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
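The PrequentialEvaluation task named in the command above follows the interleaved test-then-train scheme: each instance is first used to test the current model and only then to train it. A minimal Java sketch of that loop, assuming a hypothetical majority-class learner and a hypothetical synthetic label stream (neither is SAMOA code):

```java
import java.util.Random;

/** Sketch of prequential (test-then-train) evaluation; not SAMOA's implementation. */
final class PrequentialSketch {

    /** Hypothetical learner: always predicts the class seen most often so far. */
    static final class MajorityClass {
        private final long[] counts = new long[2];

        int predict() {
            // Ties (including the empty model) fall back to class 0.
            return counts[1] > counts[0] ? 1 : 0;
        }

        void train(int label) {
            counts[label]++;
        }
    }

    /** Runs test-then-train over the labels and returns the fraction predicted correctly. */
    static double evaluate(int[] labels) {
        MajorityClass learner = new MajorityClass();
        long correct = 0;
        for (int label : labels) {
            if (learner.predict() == label) { // test first, before the model sees this instance
                correct++;
            }
            learner.train(label);             // then train on the same instance
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);           // fixed seed, an assumption for repeatability
        int[] labels = new int[100_000];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = rng.nextDouble() < 0.7 ? 1 : 0; // assumed 70/30 class skew
        }
        System.out.printf("prequential accuracy: %.3f%n", evaluate(labels));
    }
}
```

Because every instance is tested before it is learned, no separate hold-out set is needed, which suits a potentially endless stream.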
  • 17. Case study: Decision Trees. VHT: Vertical Hoeffding Tree*. [Figure: task parallelism] *VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
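The slide does not spell out the bound, but a Hoeffding tree decides when to split a leaf with the standard Hoeffding bound: after n observations of a quantity with range R, the observed mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability 1 − δ. A minimal Java sketch, in which the method names and the example values for R, δ and n are assumptions for illustration:

```java
/** Sketch of the Hoeffding-bound split test; parameter values are illustrative assumptions. */
final class HoeffdingBoundSketch {

    /** epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split when the best attribute beats the runner-up by more than epsilon. */
    static boolean shouldSplit(double bestGain, double secondGain,
                               double range, double delta, long n) {
        return bestGain - secondGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;   // assumed: information gain with 2 classes has range log2(2) = 1
        double delta = 1e-7;  // assumed confidence parameter
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.printf("n=%d epsilon=%.4f%n", n, epsilon(range, delta, n));
        }
    }
}
```

The bound shrinks as 1/sqrt(n), so a leaf that has seen more instances needs a smaller gap between the two best attributes before it splits.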
  • 18. Case study: VHT. [Figure: horizontal parallelism; labels: Stream, Instances, Model, Stats (x3), Histograms, Model Updates]
  • 19. Case study: VHT. [Figure: vertical parallelism; labels: Stream, Model, Attributes, Stats (x3), Splits]
  • 20. Benefits of Vertical Parallelism. High number of attributes: high level of parallelism (e.g., documents). Vs. task parallelism: obvious parallelism observed. Vs. horizontal parallelism: reduced memory usage (no model replication); parallelized split computation.
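The memory argument on this slide can be made concrete with a simplified count. The Java sketch below assumes attribute i is assigned to stats worker i mod p and counts one "statistics slot" per attribute held by a worker; both the assignment rule and the slot accounting are assumptions for illustration, not SAMOA's actual bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of attribute (vertical) partitioning; not SAMOA's implementation. */
final class VerticalPartitionSketch {

    /** Assigns attribute i to stats worker i mod parallelism (an assumed rule). */
    static List<List<Integer>> partitionAttributes(int numAttributes, int parallelism) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) {
            workers.add(new ArrayList<>());
        }
        for (int attr = 0; attr < numAttributes; attr++) {
            workers.get(attr % parallelism).add(attr);
        }
        return workers;
    }

    /** Horizontal: every worker keeps statistics for every attribute. */
    static long horizontalSlots(int numAttributes, int parallelism) {
        return (long) numAttributes * parallelism;
    }

    /** Vertical: each attribute's statistics live on exactly one worker. */
    static long verticalSlots(int numAttributes, int parallelism) {
        long total = 0;
        for (List<Integer> owned : partitionAttributes(numAttributes, parallelism)) {
            total += owned.size();
        }
        return total;
    }

    public static void main(String[] args) {
        int attributes = 10;
        int parallelism = 4;
        System.out.println("assignment: " + partitionAttributes(attributes, parallelism));
        System.out.println("horizontal slots: " + horizontalSlots(attributes, parallelism));
        System.out.println("vertical slots:   " + verticalSlots(attributes, parallelism));
    }
}
```

Under these assumptions the vertical total stays equal to the number of attributes however many workers are added, while the horizontal total grows with the parallelism.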
  • 21. Vertical Hoeffding Tree. [Figure: topology with Source (n), Model (n), Stats (n), Evaluator (1); labels: Instance, Control, Split, Result; legend: Stream, Shuffle Grouping, Key Grouping, All Grouping]
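The three groupings named in the figure legend differ only in how a message picks its destination instances. A minimal Java sketch, assuming round-robin for shuffle (engines may choose randomly instead) and hash-modulo for key grouping; the class and method names are hypothetical, not the SAMOA API:

```java
/** Sketch of shuffle, key and all groupings over n destination instances. */
final class GroupingSketch {
    private final int n;   // number of destination processor instances
    private int next = 0;  // round-robin cursor used by shuffle

    GroupingSketch(int n) {
        this.n = n;
    }

    /** Shuffle grouping: spread messages evenly (round-robin here, an assumption). */
    int[] shuffle() {
        int target = next;
        next = (next + 1) % n;
        return new int[] {target};
    }

    /** Key grouping: the same key always reaches the same instance. */
    int[] key(Object key) {
        return new int[] {Math.floorMod(key.hashCode(), n)};
    }

    /** All grouping: broadcast to every instance. */
    int[] all() {
        int[] targets = new int[n];
        for (int i = 0; i < n; i++) {
            targets[i] = i;
        }
        return targets;
    }

    public static void main(String[] args) {
        GroupingSketch grouping = new GroupingSketch(3);
        System.out.println("shuffle -> " + grouping.shuffle()[0]);
        System.out.println("key(7)  -> " + grouping.key(7)[0]);
        System.out.println("all     -> " + grouping.all().length + " instances");
    }
}
```

Key grouping is what makes vertical parallelism workable: keyed by attribute, all statistics for one attribute accumulate on a single instance.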
  • 22. Preliminary results: dense instances. Random decision tree; mixed categorical and numerical attributes (10-10, 100-100, 1k-1k, 10k-10k); instances: 1,000,000; 2 balanced classes; 10 different seeded runs; test every 100k instances; MOA HT vs. local VHT vs. Storm cluster VHT.
  • 23. Results: Accuracy. [Plot: % accuracy (80 to 100) vs. nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k), dense attributes; series: local, moa]
  • 24. Results: Accuracy. [Plots: % accuracy (0 to 100) vs. nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k); panels: parallelism = 2, parallelism = 4; series: sharding, wok, wk(0), wk(1k), wk(10k), local]
  • 28. Preliminary results: artificial tweets. Zipf skew: 1.5; bag of words: 100, 1000, 10000 (attributes); size of tweet: ~15 words; instances: 1,000,000; class: positive or negative, Gaussian random variable; 10 different seeded runs; test every 100k instances; MOA HT vs. local VHT vs. Storm cluster VHT.
  • 34. Is SAMOA for you? Are you dealing with big, fast data? Possibly endless streams of data? Evolving data? Do you need models updated in real time? Do you want to test an algorithm on different distributed stream processing engines (DSPEs)?
  • 35. SAMOA Team: Albert Bifet, Gianmarco De Francisci Morales, Nicolas Kourtellis, Matthieu Morel, Arinto Murdopo, Olivier Van Laere.
  • 36. Status: Apache Incubator; released version 0.3.0 in July. Input: Local FS, HDFS, Avro, Kafka [pending]. Parallel algorithms: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression), PARMA (frequent pattern mining) [pending]. Execution engines: [logos not recovered]; Heron? Apache Beam?
  • 37. Algorithms in SAMOA. Existing: Vertical Hoeffding Tree (classification); CluStream (clustering); Adaptive Model Rules (regression). Pending: Distributed Naïve Bayes; Stochastic Gradient Descent; Adaptive + Boosting VHT; Parallelized Gradient Boosted Decision Tree; PARMA (frequent pattern mining); … Check the SAMOA roadmap for more. Looking for contributors!
  • 38. SAMOA: A Platform for Mining Big Data Streams. @ApacheSAMOA; http://samoa.incubator.apache.org/ ; https://github.com/apache/incubator-samoa ; Nicolas Kourtellis, @kourtellis, nicolas.kourtellis@telefonica.com