Stream processing comparison

Stream processing systems comparison
Yangjun Wang
Department of Information and Computer Science
Aalto University, School of Science
yangjun.wang@aalto.fi
January 20, 2016

Introduction
The processing model of many big data applications is changing
from batch processing to stream processing
Batch processing has an advantage in throughput, while the
latency of stream processing is much lower
Stream processing can also reach very high throughput

Widely used stream processing systems: Storm, Spark
Streaming, Flink, Samza

Comparison
Processing model
Storm and Flink are true stream processors that process
records one by one
Spark Streaming is micro-batch: it processes very small
batches continuously
Storm's Trident also provides a micro-batch API
Throughput of WordCount – skewed data
Flink – 300K/s (4 cores, 15 GB RAM)
Storm (ack enabled) – 5K/s per node
Spark Streaming – 250K ∼ 2500K/s (depends on batch)
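
The two processing models above can be contrasted with a small, framework-independent sketch in plain Python. This is not Storm, Flink, or Spark code: the function names are hypothetical, and `batch_size` is an assumed stand-in for the batch interval that a real micro-batch system would use.

```python
from collections import Counter

def record_at_a_time(records):
    """Update the count and emit a result for every single record."""
    counts = Counter()
    for word in records:
        counts[word] += 1
        yield word, counts[word]

def micro_batch(records, batch_size):
    """Buffer records into small batches and emit one update per batch."""
    counts = Counter()
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        counts.update(batch)
        # Only words seen in this batch appear in the emitted update.
        yield {word: counts[word] for word in sorted(set(batch))}

words = ["a", "b", "a", "c", "a", "b"]
per_record = list(record_at_a_time(words))
per_batch = list(micro_batch(words, batch_size=3))
print(len(per_record), "updates from the record-at-a-time model")
print(len(per_batch), "updates from the micro-batch model")
```

Both models arrive at the same final counts. The difference is when results become visible: after every record, or only once a batch is complete.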

Comparison (cont.)
Latency of WordCount – skewed data
Flink: around 50 ms (90%)
Storm: around 55 ms (90%)
Spark Streaming: 1 s ∼ ... (depends on batch interval)

Comparison (cont.)
Usage of Spark and Flink
Flink and Spark provide many high-level operations that are
easy to use, for example:
stream1.flatMap(...)
.mapToPair(...)
.reduceByKey(...)
Usage of Storm
In Storm applications, we need to define the stream sources (spouts)
and all processing logic (bolts) ourselves.
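
What the high-level chain above computes can be sketched in plain Python. The helpers `flat_map`, `map_to_pair`, and `reduce_by_key` are hypothetical stand-ins written for this illustration, not the Spark or Flink API, and they run over a finite list rather than an unbounded stream.

```python
from itertools import chain

def flat_map(fn, stream):
    """Apply fn to every item and flatten the resulting sequences."""
    return chain.from_iterable(fn(item) for item in stream)

def map_to_pair(fn, stream):
    """Turn every item into a (key, value) pair."""
    return (fn(item) for item in stream)

def reduce_by_key(fn, pairs):
    """Combine all values that share a key using fn."""
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result

lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)
pairs = map_to_pair(lambda w: (w, 1), words)
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)
```

In a real streaming job the reduce step would run per batch or per window instead of over the whole input.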

Comparison (cont.)
Usage
More work needs to be done in Storm applications, but we get
more flexibility.
Flink provides low-level operators that are similar to
Storm bolts, such as OneInputStreamOperator and
TwoInputStreamOperator. These operators are not too
complex to use.
Spark Streaming's low-level operators are somewhat hard to use.
Spark Streaming can also lose some capabilities because of its
micro-batch processing model.

Example
Problem
There are two streams: advertisement(advId, shownTime)
and click(advId, clickTime). How do we get a stream that
contains all clicked advertisements (advId, shownTime,
clickTime) that were clicked within 10 minutes after being shown?

Solution in Storm
Implement a bolt that receives records from the two spouts,
caches the records, and performs the join
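
The caching join described above might look roughly like the following plain-Python sketch. `JoinBolt` and its methods are hypothetical names and do not use the Storm API. The sketch also assumes one advertisement record per advId and times in milliseconds.

```python
WINDOW_MS = 10 * 60 * 1000  # a click counts if it happens within 10 minutes

class JoinBolt:
    """Hypothetical stand-in for a Storm bolt fed by two spouts."""

    def __init__(self):
        self.ads = {}     # advId -> shownTime
        self.clicks = {}  # advId -> clickTimes still waiting for their advertisement

    def on_advertisement(self, adv_id, shown_time):
        self.ads[adv_id] = shown_time
        # Join with clicks that arrived before the advertisement record.
        waiting = self.clicks.pop(adv_id, [])
        return [(adv_id, shown_time, t) for t in waiting
                if self._matches(shown_time, t)]

    def on_click(self, adv_id, click_time):
        shown_time = self.ads.get(adv_id)
        if shown_time is None:
            # Advertisement not seen yet: cache the click for later.
            self.clicks.setdefault(adv_id, []).append(click_time)
            return []
        if self._matches(shown_time, click_time):
            return [(adv_id, shown_time, click_time)]
        return []

    @staticmethod
    def _matches(shown_time, click_time):
        return 0 <= click_time - shown_time <= WINDOW_MS

bolt = JoinBolt()
out = []
out += bolt.on_advertisement("ad1", 0)
out += bolt.on_click("ad1", 5 * 60 * 1000)    # 5 minutes later: joined
out += bolt.on_click("ad2", 1000)             # click arrives before its advertisement
out += bolt.on_advertisement("ad2", 0)        # joined once the advertisement shows up
out += bolt.on_click("ad1", 11 * 60 * 1000)   # more than 10 minutes: dropped
print(out)
```

The sketch never evicts cached records. A real bolt would need to expire old entries to keep its state bounded.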

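Example (cont.)
Problems with Flink
1. Flink only provides a join operation on the same window
2. Windows without slides miss data: an advertisement and its click can fall into different windows
3. Windows with slides can introduce duplicate data: the same pair can fall into several overlapping windows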
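
Problems 2 and 3 can be reproduced with a small simulation of window assignment. This is plain Python, not Flink code. It assumes windows of the form [start, start + size) aligned to multiples of the slide, with times in seconds, and the function names are hypothetical.

```python
MINUTE = 60

def windows_for(timestamp, size, slide):
    """Start times of all windows [start, start + size) containing timestamp."""
    start = timestamp - timestamp % slide
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return starts

def windowed_join(ads, clicks, size, slide):
    """Join ads and clicks with the same advId once per window they share."""
    results = []
    for adv_id, shown in ads:
        for click_id, clicked in clicks:
            if adv_id != click_id:
                continue
            shared = (set(windows_for(shown, size, slide))
                      & set(windows_for(clicked, size, slide)))
            results.extend((adv_id, shown, clicked) for _ in shared)
    return results

ads = [("ad1", 9 * MINUTE), ("ad2", 6 * MINUTE)]
clicks = [("ad1", 11 * MINUTE), ("ad2", 8 * MINUTE)]  # both clicked 2 minutes later

tumbling = windowed_join(ads, clicks, size=10 * MINUTE, slide=10 * MINUTE)
sliding = windowed_join(ads, clicks, size=10 * MINUTE, slide=5 * MINUTE)
print("tumbling:", tumbling)  # ad1 is missing: shown and clicked in different windows
print("sliding:", sliding)    # ad2 appears twice: two overlapping windows contain the pair
```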

Solution in Flink
Implement a join operator based on
TwoInputStreamOperator, similar to
WindowOperator.
The custom operator is, to some extent, similar to the Storm
solution.

Example (cont.)
Problems with Spark
1. Spark doesn't support event-time joins or watermarks
2. Problems similar to Flink's (2, 3)

Solution in Spark
advertisement.window(11 mins, 1min)
.join(click.window(1min, 1min))
.filter(...)
Issues
Spark only supports joins on processing time
The filter operation is based on event time
Data is missed if advertisement records arrive late (delayed)
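
The late-arrival issue can be illustrated with a plain-Python simulation of the windowed join. This is not Spark code. Each record carries an `arrival_time` field, a hypothetical stand-in for processing time, next to its event time, and all times are in seconds.

```python
MINUTE = 60

def spark_style_join(ads, clicks, end_time):
    """ads and clicks are lists of (adv_id, event_time, arrival_time)."""
    results = []
    for now in range(MINUTE, end_time + 1, MINUTE):  # one batch per minute
        # Windows are built on arrival (processing) time.
        ad_window = [a for a in ads if now - 11 * MINUTE < a[2] <= now]
        click_window = [c for c in clicks if now - MINUTE < c[2] <= now]
        for adv_id, shown, _ in ad_window:
            for click_id, clicked, _ in click_window:
                # The filter step works on event time.
                if adv_id == click_id and 0 <= clicked - shown <= 10 * MINUTE:
                    results.append((adv_id, shown, clicked))
    return results

ads = [
    ("ad1", 0, 10),    # arrives shortly after being shown
    ("ad2", 0, 400),   # shown at 0 but delivered late
]
clicks = [
    ("ad1", 300, 305),
    ("ad2", 120, 125),  # valid click, but its advertisement has not arrived yet
]
results = spark_style_join(ads, clicks, end_time=10 * MINUTE)
print(results)
```

The click on ad2 is valid in event time, but it falls into a one-minute click window that closes before the delayed advertisement record arrives, so the pair is never joined.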

Summary
Comparison summary table:
             Storm          Spark        Flink
Model        stream         micro-batch  stream
Throughput   low            high         high
Latency      low            high         low
Usage        complex        easy         easy
Flexibility  very flexible  inflexible   flexible

Thanks
