Stream processing systems comparison
Yangjun Wang
Department of Information and Computer Science
Aalto University, School of Science
yangjun.wang@aalto.fi
January 20, 2016
Introduction
The processing model of many big data applications is changing
from batch processing to stream processing
Batch processing has an advantage in throughput, while
stream processing has much lower latency
Stream processing can also achieve very high throughput
Widely used stream processing systems: Storm, Spark
Streaming, Flink, Samza
Comparison
Processing model
Storm and Flink are true stream processors, which process
records one by one
Spark Streaming is micro-batch: it processes very small
batches continuously
Storm’s Trident also provides a micro-batch API
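The difference between the two processing models can be sketched in plain Python, without any of the systems above. This is only an illustration under stated assumptions: the function names are hypothetical, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter


def process_record_at_a_time(words):
    """Record-at-a-time model: emit an updated (word, count) after every record."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def process_micro_batches(words, batch_size):
    """Micro-batch model: buffer records and emit one result per small batch.

    Assumption: batches are cut by record count to keep the sketch short.
    """
    batch = []
    for word in words:
        batch.append(word)
        if len(batch) == batch_size:
            yield dict(Counter(batch))
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield dict(Counter(batch))


words = ["a", "b", "a", "c", "a"]
per_record = list(process_record_at_a_time(words))
per_batch = list(process_micro_batches(words, batch_size=2))
```

In the record-at-a-time version a result is available as soon as each record is seen; in the micro-batch version nothing is emitted until a batch is complete, which is where the extra latency comes from.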
Throughput of WordCount – skewed data
Flink – 300K/s (4 cores, 15 GB RAM)
Storm (ack enabled) – 5K/s per node
Spark Streaming – 250K/s ∼ 2500K/s (depends on batch)
Comparison (cont.)
Latency of WordCount – skewed data
Flink: around 50ms (90%)
Storm: around 55ms (90%)
Spark: 1s ∼ ... (depends on the batch interval)
Comparison (cont.)
Usage of Spark and Flink
Flink and Spark provide many high-level operations that are
easy to chain, for example:
stream1.flatMap(...)
.mapToPair(...)
.reduceByKey(...)
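What such a chain computes for WordCount can be sketched in plain Python. The helpers flat_map, map_to_pair and reduce_by_key are hypothetical stand-ins that only mimic the semantics of the operators named above; they are not the Spark or Flink API.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the resulting sequences."""
    return list(chain.from_iterable(func(r) for r in records))


def map_to_pair(func, records):
    """Turn each record into a (key, value) pair."""
    return [func(r) for r in records]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)                  # split lines into words
pairs = map_to_pair(lambda w: (w, 1), words)        # (word, 1) pairs
counts = reduce_by_key(lambda a, b: a + b, pairs)   # sum the ones per word
```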
Usage of Storm
In Storm applications, we need to define the stream sources (spouts) and
all processing logic (bolts) ourselves.
Comparison (cont.)
Usage
Storm applications require more work, but offer
more flexibility.
Flink provides low-level operators that are similar to
Storm bolts, such as OneInputStreamOperator and
TwoInputStreamOperator. These operators are not too
complex to use.
The low-level operators of Spark Streaming are somewhat hard to use.
Spark Streaming can also lose some capabilities because of
its micro-batch processing model.
Example
Problem
There are two streams: advertisement(advId, shownTime)
and click(advId, clickTime). How can we get a stream that
contains all clicked advertisements (advId, shownTime,
clickTime) that were clicked within 10 minutes of being shown?
Solution of Storm
Implement a bolt that receives records from two spouts,
caches the records, and performs the join
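The logic of such a join bolt can be sketched in plain Python. This is not the Storm API: the class and method names are hypothetical, timestamps are assumed to be event times in seconds, each advId is assumed to be shown only once, and cache eviction is left out.

```python
from typing import Dict, List, Tuple

WINDOW_SECONDS = 10 * 60  # a click counts if it happens within 10 minutes


class AdClickJoinBolt:
    """Caches records from both streams and joins them on advId."""

    def __init__(self) -> None:
        self.shown: Dict[str, int] = {}                 # advId -> shownTime
        self.pending_clicks: Dict[str, List[int]] = {}  # clicks seen before their ad
        self.output: List[Tuple[str, int, int]] = []    # (advId, shownTime, clickTime)

    def on_advertisement(self, adv_id: str, shown_time: int) -> None:
        self.shown[adv_id] = shown_time
        # Join with clicks that arrived before this advertisement record.
        for click_time in self.pending_clicks.pop(adv_id, []):
            self._emit_if_in_window(adv_id, shown_time, click_time)

    def on_click(self, adv_id: str, click_time: int) -> None:
        shown_time = self.shown.get(adv_id)
        if shown_time is None:
            self.pending_clicks.setdefault(adv_id, []).append(click_time)
        else:
            self._emit_if_in_window(adv_id, shown_time, click_time)

    def _emit_if_in_window(self, adv_id: str, shown_time: int, click_time: int) -> None:
        if 0 <= click_time - shown_time <= WINDOW_SECONDS:
            self.output.append((adv_id, shown_time, click_time))


bolt = AdClickJoinBolt()
bolt.on_advertisement("ad1", 0)
bolt.on_click("ad1", 300)          # 5 minutes after shown: joined
bolt.on_click("ad2", 1000)         # click record arrives before its ad record
bolt.on_advertisement("ad2", 900)  # joined once the ad record arrives
bolt.on_advertisement("ad3", 0)
bolt.on_click("ad3", 700)          # more than 10 minutes after shown: dropped
```

Because the bolt keeps its own cache, the join does not depend on window boundaries or on the order in which the two records arrive.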
Example (cont.)
Problems of Flink
1. Flink only provides a join operation over the same window
2. Windows without a slide miss matches that span a window boundary
3. Windows with a slide can produce duplicate results
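Problems 2 and 3 can be reproduced with a small plain-Python sketch of a same-window join. It is an illustration only, not Flink code: window_starts and windowed_join are hypothetical helpers, timestamps are event times in seconds, and windows are half-open intervals [start, start + size).

```python
def window_starts(timestamp, size, slide):
    """Start times of all windows [start, start + size) that contain timestamp."""
    start = timestamp - timestamp % slide
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return starts


def windowed_join(ads, clicks, size, slide):
    """Join ads and clicks on advId, but only within the same window."""
    ads_by_window = {}
    for adv_id, shown_time in ads:
        for start in window_starts(shown_time, size, slide):
            ads_by_window.setdefault((start, adv_id), []).append(shown_time)
    results = []
    for adv_id, click_time in clicks:
        for start in window_starts(click_time, size, slide):
            for shown_time in ads_by_window.get((start, adv_id), []):
                results.append((adv_id, shown_time, click_time))
    return results


ads = [("ad1", 540), ("ad2", 360)]
clicks = [("ad1", 660), ("ad2", 420)]  # both clicked within 2 minutes

# No slide: 10-minute windows that do not overlap.
tumbling = windowed_join(ads, clicks, size=600, slide=600)
# With slide: 10-minute windows that advance every 5 minutes.
sliding = windowed_join(ads, clicks, size=600, slide=300)
```

Without a slide, ad1 is lost because it is shown in window [0, 600) and clicked in [600, 1200). With a slide, ad1 is found, but ad2 is emitted twice because both of its records fall into two overlapping windows.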
Solution of Flink
Implement a join operator that extends
TwoInputStreamOperator and works similarly to
WindowOperator.
This self-implemented operator is in some respects similar to
the Storm solution.
Example (cont.)
Problems of Spark
1. Spark does not support event-time joins or watermarks
2. The same window problems as Flink (items 2 and 3)
Solution of Spark
advertisement.window(11 mins, 1min)
.join(click.window(1min, 1min))
.filter(...)
Issues
Spark only supports joins on processing time
The filter operation is based on event time
Data is lost if advertisement records arrive late (delayed)
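The last issue can be reproduced with a plain-Python simulation of the windowed join above. It is a sketch under stated assumptions, not Spark code: spark_style_join is a hypothetical helper, every record carries an extra arrival (processing) time, all times are in minutes, and one batch is evaluated per minute.

```python
def spark_style_join(ads, clicks, batch_times,
                     ad_window=11, click_window=1, max_gap=10):
    """Join on arrival-time windows, then filter on event time.

    Records are (advId, eventTime, arrivalTime) tuples.
    """
    results = []
    for now in batch_times:
        # Windows are selected by arrival (processing) time.
        recent_ads = [a for a in ads if now - ad_window < a[2] <= now]
        recent_clicks = [c for c in clicks if now - click_window < c[2] <= now]
        for adv_id, shown_time, _ in recent_ads:
            for click_id, click_time, _ in recent_clicks:
                # The filter step compares event times.
                if adv_id == click_id and 0 <= click_time - shown_time <= max_gap:
                    results.append((adv_id, shown_time, click_time))
    return results


ads = [
    ("ad1", 0, 0),  # shown at minute 0, record arrives on time
    ("ad2", 1, 5),  # shown at minute 1, record arrives 4 minutes late
]
clicks = [
    ("ad1", 3, 3),  # clicked 3 minutes after shown
    ("ad2", 2, 2),  # clicked 1 minute after shown
]
joined = spark_style_join(ads, clicks, batch_times=range(1, 13))
```

Here ad2 was clicked one minute after it was shown, but its advertisement record arrives after the one-minute click window has already passed, so the pair never meets in a join and is missing from the result.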
Summary
Comparison summary table:
              Storm           Spark         Flink
Model         stream          micro-batch   stream
Throughput    low             high          high
Latency       low             high          low
Usage         complex         easy          easy
Flexibility   very flexible   flexible      inflexible