This is the transcript of the academic conference talk on Spark Streaming, in which I introduce the concept of Discretized Streams and how it achieves large-scale, efficient, fault-tolerant streaming in a different way than traditional stream processing systems.
2. Motivation
• Many big-data applications need to process large data streams in near-real time
– Website monitoring
– Fraud detection
– Ad monetization
• Require tens to hundreds of nodes
• Require second-scale latencies
4. Challenge
• Stream processing systems must recover from failures and stragglers quickly and efficiently
– More important for streaming systems than batch systems
• Traditional streaming systems don’t achieve these properties simultaneously
6. Traditional Streaming Systems
• Continuous operator model
– Each node runs an operator with in-memory mutable state
– For each input record, state is updated and new records are sent out
• Mutable state is lost if node fails
• Various techniques exist to make state fault-tolerant
[Diagram: input records flow into nodes 1 and 2 and onward to node 3; each node holds mutable state]
7. Fault-tolerance in Traditional Systems
Node Replication [e.g. Borealis, Flux]
• A separate set of “hot failover” nodes processes the same data streams
• Synchronization protocols ensure exact ordering of records in both sets
• On failure, the system switches over to the failover nodes
Fast recovery, but 2x hardware cost
[Diagram: the input streams feed both the primary nodes and the hot failover nodes, kept in lockstep by a sync protocol]
8. Fault-tolerance in Traditional Systems
Upstream Backup [e.g. TimeStream, Storm]
• Each node maintains a backup of the forwarded records since the last checkpoint
• A “cold failover” node is maintained
• On failure, upstream nodes replay the backup records serially to the failover node to recreate the state
Only need 1 standby, but slow recovery
[Diagram: upstream nodes replay their backed-up records to the cold failover node]
10. Our Goal
• Scales to hundreds of nodes
• Achieves second-scale latency
• Tolerates node failures and stragglers
• Sub-second fault and straggler recovery
• Minimal overhead beyond base processing
12. Why is it hard?
Stateful continuous operators tightly integrate “computation” with “mutable state”
Makes it harder to define clear boundaries when computation and state can be moved around
[Diagram: a stateful continuous operator with mutable state, consuming input records and emitting output records]
13. Dissociate computation from state
Make state immutable and break computation into small, deterministic, stateless tasks
Defines clear boundaries where state and computation can be moved around independently
[Diagram: a chain of stateless tasks; each task consumes the previous state and an input batch (input 1, 2, 3) and produces the next state (state 1, 2, …)]
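The state-threading idea above can be sketched in plain Python (an illustrative sketch, not the system's actual API; all names here are made up):

```python
# Illustrative sketch: state is threaded through small, deterministic,
# stateless tasks. Each task takes the previous (immutable) state plus one
# input batch and returns a new state, so any task can be re-run anywhere
# to regenerate a lost state.

def stateless_task(prev_state, input_batch):
    # Deterministic: the output depends only on the task's inputs.
    new_state = dict(prev_state)            # never mutate the previous state
    for key in input_batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state

state0 = {}
state1 = stateless_task(state0, ["a", "b", "a"])   # {'a': 2, 'b': 1}
state2 = stateless_task(state1, ["b", "c"])        # {'a': 2, 'b': 2, 'c': 1}

# Because tasks are deterministic and inputs immutable, re-running a task
# reproduces the lost state exactly:
assert stateless_task(state1, ["b", "c"]) == state2
```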
15. Batch Processing Systems
Batch processing systems like MapReduce divide
– Data into small partitions
– Jobs into small, deterministic, stateless map / reduce tasks
[Diagram: an immutable input dataset flows through stateless map tasks into immutable map outputs, then through stateless reduce tasks into an immutable output dataset]
16. Parallel Recovery
Failed tasks are re-executed on the other nodes in parallel
[Diagram: after a failure, the lost stateless map tasks and reduce tasks are rerun in parallel on the surviving nodes, regenerating the missing partitions of the immutable output dataset]
18. Discretized Stream Processing
Run a streaming computation as a series of small, deterministic batch jobs
Store intermediate state data in cluster memory
Try to make batch sizes as small as possible to get second-scale latencies
19. Discretized Stream Processing
[Diagram: in each interval (time = 0 - 1, time = 1 - 2, …), the input stream is stored as a replicated in-memory dataset; batch operations on it produce the output / state stream as non-replicated in-memory datasets]
20. Example: Counting page views
Discretized Stream (DStream) is a sequence of immutable, partitioned datasets
– Can be created from live data streams or by applying bulk, parallel transformations on other DStreams

views = readStream("http:...", "1 sec")    // creating a DStream
ones = views.map(ev => (ev.url, 1))        // transformation
counts = ones.runningReduce((x,y) => x+y)

[Diagram: for each interval (t: 0 - 1, t: 1 - 2, …), the views dataset is mapped into ones and reduced into counts, with each interval's counts adding to the previous interval's]
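A minimal sketch of what the three-line program computes, in plain Python rather than the actual DStream API (the batch contents and variable names here are illustrative):

```python
# Each batch interval yields an immutable "views" dataset; map produces
# (url, 1) pairs; runningReduce adds each batch's counts into the previous
# counts dataset, producing a sequence of immutable state datasets.

batches = [                       # one list of page-view URLs per interval
    ["a.com", "b.com", "a.com"],  # t: 0 - 1
    ["b.com", "a.com"],           # t: 1 - 2
]

counts = {}                       # the running "counts" state
history = []                      # the sequence of immutable counts datasets
for views in batches:
    ones = [(url, 1) for url in views]   # views.map(ev => (ev.url, 1))
    new_counts = dict(counts)            # runningReduce((x, y) => x + y)
    for url, one in ones:
        new_counts[url] = new_counts.get(url, 0) + one
    counts = new_counts
    history.append(counts)

# history[0] == {'a.com': 2, 'b.com': 1}
# history[1] == {'a.com': 3, 'b.com': 2}
```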
22. Parallel Fault Recovery
• Lineage is used to recompute partitions lost due to failures
• Datasets on different time steps recomputed in parallel
• Partitions within a dataset also recomputed in parallel
[Diagram: lost partitions of the views, ones and counts datasets across intervals t: 0 - 1 to t: 2 - 3, regenerated in parallel via the map and reduce lineage]
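The recovery idea can be sketched as follows, with a thread pool standing in for the cluster (the task, data, and partition names are illustrative, not the system's internals):

```python
# Hedged sketch of parallel recovery via lineage: each partition records the
# deterministic task and inputs that produced it, so lost partitions in
# different intervals (and within one dataset) can be recomputed independently.
from concurrent.futures import ThreadPoolExecutor

def count_partition(events):                 # the deterministic task (lineage)
    counts = {}
    for url in events:
        counts[url] = counts.get(url, 0) + 1
    return counts

# Immutable input partitions for two intervals, still available after failure.
inputs = {("t0", 0): ["a", "b"], ("t0", 1): ["a"], ("t1", 0): ["b", "b"]}

lost = [("t0", 1), ("t1", 0)]                # partitions lost when a node died
with ThreadPoolExecutor() as pool:           # recompute them all in parallel
    recovered = dict(zip(lost, pool.map(lambda p: count_partition(inputs[p]), lost)))

# recovered[("t0", 1)] == {"a": 1}; recovered[("t1", 0)] == {"b": 2}
```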
23. Comparison to Upstream Backup
• Upstream Backup: the stream is replayed serially to recreate the state
• Discretized Stream Processing: parallelism across time intervals and parallelism within a batch
Faster recovery than upstream backup, without the 2x cost of node replication
[Diagram: serial replay in upstream backup vs. parallel recomputation of views, ones and counts across intervals t: 0 - 1 to t: 2 - 3]
24. How much faster than Upstream Backup?
Recovery time = time taken to recompute and catch up
– Depends on available resources in the cluster
– Lower system load before failure allows faster recovery
[Graph: recovery time vs. system load; parallel recovery with 5 nodes is faster than upstream backup, and parallel recovery with 10 nodes is faster than with 5 nodes]
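As a rough illustration of why lower load and more nodes both speed up recovery, here is a back-of-the-envelope model (my simplification for intuition, not the paper's exact analysis):

```python
# Toy model: a cluster of `nodes` machines runs at fraction `load` of capacity
# before the failure. The lost work must be redone using the cluster's spare
# capacity, so catch-up time shrinks as load drops or the cluster grows.

def catchup_time(load, nodes, lost_work=1.0):
    spare = nodes * (1.0 - load)    # spare capacity, in node-seconds per second
    if spare <= 0:
        raise ValueError("no spare capacity: cannot catch up")
    return lost_work / spare

t5  = catchup_time(load=0.5, nodes=5)    # 0.4
t10 = catchup_time(load=0.5, nodes=10)   # 0.2: more nodes, faster recovery
assert t10 < t5
```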
25. Parallel Straggler Recovery
• Straggler mitigation techniques
– Detect slow tasks (e.g. 2x slower than other tasks)
– Speculatively launch more copies of the tasks in parallel on other machines
• Masks the impact of slow nodes on the progress of the system
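The detection rule above can be sketched like this (the 2x-median threshold, helper name, and task names are illustrative):

```python
# Flag tasks running much slower than the median of their peers, then launch
# speculative copies of them; the first copy to finish wins, which is safe
# because tasks are stateless and their outputs immutable.
import statistics

def find_stragglers(running_times, factor=2.0):
    median = statistics.median(running_times.values())
    return [t for t, elapsed in running_times.items() if elapsed > factor * median]

running = {"task-1": 0.9, "task-2": 1.1, "task-3": 4.5}   # seconds elapsed so far
assert find_stragglers(running) == ["task-3"]             # speculate a copy of task-3
```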
27. Spark Streaming
• Implemented using the Spark processing engine*
– Spark allows datasets to be stored in memory, and automatically recovers them using lineage
• Modifications required to reduce job launching overheads from seconds to milliseconds
[ *Resilient Distributed Datasets - NSDI, 2012 ]
28. How fast is Spark Streaming?
Can process 60M records/second on 100 nodes at 1 second latency
Tested with 100 4-core EC2 instances and 100 streams of text
[Graphs: cluster throughput (GB/s) vs. # nodes in cluster, at 1 sec and 2 sec latencies. Grep: count the sentences having a keyword. WordCount: word counts over a 30 sec sliding window]
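The "WordCount over 30 sec sliding window" workload can be sketched in plain Python (with 1-second batches the window covers the last 30 per-batch counts; the real system expresses this as a window transformation on a DStream, and the helper below is illustrative):

```python
# Keep per-batch word counts in a bounded deque; the windowed count is the
# sum of the counts still inside the window.
from collections import Counter, deque

window_len = 3                    # 3 batches here; 30 for a 30 s window of 1 s batches
window = deque(maxlen=window_len) # per-batch word counts inside the window

def windowed_count(batch_words):
    window.append(Counter(batch_words))   # oldest batch falls out automatically
    total = Counter()
    for batch in window:
        total += batch
    return total

windowed_count(["spark", "stream"])
windowed_count(["spark"])
# After a third batch, the first batch is still inside the window:
assert windowed_count(["stream"])["spark"] == 2
```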
29. How does it compare to others?
Throughput comparable to other commercial stream processing systems

System           Throughput per core [records / sec]
Spark Streaming  160k
Oracle CEP       125k
Esper            100k
StreamBase       30k
Storm            30k

[ Refer to the paper for citations ]
30. How fast can it recover from faults?
Recovery time improves with more frequent checkpointing and more nodes
[Graph: batch processing time (s) around a failure, for Word Count over a 30 sec window, comparing 30s and 10s checkpoint intervals on 20 and 40 nodes]
31. How fast can it recover from stragglers?
Speculative execution of slow tasks masks the effect of stragglers
[Graph: batch processing time (s) for WordCount / Grep. No straggler: 0.55 / 0.54 s; straggler, no speculation: 3.02 / 2.40 s; straggler, with speculation: 1.00 / 0.64 s]
33. Unification with Batch and Interactive Processing
• Discretized Streams create a single programming and execution model for running streaming, batch and interactive jobs
• Combine live data streams with historic data
liveCounts.join(historicCounts).map(...)
• Interactively query live streams
liveCounts.slice("21:00", "21:05").count()
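A plain-Python sketch of what the two one-liners do (the actual join and slice are DStream operations; the data and helper name below are illustrative):

```python
# Joining live counts with historic counts pairs each key's two values;
# "slicing" selects the batches whose timestamps fall in a time range.
live = {"a.com": 3, "b.com": 2}          # liveCounts for the current interval
historic = {"a.com": 100, "b.com": 50}   # historicCounts from stored data

joined = {k: (live[k], historic[k]) for k in live.keys() & historic.keys()}
# joined == {"a.com": (3, 100), "b.com": (2, 50)}

batches = {0: 5, 1: 7, 2: 4, 3: 9}       # per-interval counts, keyed by time
def slice_count(start, end):             # liveCounts.slice(start, end).count()
    return sum(1 for t in batches if start <= t <= end)
assert slice_count(1, 2) == 2
```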
34. App combining live + historic data
Mobile Millennium Project: Real-time estimation of traffic transit times using live and past GPS observations
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive
• Scales linearly with cluster size
[Graph: GPS observations processed per second vs. # nodes in cluster]
35. Recent Related Work
• Naiad – Full cluster rollback on recovery
• SEEP – Extends continuous operators to enable parallel recovery, but does not handle stragglers
• TimeStream – Recovery similar to upstream backup
• MillWheel – State stored in BigTable; transactions per state update can be expensive
36. Takeaways
• Discretized Streams model streaming computation as a series of batch jobs
– Uses simple techniques to exploit parallelism in streams
– Scales to 100 nodes with 1 second latency
– Recovers from failures and stragglers very fast
• Spark Streaming is open source - spark-project.org
– Used in production by ~10 organizations!
• Large scale streaming systems must handle faults and stragglers
Editor's Notes
There are many big data applications that need to process large streams of data and produce results in a near-real-time manner. For example, a website monitoring system may want to watch websites for load spikes. A fraud detection system may want to monitor bank transactions in real time to detect fraud. An ad agency may want to count clicks in real time.
The common property across a lot of such systems is that they require large clusters of tens to hundreds of nodes to handle the streaming load, and they require an end-to-end latency of a few seconds.
And an obvious problem with building large systems is that you have to deal with node failures
And not just failures: they must deal with stragglers as well. It is important for streaming systems to recover from failures and stragglers quickly and efficiently; in fact it is more important for streaming systems than for batch systems, because you don’t want your fraud detection system to be down for a long time. The problem is that traditional streaming systems don’t achieve both these properties together.
In this talk, I am going to first discuss the limitations of traditional streaming systems.
Then I am going to elaborate about discretized stream processing.
Finally, I am also going to talk about how this unifies stream processing with batch and interactive processing.
Traditional stream processing systems use the continuous operator model, where every node in the processing pipeline continuously runs an operator with in-memory mutable state. As each input record is received, the mutable state is updated and new records are sent out to downstream nodes. The problem with this model is that the mutable state is lost if the node fails. To deal with this, various techniques have been developed to make this state fault-tolerant. I am going to divide them into two broad classes and explain their limitations.
First, systems like Borealis and Flux have used Node Replication, where a separate set of nodes process the same stream along with the main nodes. These nodes act as hot failover nodes.
Synchronization protocols ensure that both sets of nodes process the records in exactly the same order.
When a node fails, the system immediately switches over to the failover nodes, and processing continues with little or no disruption.
This results in fast recovery, but it comes at the cost of double the hardware. At large scale that may not be desirable: you may not want to run 200 nodes to handle the processing of 100 nodes.
The second method is called upstream backup, where each node maintains a backup of all the records it has forwarded to the downstream node,
until the downstream node acknowledges that it has finished processing them, checkpointed the state, etc.
A cold failover node is maintained and when a node fails, the system replays the backed up records serially to the failover node to recreate the lost state.
While this only requires one standby node, it is slow to recover because of recomputation. So you can see that you either get fast recovery at high cost, or slow recovery for cheap.
Furthermore, let us consider what happens when there are stragglers. Suppose this node happily decides to slow down.
Then this causes the downstream nodes to slow down as well. And in the case of replication, since the replicas need to be in sync, they effectively slow down too.
So neither approach handles stragglers.
So we want to build a streaming system with the following ambitious goals.
Besides scaling hundreds of nodes with second-scale latencies, the system should tolerate faults and stragglers.
In particular, it should recover from faults and straggler within seconds, which is hard for upstream backup to do.
And it should use minimal extra hardware beyond normal processing, unlike node replication.
So we accepted the challenge and decided to understand why it is hard.
That is because the continuous operator model tightly integrates computation with mutable state.
This makes it harder to define clear boundaries in the computation and state such that they can be moved around.
Instead, what we propose is to dissociate the computation from the state. That is, make the state immutable, and break the computation into smaller, deterministic, stateless tasks.
So to do stateful operation, the system would take the previous state as input to the stateless task and generate the next state as output. And this continues with more tasks.
This defines clear boundaries where state and computation can be moved around independently.
Now this sounds very abstract, but the interesting thing is that existing systems already do this.
Systems like MapReduce divide the data into smaller partitions, and the job into small, deterministic, stateless map and reduce tasks. Let’s walk through a MapReduce job. You start from an immutable dataset already divided into partitions. For each partition, the system runs a set of stateless map tasks, which generate immutable map outputs. These are then shuffled together, and stateless reduce tasks generate the immutable output dataset.
And because these tasks are stateless, they can be moved around easily. For example, if a node fails, some map tasks and some reduce tasks are lost, and some partitions in the output are not generated. The system can automatically rerun the map tasks in parallel on the other nodes, and then the reduce tasks as well, finally generating the necessary partitions.
Let’s see how we use this parallel recovery property in our stream processing model, which we call Discretized Stream Processing.
Let’s see how to program against this model. The programming abstraction that we provide is called a DStream, short for Discretized Stream, which is a sequence of immutable, partitioned datasets. These DStreams can either be created from the live data stream, as shown in the program: readStream creates a DStream from the live data stream of page views. Or a DStream can be created by applying bulk, parallel transformations on other DStreams, as shown by the map and runningReduce. This program maintains a running count of the page views. The underlying datasets look like this: as you can see, the counts (the blue datasets) are the state datasets that add up the counts across batches.
The system keeps track of the fine-grained, partition-level lineage. Now, we don’t want the lineage to grow too long, as that would cause a lot of recomputation, so we periodically checkpoint some state RDDs to stable storage and cut off the lineage. Note that these datasets are immutable, so checkpointing can be done asynchronously without blocking the computation.
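The asynchronous checkpointing described here can be sketched as follows (illustrative plain Python, not Spark's actual checkpointing mechanism):

```python
# Because the state dataset is immutable, a background thread can write the
# checkpoint while the next batch proceeds; once the checkpoint is durable,
# the lineage before it can be discarded.
import json
import os
import tempfile
import threading

def checkpoint_async(state, path):
    snapshot = dict(state)                   # immutable snapshot, safe to copy
    def write():
        with open(path, "w") as f:
            json.dump(snapshot, f)           # blocking I/O off the critical path
    t = threading.Thread(target=write)
    t.start()
    return t                                 # computation continues meanwhile

state = {"a.com": 3}
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
t = checkpoint_async(state, path)
t.join()                                     # only for this demo
assert json.load(open(path)) == state
```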
This lineage is used to recover lost data partitions in parallel. For example, suppose some nodes die and a few partitions here and there get lost. Since these datasets are in different time intervals and don’t depend on each other, they will be recomputed in parallel by rerunning the map-reduce tasks that generated them. Similarly, partitions within the same dataset will also be recomputed in parallel. Now compare this to upstream backup.
Upstream backup replays the whole stream serially to one node to recreate the state that was lost. In our model, however, the lineage exposes everything that can be recomputed in parallel. Specifically, it exposes parallelism within a dataset as well as across time, that is, across datasets in different time intervals. This ensures faster recovery than upstream backup without incurring the double cost of node replication.
How much faster can we be than upstream backup? For this we did a theoretical analysis. We define the recovery time as the time taken to recompute the lost data and catch up with the live stream. This naturally depends on the available resources in the cluster, which in turn depend on the system load prior to the failure. The graph shows recovery time with respect to different system loads, and naturally lower system load leads to faster recovery. Note that parallel recovery with 5 nodes is significantly faster than upstream backup, and parallel recovery with 10 nodes is faster than with 5 nodes. This is because with 10 nodes there is more parallelism available for fault recovery. This is another very nice property of this model: larger clusters allow you to recover faster.
The same principles can be used to recover from stragglers as well. Straggler mitigation techniques in batch processing typically do the following. First, they detect slow or straggling tasks by some criterion, for example, tasks running more than 2 times slower than other tasks in the same stage. These tasks are then speculatively re-executed by running more copies of them in parallel on other machines; whichever finishes first wins. Since these tasks are stateless and their outputs immutable, running multiple copies of them and letting them race to the finish is semantically equivalent. This essentially masks the impact of slow nodes on the progress of the system.
To evaluate this model, we built Spark Streaming. It is built using the Spark processing engine. For those who are not familiar with Spark, it is a very fast batch processing engine that allows datasets to be stored in memory and automatically recovers them using lineage. We had to make significant modifications to Spark to reduce the job launching overhead from seconds to milliseconds in order to achieve sub-second end-to-end latencies.
Spark Streaming can process 60 million records per second on 100 nodes with an end-to-end latency of 1 second. This was tested using 100 4-core EC2 instances and 100 streams of text. The workload was Grep, in which we count the number of sentences containing a keyword. We also tried a stateful workload, WordCount over a 30-second sliding window, and the throughput was slightly lower because of the increased computation, but still in the ballpark of tens of millions of records per second. Note that in both cases the throughput of the cluster scales linearly with the size of the cluster. Furthermore, if you allow a 2-second latency, the throughput goes up a bit more.
If you compare with other commercial streaming systems, Spark Streaming has comparable per-core performance to numbers reported by Oracle CEP, Esper and StreamBase. Note that these 3 are single node systems, not distributed.
We tested how fast we can recover from faults. This graph shows the time taken to process batches before and after a failure. Obviously, right after the failure the
We also tested how we perform with stragglers by slowing down a node in the system. The graph shows that without the straggler, the processing time for both workloads was about half a second. With a straggler, it increased to 2 to 3 seconds. But with speculative execution, it came back to levels comparable to the earlier half second. This shows that speculative execution can mask the effect of stragglers. No other system, to our knowledge, has shown this kind of evaluation.
Naiad, the paper you are going to hear about next, takes a completely different approach to unifying different processing models without sacrificing latency, which is great. But in terms of fault tolerance, the state is synchronously checkpointed, compared to our asynchronous checkpointing. And in case of failure, all the nodes in the cluster have to roll back to their previous checkpoints, which can be very costly. SEEP attempts to extend the continuous operators to enable parallel recovery of their state, but it requires invasive rewriting of the operators. In our case we were simply able