Trident: an abstraction on top of Storm. Besides providing higher-level constructs "a la Cascading", it batches groups of tuples to 1) make reasoning about processing easier and 2) enable
efficient data persistence, with an API that can even provide exactly-once semantics in some cases.
Heron: built since 2014, paper in 2015, open-sourced in May 2016. http://twitter.github.io/heron/. API-compatible with Apache Storm, hence no code changes needed.
"One of our primary requirements for Heron was ease of debugging and profiling"; other goals were better scheduling and optimal resource utilization (IPC layer, simplification).
Flink: based on distributed checkpointing - "Lightweight Asynchronous Snapshots for Distributed Dataflows" (ABS: Asynchronous Barrier Snapshotting), http://arxiv.org/abs/1506.08603
A variation of the Chandy-Lamport algorithm (1985): it periodically draws state snapshots of a running stream topology and stores these snapshots to durable storage.
Similar to the micro-batching approach, in which all computations between two checkpoints either succeed or fail atomically as a whole. However, the similarities stop there: one great feature of Chandy-Lamport is that we never have to press the "pause" button on stream processing to schedule the next micro-batch. Instead, regular data processing always keeps going, processing events as they come, while checkpoints happen in the background.
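The background-snapshot idea can be sketched with a toy single-operator, single-channel simulation (all names are mine, not Flink's API; with one input channel there is no barrier alignment to worry about):

```python
# Toy sketch of barrier snapshotting: a checkpoint marker (barrier) flows
# through the stream between normal records; when the operator sees it, it
# snapshots its state and keeps processing -- records are never paused.
BARRIER = object()  # checkpoint marker injected by the source

def run(stream):
    state = {"count": 0}   # operator state: running count of records
    snapshots = []         # stands in for durable snapshot storage
    for item in stream:
        if item is BARRIER:
            snapshots.append(dict(state))  # snapshot, then keep going
        else:
            state["count"] += 1            # the "computation": count records
    return state, snapshots

state, snaps = run([1, 2, BARRIER, 3, BARRIER, 4])
print(state["count"], [s["count"] for s in snaps])  # 4 [2, 3]
```

On failure, the operator would be restored from the latest snapshot and the source replayed from the matching position, which is what makes the checkpoint a consistent cut.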
If a failure is detected, Storm re-does the work from the beginning (Storm doesn't checkpoint) - usually fast, at the ms level.
Spark can re-do from the most recent checkpoint (with a performance impact).
Task failed: restarted by the supervisor daemon
Supervisor/worker node failed: detected via ZK; restart/re-schedule
Master failed: detected via ZK; can't submit new tasks,
but existing tasks should be OK
Redo (re-compute): no log/replica, for high-performance or real-time processing.
It doesn't care which component failed. Once a failure is detected via a timeout (30 sec):
The app should not commit the message to the data source (e.g., Kafka), so Kafka never removes that data;
the app can then re-send the message and re-run the topology.
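A minimal sketch of that replay contract (hypothetical names, not the Kafka client API): the offset is committed only after processing succeeds, so an uncommitted message can simply be re-read and the topology re-run.

```python
# Sketch: commit the source offset only after the message fully processes;
# on failure nothing is committed, so pending messages are replayed.
class Source:
    def __init__(self, messages):
        self.messages = messages
        self.committed = 0            # offset durably committed to the broker

    def pending(self):
        return self.messages[self.committed:]

    def commit(self):
        self.committed += 1           # broker may now discard this message

def run_topology(src, flaky):
    out = []
    for msg in list(src.pending()):
        if flaky and msg == "m2":
            return out                # simulated partial failure: no commit
        out.append(msg.upper())
        src.commit()                  # processing done -> safe to commit
    return out

src = Source(["m1", "m2", "m3"])
first = run_topology(src, flaky=True)    # fails at m2; only m1 committed
replay = run_topology(src, flaky=False)  # resumes from last committed offset
print(first, replay)  # ['M1'] ['M2', 'M3']
```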
Random IDs
Every bolt must send an ack message
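Storm's acker combines these random IDs with XOR bookkeeping: each ack message XORs in the acked tuple's ID together with any newly anchored tuple IDs, and the tuple tree is complete when the running value returns to 0 (every ID appears exactly twice). A toy sketch (names are mine):

```python
import random

random.seed(7)  # deterministic ids for this sketch

# Toy version of Storm's acker bookkeeping: per root tuple, keep a running
# XOR of every id seen; fully processed <=> running XOR is back to 0.
class Acker:
    def __init__(self):
        self.val = {}  # root tuple id -> running XOR

    def track(self, root, xor_of_ids):
        self.val[root] = self.val.get(root, 0) ^ xor_of_ids

    def done(self, root):
        return self.val.get(root, 0) == 0

acker = Acker()
root = random.getrandbits(64)
acker.track(root, root)          # spout registers the root tuple
child = random.getrandbits(64)
acker.track(root, root ^ child)  # bolt acks root and anchors child (one msg)
acker.track(root, child)         # leaf bolt acks child
print(acker.done(root))          # True: every id appeared exactly twice
```

The nice property is that the acker needs only a fixed 64 bits per root tuple, no matter how large the downstream tuple tree grows.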
Another benchmark is IBM's - IBM InfoSphere vs. Storm: https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf
In practice, there's a challenge in implementing such an approach. Ideally, one would:
need to know how many downstream messages will be generated,
then allocate enough random IDs and calculate the FP,
and for each downstream message, embed the FP and emit it downstream.
However, for much (maybe not all) processing logic, the total downstream message count isn't known beforehand (in step 1) until the logic executes.
For example, the initial share is 100 @ Acker. Embed the share into the message and pass it down to the downstream:
A source message (root message) is ingested at the root node (spout), which initializes the BIG SHARE as the initial status and embeds the SHARE as part of the metadata.
Run the topology; each node executes its pre-defined logic while also extracting the share and splitting it across the downstream output messages.
Finally, the leaf nodes extract and report the received shares to the Acker.
The Acker decreases the share: 100 - 16 - 84 = 0; 0 means OK.
We may pre-define some rule about increases, e.g., always increase by 7B, so the Acker could use one bit to indicate one increase.
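The 100 - 16 - 84 = 0 bookkeeping above can be sketched as (names are mine):

```python
# Toy sketch of the share-based Acker: start from the root message's initial
# share and subtract every share reported by a leaf; exactly 0 means the
# whole message tree completed.
class ShareAcker:
    def __init__(self, initial_share):
        self.remaining = initial_share

    def report(self, leaf_share):
        self.remaining -= leaf_share

    def done(self):
        return self.remaining == 0

acker = ShareAcker(100)  # init share is 100 @ Acker
acker.report(16)         # one leaf received share 16
print(acker.done())      # False: part of the tree is still in flight
acker.report(84)         # the other leaf received share 84
print(acker.done())      # True: 100 - 16 - 84 = 0
```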
This is similar to, but different from, Huang's algorithm. Both use a number as a weight or share and involve a split operation, but to me the problem area, prerequisites, and algorithm steps are quite different. Huang's target is more related to process (task/bolt) state, while my target is the continuous flow of messages running through tasks. A few bullets in my mind, feel free to comment:
Problem area: in Huang's context, the distributed task consists of different processes, each either active (which may go idle at any time) or idle (idle-to-active is only triggered by some message). Huang's goal is to detect when *all processes* in the system become idle. Our goal is to track the status of each message running through those tasks, usually in relation to partial failure (but we don't care which task failed or became unavailable).
Prerequisite: importantly, the idle state (Huang's monitored state) is *explicitly known* by the process itself; with that, his step is "upon becoming idle, a process sends a message...". In our case, a message failure/exception is hard to know by itself, typically due to network partition, timeout, etc., so it must be detected by other components or a specially designed state, which adds extra challenges.
Into the algorithm: the steps are different; our method keeps splitting the number as it flows through the DAG, then the Acker essentially redoes the split op based on the received shares and makes sure the result is 0.
In general, Huang's research target is processes (tens or hundreds), rather than continuously flowing messages (billions, never stopping). In practice, distributed process states today are managed by ZooKeeper (or Raft etc.), based on the Paxos algorithm published in 1990 but widely understood and adopted only after 2001 (when Lamport's second paper explained Paxos, and Google validated it).
A few important points in implementation:
1. Re-use the existing anchors-to-ids map to embed the share on emit (so no extra traffic); previously it was [RootId -> tupleId], now it's [RootId -> shareAssigned].
2. To split the pass-down share, we need to know beforehand how many downstream output messages will be generated (but that's usually hard to predict).
To resolve that, we work out a 1-step deferred processing:
1) Statically split the input share into sub-shares.
2) Assign and embed the sub-share, preparing to emit.
3) Internally queue the current output message and send the previous one.
4) Emit the last message with a new API, so the last output message takes over all the remaining share.
With the above implementation, we introduce a little delay, but it's acceptable.
3. How to split the share is also important. Right now it's simply a pre-defined split method, i.e., every bolt uses a pre-defined split count (could be 1 ~ 4096 or larger);
in the future, it should be configurable per bolt by the developer (who presumably knows the topology better), e.g., bolt1 may split up to 128 while bolt2 may split up to 256.
An improper split may create pressure to run out of IDs, requiring a share increase - this still depends on topology size.
Test with various topologies, such as top-down, bolts with multiple incoming streams, multiple spouts, ...