The talk explains how Apache Flink checkpoints stateful jobs using the asynchronous barrier snapshotting algorithm to provide exactly-once semantics in stream processing. It also describes Flink's approach to master high availability (HA), which removes the JobManager as a single point of failure. Together, job checkpointing and HA form the basis of Flink's fault tolerance mechanism for recovering from failures.
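The barrier alignment step behind exactly-once snapshots can be sketched in plain Java. This is a minimal toy, not Flink code: the class `AligningSumOperator` and its methods are hypothetical names, the operator state is just a running sum, only one checkpoint is in flight at a time, and the snapshot is copied synchronously (Flink writes state asynchronously and also forwards the barrier downstream, both of which are omitted here).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Toy operator (hypothetical, not a Flink class) with several input channels and a running sum as state. */
class AligningSumOperator {
    private final int numChannels;
    private final Set<Integer> blockedChannels = new HashSet<>(); // channels whose barrier already arrived
    private final Queue<Integer> heldBack = new ArrayDeque<>();   // post-barrier records awaiting alignment
    private final List<Long> snapshots = new ArrayList<>();       // one state copy per completed checkpoint
    private long sum = 0;

    AligningSumOperator(int numChannels) {
        this.numChannels = numChannels;
    }

    /** A data record arrives on the given channel. */
    void onRecord(int channel, int value) {
        if (blockedChannels.contains(channel)) {
            heldBack.add(value); // arrived after this channel's barrier, so do not apply it yet
        } else {
            sum += value;
        }
    }

    /** A checkpoint barrier arrives on the given channel (assumes one checkpoint in flight). */
    void onBarrier(int channel) {
        blockedChannels.add(channel);
        if (blockedChannels.size() == numChannels) {
            snapshots.add(sum); // state reflects exactly the pre-barrier records
            blockedChannels.clear();
            while (!heldBack.isEmpty()) {
                sum += heldBack.poll(); // resume with the records that were held back
            }
        }
    }

    long currentSum() {
        return sum;
    }

    List<Long> snapshots() {
        return snapshots;
    }

    public static void main(String[] args) {
        AligningSumOperator op = new AligningSumOperator(2);
        op.onRecord(0, 1);
        op.onRecord(1, 3);
        op.onRecord(0, 2);
        op.onBarrier(0);    // channel 0 is now blocked
        op.onRecord(0, 10); // held back, not part of the snapshot
        op.onBarrier(1);    // alignment complete: snapshot taken, held-back records applied
        op.onRecord(1, 20);
        System.out.println("snapshot = " + op.snapshots().get(0)); // prints "snapshot = 6"
        System.out.println("sum = " + op.currentSum());            // prints "sum = 36"
    }
}
```

Without the hold-back step, the record `10` would leak into the snapshot even though it arrived after the barrier on its channel, and it would be counted twice once the source replays it after a restore.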
3. Better be safe than sorry
§ Failures will happen
§ EMC estimated $1.7 billion in costs due to data loss and system downtime
§ Recovery will save you time and costs
§ Switch between algorithms
§ Live upgrade of your system
5. Fault tolerance guarantees
§ At most once
• No guarantees at all
§ At least once
• Sufficient for many applications
§ Exactly once
§ Flink provides all guarantees
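The difference between the three guarantees can be illustrated with a toy counter that fails once and recovers. This is a simplified model rather than Flink's actual mechanism: the class `DeliveryGuarantees`, the method `run`, and the two flags are hypothetical, and the model assumes a replayable source, a single checkpoint, and a single failure that happens at or after that checkpoint.

```java
/** Toy model (hypothetical, not Flink code) of a record counter that fails once and then recovers. */
class DeliveryGuarantees {

    /**
     * Counts `total` records. A checkpoint is taken after `checkpointAt` records and a
     * failure happens after `failAt` records (assumes checkpointAt <= failAt <= total).
     *
     * @param restoreState whether recovery rolls the counter back to the checkpointed value
     * @param rewindSource whether recovery replays the source from the checkpointed offset
     */
    static int run(int total, int checkpointAt, int failAt, boolean restoreState, boolean rewindSource) {
        int count = 0;             // operator state
        int checkpointedCount = 0; // state stored in the checkpoint
        int offset = 0;            // read position in the source
        boolean failed = false;
        while (offset < total) {
            count++;
            offset++;
            if (offset == checkpointAt) {
                checkpointedCount = count;
            }
            if (!failed && offset == failAt) {
                failed = true; // simulate a single failure followed by recovery
                if (restoreState) {
                    count = checkpointedCount;
                }
                if (rewindSource) {
                    offset = checkpointAt;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 10 records, checkpoint after record 4, failure after record 7.
        System.out.println("at most once:  " + run(10, 4, 7, true, false)); // 7: records 5-7 are lost
        System.out.println("at least once: " + run(10, 4, 7, false, true)); // 13: records 5-7 count twice
        System.out.println("exactly once:  " + run(10, 4, 7, true, true));  // 10: every record counts once
    }
}
```

Only the combination of restoring state and replaying the source from the same consistent point yields the correct count of 10.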
14. Advantages
§ Separation of app logic from recovery
• Checkpointing interval is just a config parameter
§ High throughput
• Controllable checkpointing overhead
§ Low impact on latency
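Why the checkpointing interval controls overhead can be shown with a back-of-the-envelope model. The sketch below is an assumption-laden toy, not a measurement of Flink: the class `CheckpointIntervalTradeoff` and its methods are hypothetical, and checkpoints are assumed to complete instantly at exact multiples of the interval.

```java
/** Toy model (hypothetical) of how the checkpoint interval trades overhead against replay work. */
class CheckpointIntervalTradeoff {

    /** Number of checkpoints completed before the failure, assuming one every intervalMs. */
    static long checkpointsTaken(long failureTimeMs, long intervalMs) {
        return failureTimeMs / intervalMs;
    }

    /** Milliseconds of input that must be reprocessed after restoring the latest checkpoint. */
    static long replayWindowMs(long failureTimeMs, long intervalMs) {
        return failureTimeMs % intervalMs;
    }

    public static void main(String[] args) {
        long failureTimeMs = 130_500; // hypothetical failure 130.5 s after job start
        long[] intervalsMs = {1_000, 10_000, 60_000};
        for (long interval : intervalsMs) {
            System.out.println("interval=" + interval + " ms"
                    + ", checkpoints=" + checkpointsTaken(failureTimeMs, interval)
                    + ", replay=" + replayWindowMs(failureTimeMs, interval) + " ms");
        }
    }
}
```

A shorter interval means more checkpoints (more overhead) but less input to replay after a failure; a longer interval means the opposite. The application logic is unchanged in either case.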
31. TL;DL
§ Job recovery mechanism with low latency and high throughput
§ Exactly-once processing semantics
§ No single point of failure
→ Flink will always keep processing your data