Kostas Kloudas presented on stateful stream processing with Apache Flink. He discussed how Flink handles state management, fault tolerance, and time semantics to allow for continuous and accurate processing of streaming data. Flink embeds local state with keyed streams, takes consistent snapshots of distributed state, and uses watermarks to process events in event time to produce correct results even for out-of-order data. This allows Flink to provide a robust stream processing engine that scales to large deployments.
8. The ol’ traditional batch way
▪ So, for a simple counting program:
• Custom logic for handling state
• Custom logic for handling time
• Custom logic for fault tolerance
Difficult and has nothing to do with your program.
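To make the bullets above concrete, here is a minimal Python sketch of a hand-rolled counting job. The one-minute window size, the `(key, timestamp_ms)` event shape, and all function names are assumptions made for illustration; none of them come from the talk. Only one line does the counting. The rest is state and time bookkeeping, and fault tolerance is not even attempted.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # hypothetical one-minute window size


def window_start(timestamp_ms):
    """Custom time logic: map an event timestamp to the start of its window."""
    return timestamp_ms - (timestamp_ms % WINDOW_MS)


def count_events(events):
    """Count events per (key, window start).

    `events` is an iterable of (key, timestamp_ms) pairs.
    """
    counts = defaultdict(int)  # custom state logic: we own and manage this table
    for key, timestamp_ms in events:
        counts[(key, window_start(timestamp_ms))] += 1  # the actual "program"
    return dict(counts)


# Events may arrive in any order; the window is derived from the timestamp.
events = [("a", 1_000), ("b", 2_000), ("a", 61_000), ("a", 59_999)]
result = count_events(events)
```

If the process crashes halfway through, `counts` is lost and the job has to start over, which is the third bullet.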
10. Why should we care?
▪...this is just for continuous data, right?
Most datasets are continuously arriving streams.
16. A practical stream processor
state
● Fault-tolerance
● Scalability
● Efficiency
time
● Event-time (out-of-order events)
● Allows you to work in event-time (e.g. timers)
17. A practical stream processor
Stateful Stream Processor that handles Large Distributed State and Time / Order / Completeness consistently, robustly, and efficiently
● Stateful stream processing as a new paradigm to continuously process continuously arriving data
● Produce accurate results
● Real-time is only a natural consequence of the model
18. This is where Flink shines...
▪ Supports out-of-order streams
▪ Manages state transparently
• exactly-once processing
▪ Offers high throughput and low latency
▪ Scales to large deployments
• https://data-artisans.com/blog/blink-flink-alibaba-search
• https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king
23. Event Time: Watermarks
● Special markers, called Watermarks
● Flow with elements
● A watermark of timestamp t means that no records with timestamp < t should be expected
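The watermark rule above can be sketched in a few lines of plain Python. This is not Flink code: the window size, the out-of-orderness bound, and the function name are hypothetical. The sketch derives the watermark as the highest timestamp seen so far minus a fixed bound, which is one common heuristic, and it follows the slide's convention that a watermark t rules out timestamps < t (Flink's own implementation treats the bound as inclusive). A window is emitted once the watermark passes its end, and anything arriving for an already emitted window is reported as late.

```python
from collections import defaultdict

WINDOW_MS = 10            # hypothetical tumbling-window size
MAX_OUT_OF_ORDERNESS = 3  # hypothetical bound on how far events may lag behind


def count_per_window(timestamps):
    """Count events per event-time window, firing windows on watermark progress.

    `timestamps` lists event timestamps in arrival order (possibly out of order).
    Returns (fired windows in firing order, late events, still-open windows).
    """
    open_windows = defaultdict(int)  # window start -> count so far
    fired, late = [], []
    max_ts = float("-inf")
    watermark = float("-inf")

    for ts in timestamps:
        start = ts - ts % WINDOW_MS
        if start + WINDOW_MS <= watermark:
            late.append(ts)  # its window [start, start + WINDOW_MS) already fired
        else:
            open_windows[start] += 1

        # Advance the watermark: no records with timestamp < watermark expected.
        max_ts = max(max_ts, ts)
        watermark = max_ts - MAX_OUT_OF_ORDERNESS

        # Fire every window that lies entirely below the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_MS <= watermark:
                fired.append((s, open_windows.pop(s)))

    return fired, late, dict(open_windows)


fired, late, still_open = count_per_window([1, 4, 2, 12, 9, 15, 3, 25])
```

In this run the event with timestamp 9 arrives after 12 but is still counted in the first window, because the watermark has not yet passed 10. The event with timestamp 3 arrives after that window has fired and ends up in `late`.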
27. Fault tolerance simple case
event log
single process
main memory
periodically take a snapshot of the memory
28. Fault tolerance simple case
event log: persists events (temporarily)
single process
main memory
Recovery: restore snapshot and replay events since snapshot
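The single-process case can be sketched as follows. This is an illustrative Python model, not Flink's checkpointing code: the class name, the list standing in for the event log, and the idea of storing the log offset inside the snapshot are assumptions of the sketch. The point it shows is that restoring the snapshot and replaying the log from the recorded offset gives the same state as a run that never failed.

```python
import copy


class CountingProcess:
    """A single process that counts keys read from a replayable event log."""

    def __init__(self):
        self.counts = {}  # the "main memory" state
        self.offset = 0   # index of the next log entry to read

    def process(self, log, upto):
        """Consume log entries from the current offset up to (not including) `upto`."""
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def snapshot(self):
        """Copy the state together with the log position it corresponds to."""
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    @classmethod
    def restore(cls, snap):
        """Rebuild a process from a snapshot."""
        process = cls()
        process.counts = copy.deepcopy(snap["counts"])
        process.offset = snap["offset"]
        return process


log = ["a", "b", "a", "c", "a", "b"]  # the event log persists events

process = CountingProcess()
process.process(log, 3)
snap = process.snapshot()   # periodic snapshot after three events
process.process(log, 5)     # more progress, then the process "crashes"

recovered = CountingProcess.restore(snap)
recovered.process(log, len(log))  # replay events since the snapshot

reference = CountingProcess()
reference.process(log, len(log))  # a run with no failure, for comparison
```

Because the offset is stored with the counts, each event is reflected in the recovered state exactly once, even though events 4 and 5 were processed twice in wall-clock terms.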
29. Fault tolerance distributed
▪ How to create consistent snapshots of
distributed state?
▪ How to do it efficiently?
41. Apache Flink Stack
Libraries
DataStream API (Stream Processing)
DataSet API (Batch Processing)
Runtime (Distributed Streaming Data Flow)
Streaming and batch as first class citizens.
42. Levels of abstraction
Process Function (events, state, time): low-level (stateful stream processing)
DataStream API (streams, windows): stream processing & analytics
Table API (dynamic tables): declarative DSL
Stream SQL: high-level language
48. TL;DR
▪ Stateful stream processing as a paradigm for continuous data processing
▪ Flink is a sophisticated and tested stateful stream processor
▪ Efficiency, management, and operational issues for state are taken very seriously
50. Stream Processing and Apache Flink®'s approach to it
@StephanEwen
Apache Flink PMC
CTO @ data Artisans
FLINK FORWARD IS COMING BACK TO BERLIN
SEPTEMBER 11-13, 2017
BERLIN.FLINK-FORWARD.ORG