Stream processing is emerging as a popular paradigm for data processing architectures because it matches the continuous nature of most data and computation and removes artificial boundaries and delays.
The fact that stream processing is gaining rapid adoption is also due to more powerful and maturing technology (much of it open source at the ASF) that has solved many of the hard technical challenges.
We discuss Apache Flink's approach to high-performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also give a sneak preview of the next steps for Flink.
2. About me
Database systems, TU Berlin, IBM, Microsoft
Co-bootstrapped Stratosphere project's runtime
Apache Flink created from a (partial) Stratosphere fork
Apache Flink community founded data Artisans
Now Flink PMC and CTO at data Artisans
3. Streaming technology is enabling the obvious:
continuous processing on data that is continuously produced
Hint: you already have streaming data
4. Streaming Subsumes Batch
[Figure: two partitions of a stream laid out along a timeline with hourly marks, from 2016-3-1 12:00 am through 2016-3-12 3:00 am and onward]
8. Time Travel
[Figure: two partitions of the log]
Process a period of historic data
Process latest data with low latency (tail of the log)
Reprocess stream (historic data first, catches up with realtime data)
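The three access patterns on this slide can be sketched against a toy log. This is a minimal illustration in plain Scala, not Flink or any log system's API: `Record`, `Partition`, and the methods `fromBeginning`, `between`, and `tailFrom` are hypothetical names, and the partition is an in-memory vector.

```scala
object LogReplay {
  // One record in a log partition: its offset, a timestamp in milliseconds, and a payload.
  final case class Record(offset: Long, timestampMs: Long, payload: String)

  // Hypothetical in-memory stand-in for one log partition, ordered by offset.
  final case class Partition(records: Vector[Record]) {
    // Reprocess the stream: start at the first retained record and read to the end.
    def fromBeginning: Iterator[Record] = records.iterator

    // Process a period of historic data: records with fromMs <= timestamp < untilMs.
    def between(fromMs: Long, untilMs: Long): Iterator[Record] =
      records.iterator.filter(r => r.timestampMs >= fromMs && r.timestampMs < untilMs)

    // Process the latest data: only records at or after the given offset (the tail).
    def tailFrom(offset: Long): Iterator[Record] =
      records.iterator.dropWhile(_.offset < offset)
  }

  val hourMs: Long = 60L * 60L * 1000L

  // Five hourly records; timestamps are relative to an arbitrary start, for illustration only.
  val partition: Partition = Partition(
    Vector.tabulate(5)(i => Record(i.toLong, i * hourMs, s"event-$i"))
  )
}
```

The same stored data serves all three cases; only the starting position and the stopping condition differ.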
12. Flink's Approach
Stateful Stream Processing (Building Block)
Fluent API, Windows, Event Time (Core API)
Table API (Declarative DSL)
Stream SQL (High-level Language)
17. Versioning the state of applications
[Figure: savepoints taken at several points along a time axis, with App. A, App. B, and App. C shown against them]
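The idea of versioning application state can be sketched as follows, assuming a savepoint is simply an input position paired with the state at that position. This is plain Scala, not Flink's savepoint mechanism: `Savepoint`, `run`, and the word-count job are hypothetical stand-ins.

```scala
object SavepointSketch {
  // Hypothetical savepoint: the input position plus the job state at that position.
  final case class Savepoint(offset: Int, counts: Map[String, Long])

  // A tiny stateful job: counts words from `input`, resuming at `from`
  // and reading at most `n` records. Returns the savepoint after those records.
  def run(input: Vector[String], from: Savepoint, n: Int): Savepoint = {
    val slice = input.slice(from.offset, from.offset + n)
    val counts = slice.foldLeft(from.counts) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0L) + 1L)
    }
    Savepoint(from.offset + slice.size, counts)
  }

  // Starting point for a job with no prior state.
  val empty: Savepoint = Savepoint(0, Map.empty)
}
```

Because the savepoint carries both position and state, a second run (for example a new version of the application) that resumes from it ends in the same state as one uninterrupted run over the whole input.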
18. Flink's Approach
Stateful Stream Processing (Building Block)
Fluent API, Windows, Event Time (Core API)
Table API (Declarative DSL)
Stream SQL (High-level Language)
19. Event Time / Out-of-Order
Processing Time: 1977 (Episode IV), 1980 (Episode V), 1983 (Episode VI), 1999 (Episode I), 2002 (Episode II), 2005 (Episode III), 2015 (Episode VII)
Event Time: Episode I, II, III, IV, V, VI, VII
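The difference between the two orderings can be made concrete with the episodes from this slide. This is a minimal sketch in plain Scala; `Release` and its fields are hypothetical names, and sorting a finite list stands in for what a stream processor would do incrementally on unbounded input.

```scala
object EventTimeDemo {
  // An event carries its own timestamp (here the episode number);
  // the year in which it arrived plays the role of processing time.
  final case class Release(processingYear: Int, episode: Int)

  // Arrival order as on the slide: release years paired with episode numbers.
  val arrivals: List[Release] = List(
    Release(1977, 4), Release(1980, 5), Release(1983, 6),
    Release(1999, 1), Release(2002, 2), Release(2005, 3),
    Release(2015, 7)
  )

  // Processing-time order: simply the order in which the events arrived.
  val byProcessingTime: List[Int] = arrivals.map(_.episode)

  // Event-time order: reorder by the timestamp embedded in each event.
  val byEventTime: List[Int] = arrivals.sortBy(_.episode).map(_.episode)
}
```

Here `byProcessingTime` follows the release years, while `byEventTime` follows the story order.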
20. (Stream) SQL & Table API
Table API

    // convert stream into Table
    val sensorTable: Table = sensorData
      .toTable(tableEnv, 'location, 'time, 'tempF)

    // define query on Table
    val avgTempCTable: Table = sensorTable
      .groupBy('location)
      .window(Tumble over 1.days on 'rowtime as 'w)
      .select('w.start as 'day, 'location,
        (('tempF.avg - 32) * 0.556) as 'avgTempC)
      .where('location like "room%")

SQL

    sensorTable.sql("""
      SELECT day, location,
        avg((tempF - 32) * 0.556) AS avgTempC
      FROM sensorData
      WHERE location LIKE 'room%'
      GROUP BY day, location
    """)
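To make clear what both queries compute, here is the same aggregation over a bounded collection in plain Scala, with no Flink dependency. It is a sketch under assumptions: `Reading`, `dayStart`, and `avgTempC` are hypothetical names, timestamps are non-negative epoch milliseconds, and the one-day tumbling window is derived by truncating the timestamp.

```scala
object AvgTempSketch {
  // Hypothetical sensor reading: location, event timestamp (epoch millis), temperature in Fahrenheit.
  final case class Reading(location: String, timeMs: Long, tempF: Double)

  val dayMs: Long = 24L * 60L * 60L * 1000L

  // Start of the one-day tumbling window containing the timestamp (assumes timeMs >= 0).
  def dayStart(timeMs: Long): Long = timeMs - (timeMs % dayMs)

  // Average temperature in Celsius per (day, location), for locations starting with "room".
  // The factor 0.556 is the slide's approximation of 5/9.
  def avgTempC(readings: Seq[Reading]): Map[(Long, String), Double] =
    readings
      .filter(_.location.startsWith("room"))
      .groupBy(r => (dayStart(r.timeMs), r.location))
      .map { case (key, rs) =>
        val avgF = rs.map(_.tempF).sum / rs.size
        (key, (avgF - 32) * 0.556)
      }
}
```

On a stream, the same result is produced window by window as each day closes, rather than once over the complete input.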
21. What can you do with that?
10 billion events (2TB) processed daily across multiple Flink jobs for the telco network control center.
Ad-hoc realtime queries, > 30 operators, processing 30 billion events daily, maintaining state of 100s of GB inside Flink with exactly-once guarantees
Jobs with > 20 operators, running on > 5000 vCores in a 1000-node cluster, processing millions of events per second
22. Flink's Streams playing at Batch
TeraSort
Relational Join
Classic Batch Jobs
Graph Processing
Linear Algebra
26. Full SQL on Streams
Continuous queries, incremental results
Windows, event time, processing time
Consistent with SQL on bounded data
https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU
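A continuous query with incremental results can be sketched as a running aggregate that emits an updated row for every input event. This is plain Scala over a finite sequence, not Flink's SQL implementation; `incrementalCounts` and `finalCounts` are hypothetical names chosen for the illustration.

```scala
object ContinuousCount {
  // Each incoming event updates the running count for its key and emits the new value,
  // instead of waiting for the input to end.
  def incrementalCounts(events: Seq[String]): Seq[(String, Long)] =
    events
      .scanLeft((Map.empty[String, Long], Option.empty[(String, Long)])) {
        case ((counts, _), key) =>
          val updated = counts.getOrElse(key, 0L) + 1L
          (counts.updated(key, updated), Some(key -> updated))
      }
      .collect { case (_, Some(update)) => update }

  // The batch-style answer over the same input once it is bounded.
  def finalCounts(events: Seq[String]): Map[String, Long] =
    events.groupBy(identity).map { case (k, vs) => (k, vs.size.toLong) }
}
```

Keeping only the latest emitted row per key gives the same answer as the bounded query, which is the sense in which the streaming result is consistent with SQL on bounded data.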
28. Very large state
Terabytes of state inside the stream processor
Maintaining fast checkpoints and recovery
E.g., long histories of windows, large join tables
State at local memory speed