Data stream processing is, for many of us, a new paradigm for processing data and building applications. In this talk, we will take you on a journey through the theoretical foundations of stream processing and discuss the underlying principles and unique problems that need to be addressed. What actually is a data stream anyway? How do I use it? How do streams relate to application state, and when do I use one or the other?
ksqlDB and Kafka Streams are both, at their core, designed to help build stream processing applications. We will explain how stream processing principles are reflected in the design of each system and which trade-offs were chosen (and, more importantly, why). Finally, we take a look into the future at how the stream processing space, and in particular ksqlDB and Kafka Streams, may evolve over the next few years, as we outline extensions and improvements to the underlying conceptual model. So bring your thinking hats and notepads and prepare to learn WHY these systems are the way they are!
7. Unifying Batch and Stream Processing?
No! Approach this from a different direction.
• Batch vs. stream is not just about “processing latency”; it is fundamentally about semantics!
• What they have in common is that both transform data.
Instead of focusing on “batch” vs. “incremental” data transformation, stream processing is about “data in motion” in contrast to “data at rest”.
• The goal is to UNIFY data in motion and data at rest under a single abstraction.
We think of this as integrating state and events:
• Streams and tables in a unified processing model and programming interface / query language.
8. Events and “the Log”
Immutability as core concept
Use a log to store events
• Logs store immutable data
• Logs preserve order
Think of a stream-table join: an input event is enriched to produce an output event.
• After the output event is produced, it is immutable.
• Thus, if the table is updated later, the already-enriched event is not updated.
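To make this concrete, here is a minimal sketch in plain Java (no Kafka APIs; class and method names are hypothetical) of a stream-table join whose outputs, once appended to the log, are never rewritten:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a stream-table join: order events are enriched
// with the table's CURRENT value and appended to an immutable output log.
public class StreamTableJoin {
    static final Map<String, String> cityByCompany = new HashMap<>();
    static final List<String> outputLog = new ArrayList<>();

    static void processOrder(String company, String item) {
        // enrich the input event with the table's current value and append
        outputLog.add(company + " " + item + " " + cityByCompany.get(company));
    }

    static List<String> runExample() {
        cityByCompany.clear();
        outputLog.clear();
        cityByCompany.put("CFLT", "PA");      // table: Confluent -> PA
        processOrder("CFLT", "10 Laptops");   // enriched with PA
        cityByCompany.put("CFLT", "MV");      // table update: PA -> MV
        processOrder("CFLT", "20 Mice");      // enriched with MV
        // the earlier output event still says PA: it is NOT rewritten
        return outputLog;
    }

    public static void main(String[] args) {
        System.out.println(runExample());
    }
}
```

Even after the table is updated to MV, the first enriched event keeps PA: immutability of the log in action.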
10.–15. Events and “the Log” (cont.)

(Figure sequence, slides 10–15: the stream-table join animated step by step. Order events CFLT 10 Laptops, CFLT 3 Router, and CFLT 20 Mice are enriched against a table mapping Company → City (Confluent → PA, Oracle → RWC). The first two events are enriched with PA. The table is then updated to Confluent → MV, so the later event CFLT 20 Mice is enriched with MV, while the earlier enriched events keep PA: output events are immutable.)
17. Stream Processing and Time Semantics
Time is a first-class citizen
A stream processing engine needs to reason about time (time is not just another data dimension).
Batch processing ignores the time dimension
• It is the user’s responsibility to reason about time; the processing engine does not understand time semantics.
Event Time
• When did something happen? (e.g. “click on ad”)
Events and State
• Tables need a time dimension, too!
18. Event Time vs Processing Time
Event Time
Every event carries a timestamp embedded in the record, capturing when the event happened.
• Stream processing semantics are defined on event time.
Processing Time
When we process the data (last week, today, tomorrow).
• Must not impact the result.
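A toy illustration of this principle (plain Java, names hypothetical): an event-time windowed count depends only on the timestamps embedded in the events, so processing the same events in a different processing-time order yields an identical result.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count events per 10-unit tumbling event-time window.
// The result is a function of the embedded event timestamps only, so the
// order in which events are processed cannot change it.
public class EventTimeCount {
    static Map<Long, Integer> countPerWindow(List<Long> eventTimes) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimes) {
            long windowStart = (ts / 10) * 10;  // window the event belongs to
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> inOrder  = List.of(1L, 5L, 12L, 14L, 21L);
        List<Long> shuffled = List.of(21L, 1L, 14L, 5L, 12L);
        // same events, different processing order, identical result
        System.out.println(countPerWindow(inOrder).equals(countPerWindow(shuffled)));
    }
}
```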
19. Tables and Time Semantics
Classic DB systems track neither time nor “versions”: a table only ever holds its latest state.

{A: 10, B: 42, C: 23} → {A: 20, B: 42, C: 23} → {A: 20, B: 73, C: 23}

Each update overwrites the old value in place.
20. Tables and Time Semantics (cont.)
Manually track time and versions
Or use newer SQL features like “system versioned temporal tables”:

KEY | VALUE  | ValidFrom | ValidTo
----|--------|-----------|----------
A   | 10     | 1         | 17
B   | 42     | 3         | 23
C   | 23     | 10        | MAX_VALUE
A   | 20     | 17        | 40
B   | 73     | 23        | MAX_VALUE
A   | <null> | 40        | MAX_VALUE
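Using the rows above, a point-in-time (“as of”) lookup on such a temporal table can be sketched in plain Java (class and method names are hypothetical):

```java
import java.util.List;

// Hypothetical sketch of a system-versioned temporal table lookup.
// Each row is valid for the half-open interval [validFrom, validTo).
public class TemporalTable {
    record Row(String key, Integer value, long validFrom, long validTo) {}

    static final List<Row> rows = List.of(
        new Row("A", 10, 1, 17),
        new Row("B", 42, 3, 23),
        new Row("C", 23, 10, Long.MAX_VALUE),
        new Row("A", 20, 17, 40),
        new Row("B", 73, 23, Long.MAX_VALUE),
        new Row("A", null, 40, Long.MAX_VALUE)  // <null>: A deleted at t=40
    );

    // Which value was valid for `key` at event time `t`?
    static Integer lookup(String key, long t) {
        for (Row r : rows)
            if (r.key().equals(key) && r.validFrom() <= t && t < r.validTo())
                return r.value();
        return null;
    }
}
```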
21. Versioned Tables
A sequence of table versions over event time:

t=1:  {A: 10}
t=3:  {A: 10, B: 42}
t=10: {A: 10, B: 42, C: 23}
t=17: {A: 20, B: 42, C: 23}
t=23: {A: 20, B: 73, C: 23}
t=40: {A: 20, B: 42, C: 23}
22. Events and State in Time

(Figure: the stream-table join example revisited with explicit event times. The order events arrive at t=10, t=15, and t=30, and the table is updated from Confluent → PA to Confluent → MV in between; lookup points t=5 and t=20 are marked. Each event is enriched with the city valid at its event time: the earlier events get PA, the last one MV.)
24. Deterministic Processing
Log Order (aka offset order)
• Messages are stored in append order (per partition).
• All consumers read messages in the same order.
• The log can also be re-read multiple times; the order of messages is always the same.
Kafka’s strict ordering guarantees allow us to build consistent and deterministic applications!
25. Log Order and Event Time
However:
• Stream processing semantics are defined on (logical) event-time order.
• Infinite input and out-of-order data (in event time) make the concept of completeness fuzzy.
Consistency != Completeness
26. When is the Puzzle completed?
Limit “scope” via (time) windows
• Define puzzle “boundaries”
• How many puzzle pieces do we have?
27. When is which Puzzle (window) completed?
(Figure: consecutive windows laid out along the event-time axis.)
28. Grace Period
Consistency != Completeness
Only you know when your data is complete (or rather, how much out-of-order-ness you are prepared to tolerate).
Configurable grace period.
Eventual completeness?
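One way to picture the grace period (a hypothetical plain-Java sketch, not the actual Kafka Streams implementation): a window stays open for late, out-of-order events until stream time, i.e. the maximum event time seen so far, passes the window end plus the grace period; anything arriving later is dropped.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling windows with a grace period.
public class GraceWindow {
    static final long WINDOW = 10, GRACE = 5;
    static long streamTime = Long.MIN_VALUE;     // max event time seen so far
    static final Map<Long, Integer> counts = new TreeMap<>();
    static int droppedAsTooLate = 0;

    static void process(long eventTime) {
        streamTime = Math.max(streamTime, eventTime);
        long windowStart = (eventTime / WINDOW) * WINDOW;
        long windowEnd = windowStart + WINDOW;
        if (streamTime >= windowEnd + GRACE) {
            droppedAsTooLate++;                  // window already closed
        } else {
            counts.merge(windowStart, 1, Integer::sum);  // within grace: accept
        }
    }

    public static void main(String[] args) {
        process(1);
        process(12);
        process(8);   // out of order, but within grace of window [0,10): counted
        process(16);
        process(3);   // stream time 16 >= 10 + 5: window [0,10) closed, dropped
        System.out.println(counts + " dropped=" + droppedAsTooLate);
    }
}
```

Note the trade-off the slide describes: a longer grace period tolerates more out-of-order-ness but delays the point at which a window's result can be considered final.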
30. Future of Stream Processing
Time-versioned tables:
CREATE TABLE versionedTable <schema> VERSION RETENTION TIME 25 SECONDS;
31. Future of Stream Processing
Time-versioned tables:
CREATE TABLE versionedTable <schema> VERSION RETENTION TIME 25 SECONDS;
(Figure: the sequence of table versions at t=1, 3, 10, 17, 23, and 40, as on slide 21.)
32. Future of Stream Processing
Time-versioned tables:
CREATE TABLE versionedTable <schema> VERSION RETENTION TIME 25 SECONDS;
(Figure: the same sequence of table versions, now with retention = 25 marked: only versions within the 25-second retention window, measured from the latest event time, are kept.)
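Under the hood, such a versioned table might keep multiple timestamped versions per key and purge versions that fall out of the retention window. A hypothetical plain-Java sketch of this idea (the purge rule here is my own simplification, not the ksqlDB/Kafka Streams implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a versioned store with version retention:
// a version becomes droppable once the version that supersedes it
// already lies at or before the retention horizon
// (latest event time minus the retention period).
public class VersionedStore {
    static final long RETENTION = 25;
    static long streamTime = Long.MIN_VALUE;
    // key -> (version timestamp -> value)
    static final Map<String, TreeMap<Long, Integer>> store = new HashMap<>();

    static void put(String key, Integer value, long ts) {
        streamTime = Math.max(streamTime, ts);
        store.computeIfAbsent(key, k -> new TreeMap<>()).put(ts, value);
        long horizon = streamTime - RETENTION;
        for (TreeMap<Long, Integer> versions : store.values())
            while (versions.size() > 1
                   && versions.higherKey(versions.firstKey()) <= horizon)
                versions.pollFirstEntry();  // superseded before the horizon
    }

    // Look up the value of `key` as of event time `asOf`; returns null
    // if the history for that time has already been purged.
    static Integer get(String key, long asOf) {
        TreeMap<Long, Integer> versions = store.get(key);
        if (versions == null) return null;
        Map.Entry<Long, Integer> e = versions.floorEntry(asOf);
        return e == null ? null : e.getValue();
    }
}
```

Once stream time advances far enough, a lookup "as of" an old timestamp no longer finds a version: exactly the bounded-history behavior that VERSION RETENTION TIME expresses.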
34. Future of Stream Processing (cont.)
Easier Reasoning about Completeness
• ksqlDB and Kafka Streams are distributed systems
• Vector clocks
DM32 Ad-hoc group on SQL extensions for streaming data
• Confluent, Oracle, IBM, Microsoft, Google, Alibaba
35. Learn More, Join In!
SIGMOD paper
https://www.confluent.io/resources/white-paper/distributed-stream-processing-in-kafka
Apache Kafka Improvement Proposals (KIPs)
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Confluent blog
https://confluent.io/blog
Streaming Audio with Tim Berglund
https://developer.confluent.io/podcast/