Data stream processing is, for many of us, a new paradigm for processing data and building applications. In this talk, we will take you on a journey through the theoretical foundations of stream processing and discuss the underlying principles and unique problems that need to be addressed. What actually is a data stream anyway? How do I use it? How do streams relate to application state, and when do I use one or the other?
ksqlDB and Kafka Streams are both, at their core, designed to help build stream processing applications. We will explain how stream processing principles are reflected in the design of each system and which trade-offs were chosen (and, more importantly, why). Finally, we take a look into the future at how the stream processing space, and in particular ksqlDB and Kafka Streams, may evolve over the next few years, as we outline extensions and improvements to the underlying conceptual model. So bring your thinking hats and notepads and prepare to learn WHY these systems are the way they are!
7. Unifying Batch and Stream Processing?
No! Approach this from a different direction.
• Batch vs. stream is not just about “processing latency”; it is fundamentally about semantics!
• What they have in common is that both transform data.
Instead of focusing on “batch” vs. “incremental” data transformation, stream processing is about “data in motion” in contrast to “data at rest”.
• The goal is to UNIFY data in motion and data at rest under a single abstraction.
We think of this as integrating state and events:
• Streams and tables in a unified processing model and programming interface / query language.
8. Events and “the Log”
Immutability as core concept
Use a log to store events
• Logs store immutable data
• Logs preserve order
Think of a stream-table join: an input event is enriched to produce an output event.
• After the output event is produced, it is immutable.
• Thus, if the table is updated later, the already-enriched event is not updated.
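To make this concrete, here is a minimal sketch in plain Java (no Kafka APIs; class and method names are hypothetical) of a stream-table join whose outputs, once appended to the log, are never rewritten:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a stream-table join: order events are enriched
// with the table's CURRENT value and appended to an immutable output log.
public class StreamTableJoin {
    static final Map<String, String> cityByCompany = new HashMap<>();
    static final List<String> outputLog = new ArrayList<>();

    static void processOrder(String company, String item) {
        // enrich the input event with the table's current value and append
        outputLog.add(company + " " + item + " " + cityByCompany.get(company));
    }

    static List<String> runExample() {
        cityByCompany.clear();
        outputLog.clear();
        cityByCompany.put("CFLT", "PA");      // table: Confluent -> PA
        processOrder("CFLT", "10 Laptops");   // enriched with PA
        cityByCompany.put("CFLT", "MV");      // table update: PA -> MV
        processOrder("CFLT", "20 Mice");      // enriched with MV
        // the earlier output event still says PA: it is NOT rewritten
        return outputLog;
    }

    public static void main(String[] args) {
        System.out.println(runExample());
    }
}
```

Even after the table is updated to MV, the first enriched event keeps PA: immutability of the log in action.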
10.–15. Events and “the Log” (cont.)

(Figure sequence, slides 10–15: the stream-table join animated step by step. Order events CFLT 10 Laptops, CFLT 3 Router, and CFLT 20 Mice are enriched against a table mapping Company → City (Confluent → PA, Oracle → RWC). The first two events are enriched with PA. The table is then updated to Confluent → MV, so the later event CFLT 20 Mice is enriched with MV, while the earlier enriched events keep PA: output events are immutable.)
17. Stream Processing and Time Semantics
Time is a first-class citizen
A stream processing engine needs to reason about time (time is not just another data dimension).
Batch processing ignores the time dimension
• It is the user’s responsibility to reason about time; the processing engine does not understand time semantics.
Event Time
• When did something happen? (e.g. “click on ad”)
Events and State
• Tables need a time dimension, too!
18. Event Time vs Processing Time
Event Time
Every event carries a timestamp embedded in the record, capturing when the event happened.
• Stream processing semantics are defined on event time.
Processing Time
When we process the data (last week, today, tomorrow).
• Must not impact the result.
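A toy illustration of this principle (plain Java, names hypothetical): an event-time windowed count depends only on the timestamps embedded in the events, so processing the same events in a different processing-time order yields an identical result.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count events per 10-unit tumbling event-time window.
// The result is a function of the embedded event timestamps only, so the
// order in which events are processed cannot change it.
public class EventTimeCount {
    static Map<Long, Integer> countPerWindow(List<Long> eventTimes) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimes) {
            long windowStart = (ts / 10) * 10;  // window the event belongs to
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> inOrder  = List.of(1L, 5L, 12L, 14L, 21L);
        List<Long> shuffled = List.of(21L, 1L, 14L, 5L, 12L);
        // same events, different processing order, identical result
        System.out.println(countPerWindow(inOrder).equals(countPerWindow(shuffled)));
    }
}
```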
19. Tables and Time Semantics
Classic DB systems track neither time nor “versions”: a table only ever holds its latest state.

{A: 10, B: 42, C: 23} → {A: 20, B: 42, C: 23} → {A: 20, B: 73, C: 23}

Each update overwrites the old value in place.
20. Tables and Time Semantics (cont.)
Manually track time and versions
Or use newer SQL features like “system versioned temporal tables”:

KEY | VALUE  | ValidFrom | ValidTo
----|--------|-----------|----------
A   | 10     | 1         | 17
B   | 42     | 3         | 23
C   | 23     | 10        | MAX_VALUE
A   | 20     | 17        | 40
B   | 73     | 23        | MAX_VALUE
A   | <null> | 40        | MAX_VALUE
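Using the rows above, a point-in-time (“as of”) lookup on such a temporal table can be sketched in plain Java (class and method names are hypothetical):

```java
import java.util.List;

// Hypothetical sketch of a system-versioned temporal table lookup.
// Each row is valid for the half-open interval [validFrom, validTo).
public class TemporalTable {
    record Row(String key, Integer value, long validFrom, long validTo) {}

    static final List<Row> rows = List.of(
        new Row("A", 10, 1, 17),
        new Row("B", 42, 3, 23),
        new Row("C", 23, 10, Long.MAX_VALUE),
        new Row("A", 20, 17, 40),
        new Row("B", 73, 23, Long.MAX_VALUE),
        new Row("A", null, 40, Long.MAX_VALUE)  // <null>: A deleted at t=40
    );

    // Which value was valid for `key` at event time `t`?
    static Integer lookup(String key, long t) {
        for (Row r : rows)
            if (r.key().equals(key) && r.validFrom() <= t && t < r.validTo())
                return r.value();
        return null;
    }
}
```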
21. Versioned Tables
A sequence of table versions over event time:

t=1:  {A: 10}
t=3:  {A: 10, B: 42}
t=10: {A: 10, B: 42, C: 23}
t=17: {A: 20, B: 42, C: 23}
t=23: {A: 20, B: 73, C: 23}
t=40: {A: 20, B: 42, C: 23}
22. Events and State in Time

(Figure: the stream-table join example revisited with explicit event times. The order events arrive at t=10, t=15, and t=30, and the table is updated from Confluent → PA to Confluent → MV in between; lookup points t=5 and t=20 are marked. Each event is enriched with the city valid at its event time: the earlier events get PA, the last one MV.)
24. Deterministic Processing
Log Order (aka offset order)
• Messages are stored in append order (per partition).
• All consumers read messages in the same order.
• The log can also be re-read multiple times; the order of messages is always the same.
Kafka’s strict ordering guarantees allow us to build consistent and deterministic applications!
25. Log Order and Event Time
However:
• Stream processing semantics are defined on (logical) event-time order.
• Infinite input and out-of-order data (in event time) make the concept of completeness fuzzy.
Consistency != Completeness
26. When is the Puzzle completed?
Limit “scope” via (time) windows
• Define puzzle “boundaries”
• How many puzzle pieces do we have?
27. When is which Puzzle (window) completed?
(Figure: consecutive windows laid out along the event-time axis.)
28. Grace Period
Consistency != Completeness
Only you know when your data is complete (or rather, how much out-of-order-ness you are prepared to tolerate).
Configurable grace period.
Eventual completeness?
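One way to picture the grace period (a hypothetical plain-Java sketch, not the actual Kafka Streams implementation): a window stays open for late, out-of-order events until stream time, i.e. the maximum event time seen so far, passes the window end plus the grace period; anything arriving later is dropped.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling windows with a grace period.
public class GraceWindow {
    static final long WINDOW = 10, GRACE = 5;
    static long streamTime = Long.MIN_VALUE;     // max event time seen so far
    static final Map<Long, Integer> counts = new TreeMap<>();
    static int droppedAsTooLate = 0;

    static void process(long eventTime) {
        streamTime = Math.max(streamTime, eventTime);
        long windowStart = (eventTime / WINDOW) * WINDOW;
        long windowEnd = windowStart + WINDOW;
        if (streamTime >= windowEnd + GRACE) {
            droppedAsTooLate++;                  // window already closed
        } else {
            counts.merge(windowStart, 1, Integer::sum);  // within grace: accept
        }
    }

    public static void main(String[] args) {
        process(1);
        process(12);
        process(8);   // out of order, but within grace of window [0,10): counted
        process(16);
        process(3);   // stream time 16 >= 10 + 5: window [0,10) closed, dropped
        System.out.println(counts + " dropped=" + droppedAsTooLate);
    }
}
```

Note the trade-off the slide describes: a longer grace period tolerates more out-of-order-ness but delays the point at which a window's result can be considered final.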
30. Future of Stream Processing
Time-versioned tables:
CREATE TABLE versionedTable <schema> VERSION RETENTION TIME 25 SECONDS;
31. Future of Stream Processing
Time-versioned tables:
CREATE TABLE versionedTable <schema> VERSION RETENTION TIME 25 SECONDS;
(Figure: the sequence of table versions at t=1, 3, 10, 17, 23, and 40, as on slide 21.)
32. Future of Stream Processing
Time-versioned tables:
CREATE TABLE versionedTable <schema> VERSION RETENTION TIME 25 SECONDS;
(Figure: the same sequence of table versions, now with retention = 25 marked: only versions within the 25-second retention window, measured from the latest event time, are kept.)
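Under the hood, such a versioned table might keep multiple timestamped versions per key and purge versions that fall out of the retention window. A hypothetical plain-Java sketch of this idea (the purge rule here is my own simplification, not the ksqlDB/Kafka Streams implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a versioned store with version retention:
// a version becomes droppable once the version that supersedes it
// already lies at or before the retention horizon
// (latest event time minus the retention period).
public class VersionedStore {
    static final long RETENTION = 25;
    static long streamTime = Long.MIN_VALUE;
    // key -> (version timestamp -> value)
    static final Map<String, TreeMap<Long, Integer>> store = new HashMap<>();

    static void put(String key, Integer value, long ts) {
        streamTime = Math.max(streamTime, ts);
        store.computeIfAbsent(key, k -> new TreeMap<>()).put(ts, value);
        long horizon = streamTime - RETENTION;
        for (TreeMap<Long, Integer> versions : store.values())
            while (versions.size() > 1
                   && versions.higherKey(versions.firstKey()) <= horizon)
                versions.pollFirstEntry();  // superseded before the horizon
    }

    // Look up the value of `key` as of event time `asOf`; returns null
    // if the history for that time has already been purged.
    static Integer get(String key, long asOf) {
        TreeMap<Long, Integer> versions = store.get(key);
        if (versions == null) return null;
        Map.Entry<Long, Integer> e = versions.floorEntry(asOf);
        return e == null ? null : e.getValue();
    }
}
```

Once stream time advances far enough, a lookup "as of" an old timestamp no longer finds a version: exactly the bounded-history behavior that VERSION RETENTION TIME expresses.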
34. Future of Stream Processing (cont.)
Easier Reasoning about Completeness
• ksqlDB and Kafka Streams are distributed systems
• Vector clocks
DM32 Ad-hoc group on SQL extensions for streaming data
• Confluent, Oracle, IBM, Microsoft, Google, Alibaba
35. Learn More, Join In!
SIGMOD paper
https://www.confluent.io/resources/white-paper/distributed-stream-processing-in-kafka
Apache Kafka Improvement Proposals (KIPs)
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Confluent blog
https://confluent.io/blog
Streaming Audio with Tim Berglund
https://developer.confluent.io/podcast/