With the Lakehouse emerging as the future of data architecture, Delta has become the de facto storage format for data pipelines. By using Delta to build curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end pipeline, executed on a regular schedule (typically daily) depending on the need; as data travels through each hop, its quality improves until it is suitable for end-user consumption. At the same time, real-time capabilities are key for any business, and Delta's seamless integration with Structured Streaming makes them easy to achieve. Overall, Delta Lake as a streaming source is a natural fit, and we are already seeing rising adoption among our users.
In this talk, we discuss the functional components of Structured Streaming with Delta as a streaming source: a deep dive into Query Progress Logs (QPL) and their significance for operating streams in production; how to track the progress of any streaming job and map it to the source Delta table using the QPL; what exactly gets persisted in the checkpoint directory; and how the contents of the checkpoint directory map to QPL metrics and why they matter for Delta streams.
7. What will we deep dive into?
1. Structured Streaming Internals
2. Delta Table properties
3. Common pitfalls & mitigation strategies
Checkpoint
Transaction Log
11. Query Progress Log (QPL)
• JSON log generated for every microbatch
• Provides batch execution details and metrics
• Used to render the streaming dashboard in notebook cells (see the sketch below for reading it programmatically)
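A minimal sketch of reading the QPL programmatically (PySpark); the sink table, checkpoint path, and variable names are placeholders, not from the deck, and the read side reuses the delta_keyval table from the sample stream shown later.

import json

# Start a simple Delta-to-Delta stream (illustrative names).
q = (spark.readStream.format("delta").table("delta_keyval")
     .writeStream.format("delta")
     .option("checkpointLocation", "/tmp/checkpoints/delta_stream_demo")
     .toTable("delta_stream_demo"))

# lastProgress is the most recent QPL entry as a dict (None until the
# first microbatch completes); recentProgress keeps the latest entries.
print(json.dumps(q.lastProgress, indent=2))
for p in q.recentProgress:
    print(p["batchId"], p["numInputRows"])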
13. Metrics Categories
Batch execution metrics
Key metrics
id
● Unique id of the stream
● Maps to the checkpoint directory
batchId
● Microbatch id (see the listener sketch below for capturing it per batch)
JSON
{
  "id" : "f87419cf-e92c-4d8a-b801-0ac1518da5e6",
  "runId" : "d7e7fe6b-6386-4276-a936-2485a1522190",
  "name" : "simple_stream",
  "batchId" : 1,
  ...
}
[Notebook UI: streaming dashboard screenshot]
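As a sketch of capturing these ids outside the notebook UI, the listener below logs id and batchId for every completed microbatch. It assumes Spark 3.4+ (where PySpark exposes StreamingQueryListener); the class name and print-based logging are illustrative choices, not from the deck.

from pyspark.sql.streaming import StreamingQueryListener

class QplLogger(StreamingQueryListener):
    """Log key batch-execution metrics from each QPL entry."""
    def onQueryStarted(self, event):
        print(f"stream started: id={event.id} runId={event.runId}")
    def onQueryProgress(self, event):
        p = event.progress
        print(f"id={p.id} batchId={p.batchId} numInputRows={p.numInputRows}")
    def onQueryTerminated(self, event):
        print(f"stream terminated: id={event.id}")

spark.streams.addListener(QplLogger())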
14. Metrics Categories
Delta source and sink metrics
Key metrics
startOffset / endOffset
Delta source offsets at which the batch started and ended (see the mapping sketch below)
● reservoirVersion
  ○ version of the Delta table at which the current stream execution started
● index
  ○ index of the file within that table version
● isStartingVersion
  ○ true if the stream started from this reservoir version
numInputRows
Count of rows ingested in the microbatch (the Delta sink reports numOutputRows as -1 because it does not track output row counts)
JSON
"sources" : [ {
  "description" : "DeltaSource[dbfs:/user/hive/..]",
  "startOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
    "reservoirVersion" : 3,
    "index" : 2,
    "isStartingVersion" : true
  },
  "endOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
    "reservoirVersion" : 3,
    "index" : 3,
    "isStartingVersion" : true
  },
  "numInputRows" : 1,
  ...
} ],
"sink" : {
  "description" : "DeltaSink[dbfs:/user/hive/warehouse/..]",
  "numOutputRows" : -1
}
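A minimal sketch of mapping the source offsets above to the Delta table's own transaction log (PySpark with delta-spark); it assumes the query handle q from the earlier sketch and the delta_keyval source table, and the exact dict layout of lastProgress can vary slightly across Spark versions.

from delta.tables import DeltaTable

# Delta source progress for the last completed microbatch.
src = q.lastProgress["sources"][0]
print("startOffset:", src["startOffset"])   # reservoirVersion / index / isStartingVersion
print("endOffset  :", src["endOffset"])

# Latest committed version of the source table, to compare against
# endOffset.reservoirVersion and see how far the stream lags the Delta log.
latest_version = (DeltaTable.forName(spark, "delta_keyval")
                  .history(1)
                  .collect()[0]["version"])
print("latest table version:", latest_version)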
15. Metrics Categories
Performance metrics
Key metrics
inputRowsPerSecond
● The rate at which data arrives from this source into the stream
● Does not represent the ingestion rate at the source table
processedRowsPerSecond
● The rate at which Spark processes data from this source
● If the input rate exceeds the processing rate, the stream is falling behind (see the sketch below)
JSON
"inputRowsPerSecond" : 0.016666666666666666,
"processedRowsPerSecond" : 0.2986857825567503
[Notebook UI: streaming dashboard screenshot]
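A minimal sketch of turning these two rates into a lag check; it assumes the query handle q from the earlier sketch, and the 1.0 ratio threshold is an illustrative choice, not from the deck.

def is_falling_behind(progress, ratio_threshold=1.0):
    """Return True when rows arrive faster than they are being processed."""
    if not progress:                     # no microbatch has completed yet
        return False
    input_rate = progress.get("inputRowsPerSecond", 0.0)
    processed_rate = progress.get("processedRowsPerSecond", 0.0)
    return processed_rate > 0 and (input_rate / processed_rate) > ratio_threshold

print(is_falling_behind(q.lastProgress))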
18. Sample Delta Lake stream
spark
  .readStream
  .format("delta")
  .option("maxFilesPerTrigger", 5)
  .table("delta_keyval")            # source
  ...
  .writeStream
  .format("delta")
  .trigger(processingTime="60 seconds")
  .option("checkpointLocation", "…")
  .toTable("delta_stream")          # sink (starts the stream)
maxFilesPerTrigger
● Number of files ingested per microbatch
● If not specified, the stream falls back to the default value (1000)
● Always affects all Delta streams
19. Sample Delta Lake stream
spark
  .readStream
  .format("delta")
  .option("maxFilesPerTrigger", 5)
  .table("delta_keyval")            # source
  ...
  .writeStream
  .format("delta")
  .trigger(processingTime="60 seconds")
  .option("checkpointLocation", "…")
  .toTable("delta_stream")          # sink (starts the stream)
trigger
● Using a processing-time trigger (one microbatch every 60 seconds)
● Default - no trigger specified: the next microbatch starts as soon as the previous one completes
● Microbatch mode
20. Delta source - Stream Mapping
[Diagram: source table details - Data and Table History views of the source table at version 0, with file indexes 0-7]
32. Streaming Checkpoint
● Tracks the streaming query
● All data is stored as JSON
● Contents (see the layout sketch below)
  ● offsets
  ● metadata
  ● commits
● Microbatch lifecycle
  Step 1 - Construct microbatch
  Step 2 - Process microbatch
  Step 3 - Commit microbatch
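An illustrative layout of a checkpoint directory after two microbatches; the path is a placeholder, and the exact contents vary by workload (stateful queries also add a state/ directory, and some sources add a sources/ directory).

<checkpointLocation>/
├── metadata        # stream id, written once when the stream first starts
├── offsets/
│   ├── 0           # one file per microbatch, written when the batch is constructed
│   └── 1
└── commits/
    ├── 0           # one file per microbatch, written when the batch completes
    └── 1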
33. Streaming Checkpoint - Offsets (Step 1 - Construct Microbatch)
● One file per microbatch
● The offset file is written when the batch starts
● The following details are stored (an illustrative file follows)
  a. Batch and streaming state details
  b. Source details
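An illustrative sketch of an offsets file for a Delta stream (hypothetical values, reusing the reservoirId from the QPL example; the exact conf entries and timestamps vary by Spark/DBR version): the first line is the log format version, the second carries batch and streaming-state details, and the last line is the Delta source offset.

v1
{"batchWatermarkMs":0,"batchTimestampMs":1640995200000,"conf":{"spark.sql.streaming.stateFormatVersion":"2"}}
{"sourceVersion":1,"reservoirId":"483e5927-1c26-4af5-b7a2-31a1fee8983a","reservoirVersion":3,"index":3,"isStartingVersion":true}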
35. Streaming Checkpoint - Metadata
● Metadata of the stream
● The stream id is generated when the stream first starts
● It remains the same for the lifetime of the checkpoint directory (an illustrative file follows)
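An illustrative metadata file: it holds only the stream id, the same id surfaced in the QPL batch-execution metrics (reused here from the earlier example).

{"id":"f87419cf-e92c-4d8a-b801-0ac1518da5e6"}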
36. Streaming Checkpoint - Commits (Step 3 - Commit Microbatch)
● One file per microbatch is generated
● The file is written only if the batch completes
● On query restart, the number of commit files is compared with the number of offset files (see the sketch below)
  ○ Equal -> start a new batch
  ○ Not equal -> finish the previous batch first
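A minimal sketch of that restart check against a local checkpoint directory; the path is hypothetical, and for simplicity the sketch ignores any auxiliary files Spark may keep alongside the numbered batch files.

import os

checkpoint = "/tmp/checkpoints/delta_stream_demo"   # hypothetical path
offsets = {f for f in os.listdir(os.path.join(checkpoint, "offsets")) if f.isdigit()}
commits = {f for f in os.listdir(os.path.join(checkpoint, "commits")) if f.isdigit()}

pending = offsets - commits
if pending:
    print("offset(s) without a commit, the batch will be re-run on restart:", sorted(pending))
else:
    print("all batches committed; the next restart constructs a new batch")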
37. Summary
● Delta Streaming: the Delta log <-> QPL mapping explains stream execution
● Query Progress Logs: the QPL is the source of truth for stream execution
● Stream Checkpoint: offsets and commits are key to maintaining stream state
38. What Next?
● Other variations of Delta streams (Trigger.Once, maxBytesPerTrigger)
● Delta table properties
● Common pitfalls & mitigation strategies