With the Lakehouse emerging as the future of data architecture, Delta has become the de facto storage format for data pipelines. By using Delta to build curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end pipeline, executed on a regular schedule (typically daily) depending on the need; as data travels through each hop, its quality improves until it is suitable for end-user consumption. At the same time, real-time capabilities are key for any business, and Delta's seamless integration with Structured Streaming makes them easy to achieve. Overall, Delta Lake as a streaming source is a natural fit, and we are already seeing rising adoption among our users.
In this talk, we discuss the functional components of Structured Streaming with Delta as a streaming source: a deep dive into Query Progress Logs (QPL) and their significance for operating streams in production; how to track the progress of any streaming job and map it to the source Delta table using the QPL; what exactly gets persisted in the checkpoint directory; and how the contents of the checkpoint directory map to QPL metrics and why they matter for Delta streams.
7. What will we deep dive into?
1. Structured Streaming Internals
2. Delta Table properties
3. Common pitfalls & mitigation strategies
Checkpoint
Transaction Log
11. Query Progress Log (QPL)
• JSON log generated for every microbatch
• Provides batch execution details and metrics
• Used to render the streaming dashboard in notebook cells (see the sketch below for reading it programmatically)
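A minimal sketch of reading the QPL programmatically (PySpark); the sink table, checkpoint path, and variable names are placeholders, not from the deck, and the read side reuses the delta_keyval table from the sample stream shown later.

import json

# Start a simple Delta-to-Delta stream (illustrative names).
q = (spark.readStream.format("delta").table("delta_keyval")
     .writeStream.format("delta")
     .option("checkpointLocation", "/tmp/checkpoints/delta_stream_demo")
     .toTable("delta_stream_demo"))

# lastProgress is the most recent QPL entry as a dict (None until the
# first microbatch completes); recentProgress keeps the latest entries.
print(json.dumps(q.lastProgress, indent=2))
for p in q.recentProgress:
    print(p["batchId"], p["numInputRows"])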
13. Metrics Categories
Batch execution metrics
Key metrics
id
● Unique id of the stream
● Maps to the checkpoint directory
batchId
● Microbatch id (see the listener sketch below for capturing it per batch)
JSON
{
  "id" : "f87419cf-e92c-4d8a-b801-0ac1518da5e6",
  "runId" : "d7e7fe6b-6386-4276-a936-2485a1522190",
  "name" : "simple_stream",
  "batchId" : 1,
  ...
}
[Notebook UI: streaming dashboard screenshot]
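As a sketch of capturing these ids outside the notebook UI, the listener below logs id and batchId for every completed microbatch. It assumes Spark 3.4+ (where PySpark exposes StreamingQueryListener); the class name and print-based logging are illustrative choices, not from the deck.

from pyspark.sql.streaming import StreamingQueryListener

class QplLogger(StreamingQueryListener):
    """Log key batch-execution metrics from each QPL entry."""
    def onQueryStarted(self, event):
        print(f"stream started: id={event.id} runId={event.runId}")
    def onQueryProgress(self, event):
        p = event.progress
        print(f"id={p.id} batchId={p.batchId} numInputRows={p.numInputRows}")
    def onQueryTerminated(self, event):
        print(f"stream terminated: id={event.id}")

spark.streams.addListener(QplLogger())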
14. Metrics Categories
Delta source and sink metrics
Key metrics
startOffset / endOffset
Delta source offsets at which the batch started and ended (see the mapping sketch below)
● reservoirVersion
  ○ version of the Delta table at which the current stream execution started
● index
  ○ index of the file within that table version
● isStartingVersion
  ○ true if the stream started from this reservoir version
numInputRows
Count of rows ingested in the microbatch (the Delta sink reports numOutputRows as -1 because it does not track output row counts)
JSON
"sources" : [ {
  "description" : "DeltaSource[dbfs:/user/hive/..]",
  "startOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
    "reservoirVersion" : 3,
    "index" : 2,
    "isStartingVersion" : true
  },
  "endOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
    "reservoirVersion" : 3,
    "index" : 3,
    "isStartingVersion" : true
  },
  "numInputRows" : 1,
  ...
} ],
"sink" : {
  "description" : "DeltaSink[dbfs:/user/hive/warehouse/..]",
  "numOutputRows" : -1
}
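A minimal sketch of mapping the source offsets above to the Delta table's own transaction log (PySpark with delta-spark); it assumes the query handle q from the earlier sketch and the delta_keyval source table, and the exact dict layout of lastProgress can vary slightly across Spark versions.

from delta.tables import DeltaTable

# Delta source progress for the last completed microbatch.
src = q.lastProgress["sources"][0]
print("startOffset:", src["startOffset"])   # reservoirVersion / index / isStartingVersion
print("endOffset  :", src["endOffset"])

# Latest committed version of the source table, to compare against
# endOffset.reservoirVersion and see how far the stream lags the Delta log.
latest_version = (DeltaTable.forName(spark, "delta_keyval")
                  .history(1)
                  .collect()[0]["version"])
print("latest table version:", latest_version)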
15. Metrics Categories
Performance metrics
Key metrics
inputRowsPerSecond
● The rate at which data arrives from this source into the stream
● Does not represent the ingestion rate at the source table
processedRowsPerSecond
● The rate at which Spark processes data from this source
● If the input rate exceeds the processing rate, the stream is falling behind (see the sketch below)
JSON
"inputRowsPerSecond" : 0.016666666666666666,
"processedRowsPerSecond" : 0.2986857825567503
[Notebook UI: streaming dashboard screenshot]
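A minimal sketch of turning these two rates into a lag check; it assumes the query handle q from the earlier sketch, and the 1.0 ratio threshold is an illustrative choice, not from the deck.

def is_falling_behind(progress, ratio_threshold=1.0):
    """Return True when rows arrive faster than they are being processed."""
    if not progress:                     # no microbatch has completed yet
        return False
    input_rate = progress.get("inputRowsPerSecond", 0.0)
    processed_rate = progress.get("processedRowsPerSecond", 0.0)
    return processed_rate > 0 and (input_rate / processed_rate) > ratio_threshold

print(is_falling_behind(q.lastProgress))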
18. Sample Delta Lake stream
spark
  .readStream
  .format("delta")
  .option("maxFilesPerTrigger", 5)
  .table("delta_keyval")            # source
  ...
  .writeStream
  .format("delta")
  .trigger(processingTime="60 seconds")
  .option("checkpointLocation", "…")
  .toTable("delta_stream")          # sink (starts the stream)
maxFilesPerTrigger
● Number of files ingested per microbatch
● If not specified, the stream falls back to the default value (1000)
● Always affects all Delta streams
19. Sample Delta Lake stream
spark
  .readStream
  .format("delta")
  .option("maxFilesPerTrigger", 5)
  .table("delta_keyval")            # source
  ...
  .writeStream
  .format("delta")
  .trigger(processingTime="60 seconds")
  .option("checkpointLocation", "…")
  .toTable("delta_stream")          # sink (starts the stream)
trigger
● Using a processing-time trigger (one microbatch every 60 seconds)
● Default - no trigger specified: the next microbatch starts as soon as the previous one completes
● Microbatch mode
20. Delta source - Stream Mapping
[Diagram: source table details - Data and Table History views of the source table at version 0, with file indexes 0-7]
32. Streaming Checkpoint
● Tracks the streaming query
● All data is stored as JSON
● Contents (see the layout sketch below)
  ● offsets
  ● metadata
  ● commits
● Microbatch lifecycle
  Step 1 - Construct microbatch
  Step 2 - Process microbatch
  Step 3 - Commit microbatch
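An illustrative layout of a checkpoint directory after two microbatches; the path is a placeholder, and the exact contents vary by workload (stateful queries also add a state/ directory, and some sources add a sources/ directory).

<checkpointLocation>/
├── metadata        # stream id, written once when the stream first starts
├── offsets/
│   ├── 0           # one file per microbatch, written when the batch is constructed
│   └── 1
└── commits/
    ├── 0           # one file per microbatch, written when the batch completes
    └── 1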
33. Streaming Checkpoint - Offsets (Step 1 - Construct Microbatch)
● One file per microbatch
● The offset file is written when the batch starts
● The following details are stored (an illustrative file follows)
  a. Batch and streaming state details
  b. Source details
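An illustrative sketch of an offsets file for a Delta stream (hypothetical values, reusing the reservoirId from the QPL example; the exact conf entries and timestamps vary by Spark/DBR version): the first line is the log format version, the second carries batch and streaming-state details, and the last line is the Delta source offset.

v1
{"batchWatermarkMs":0,"batchTimestampMs":1640995200000,"conf":{"spark.sql.streaming.stateFormatVersion":"2"}}
{"sourceVersion":1,"reservoirId":"483e5927-1c26-4af5-b7a2-31a1fee8983a","reservoirVersion":3,"index":3,"isStartingVersion":true}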
35. Streaming Checkpoint - Metadata
● Metadata of the stream
● The stream id is generated when the stream first starts
● It remains the same for the lifetime of the checkpoint directory (an illustrative file follows)
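An illustrative metadata file: it holds only the stream id, the same id surfaced in the QPL batch-execution metrics (reused here from the earlier example).

{"id":"f87419cf-e92c-4d8a-b801-0ac1518da5e6"}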
36. Streaming Checkpoint - Commits (Step 3 - Commit Microbatch)
● One file per microbatch is generated
● The file is written only if the batch completes
● On query restart, the number of commit files is compared with the number of offset files (see the sketch below)
  ○ Equal -> start a new batch
  ○ Not equal -> finish the previous batch first
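A minimal sketch of that restart check against a local checkpoint directory; the path is hypothetical, and for simplicity the sketch ignores any auxiliary files Spark may keep alongside the numbered batch files.

import os

checkpoint = "/tmp/checkpoints/delta_stream_demo"   # hypothetical path
offsets = {f for f in os.listdir(os.path.join(checkpoint, "offsets")) if f.isdigit()}
commits = {f for f in os.listdir(os.path.join(checkpoint, "commits")) if f.isdigit()}

pending = offsets - commits
if pending:
    print("offset(s) without a commit, the batch will be re-run on restart:", sorted(pending))
else:
    print("all batches committed; the next restart constructs a new batch")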
37. Summary
● Delta Streaming: the Delta log <-> QPL mapping explains stream execution
● Query Progress Logs: the QPL is the source of truth for stream execution
● Stream Checkpoint: offsets and commits are key to maintaining stream state
38. What Next?
● Other variations of Delta streams (Trigger.Once, maxBytesPerTrigger)
● Delta table properties
● Common pitfalls & mitigation strategies