Delta Lake Streaming: Under the Hood
Structured Streaming Internals
Speaker
Shasidhar Eranti
Senior Resident Solutions Engineer
Databricks
Sample stream

spark
  .readStream
  .format("delta")
  ...
  .writeStream
  .format("delta")
  .option("checkpointLocation", "…")
  .toTable("delta_stream")

(Diagram: microbatches processed over time, each committing progress to the checkpoint)
Physical Components of a Delta Stream
1. Source
2. Sink
3. Checkpoint
4. Transaction Log
Focus area: Checkpoint and Transaction Log

What will we deep dive into?
1. Structured Streaming Internals
2. Delta Table properties
3. Common pitfalls & mitigation strategies
Structured Streaming internals
▪ Query Progress Logs (QPL)
▪ Streaming semantics with Delta Lake
▪ Streaming Checkpoint
With Delta as Source & Sink
Query Progress Log (QPL)
• A JSON log generated for every microbatch
• Provides batch execution details as metrics
• Used to render the streaming dashboard in notebook cells
Query progress log
Metrics categories (see the sketch below for programmatic access):
● Microbatch execution
● Source/Sink
● Stream performance
● Batch duration
● Streaming state
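The same QPL JSON can also be pulled programmatically. A minimal PySpark sketch, assuming query is the StreamingQuery handle returned by toTable()/start():

import json

# Most recent microbatch's progress, as a dict with the fields shown on
# the following slides (id, runId, name, batchId, sources, sink, ...).
progress = query.lastProgress
print(json.dumps(progress, indent=2))

# A bounded history of recent microbatches.
for p in query.recentProgress:
    print(p["batchId"], p["numInputRows"])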
Metrics Categories: Batch Execution metrics

JSON
"id" : "f87419cf-e92c-4d8a-b801-0ac1518da5e6",
"runId" : "d7e7fe6b-6386-4276-a936-2485a1522190",
"name" : "simple_stream",
"batchId" : 1,

Key metrics
id
● The stream's unique id
● Maps to the checkpoint directory
batchId
● Microbatch id

(Notebook UI screenshot)
Metrics Categories: Delta source and sink metrics

Key metrics
startOffset/endOffset
The offsets at which the batch started and ended
● reservoirVersion
○ version of the Delta table at which the current stream execution started
● index
○ file index within the transaction
● isStartingVersion
○ true if the stream starts from this reservoir version
numInputRows
Count of rows ingested in a microbatch
(The Delta sink reports numOutputRows as -1: the output row count is not available.)
JSON
"sources" : [ {
  "description" : "DeltaSource[dbfs:/user/hive/..]",
  "startOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
    "reservoirVersion" : 3,
    "index" : 2,
    "isStartingVersion" : true
  },
  "endOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "483e5927-1c26-4af5-b7a2-31a1fee8983a",
    "reservoirVersion" : 3,
    "index" : 3,
    "isStartingVersion" : true
  },
  "numInputRows" : 1
} ],
"sink" : {
  "description" : "DeltaSink[dbfs:/user/hive/warehouse/..]",
  "numOutputRows" : -1
}
Metrics Categories: Performance metrics

JSON
"inputRowsPerSecond" : 0.016666666666666666,
"processedRowsPerSecond" : 0.2986857825567503

Key metrics
inputRowsPerSecond
● The rate at which data is arriving from this source into the stream
● Doesn't represent the ingestion rate at the source itself
processedRowsPerSecond
● The rate at which data from this source is being processed by Spark
● If the input rate exceeds the processing rate, the stream is falling behind (see the sketch below)

(Notebook UI screenshot)
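A minimal sketch of that check, assuming query is the running StreamingQuery:

# Compare the two rates from the last progress report; a sustained gap
# means microbatches cannot keep up with the arriving data.
p = query.lastProgress
if p and p["inputRowsPerSecond"] > p["processedRowsPerSecond"]:
    print(f"Batch {p['batchId']}: input rate exceeds processing rate; "
          "the stream is falling behind")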
Streaming semantics with Delta Lake
Sample Delta Lake stream

spark
  .readStream
  .format("delta")
  .option("maxFilesPerTrigger", 5)
  .table("delta_keyval")
  ...
  .writeStream
  .format("delta")
  .trigger(processingTime="60 seconds")
  .option("checkpointLocation", "…")
  .toTable("delta_stream")

maxFilesPerTrigger (source)
● Maximum number of files ingested per microbatch
● If not specified, the stream falls back to the default value (1000)
● Always in effect for Delta sources (the default cap applies even when the option is unset)

trigger (sink)
● Uses a processing-time trigger
● Default: no trigger interval (the next microbatch starts as soon as the previous one finishes)
● Microbatch mode
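For the QPL sketches above it helps to keep the handle that toTable() returns; a minimal usage sketch (the checkpoint path is hypothetical):

# toTable() starts the stream and returns a StreamingQuery handle.
query = (
    spark.readStream
        .format("delta")
        .option("maxFilesPerTrigger", 5)
        .table("delta_keyval")
        .writeStream
        .format("delta")
        .trigger(processingTime="60 seconds")
        .option("checkpointLocation", "dbfs:/mnt/stream/simple_stream/checkpoint")
        .toTable("delta_stream")
)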
Delta source - Stream Mapping

Source Table Details
(Diagram: Data and Table History — version 0 contains 8 data files, indexed 0 to 7)
Delta source - Stream Mapping
Delta to Delta Stream

spark
  .readStream
  .format("delta")
  .option("maxFilesPerTrigger", 5)
  .table("delta_keyval")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "…")
  .toTable("delta_kv_stream")

(Diagram: source table -> stream -> destination table)
Delta source - Stream Mapping
Delta to Delta Stream

Query Progress Log for the first batch:

"runId" : "324f1e17-4fae-4e1a-...",
"batchId" : 0,
...
"sources" : [ {
  "description" : "DeltaSource[...]",
  "startOffset" : null,
  "endOffset" : {
    "sourceVersion" : 1,
    "reservoirId" : "744f0c51-48e6-482d-...",
    "reservoirVersion" : 0,
    "index" : 4,
    "isStartingVersion" : true
  },
  "numInputRows" : 10240
} ],

(Diagram: source table history, and destination table history after the first batch)
Delta source - Stream Mapping
First batch (BatchId 0)

Source history: version 0 (file0 … file7)
QPL for batch 0:
  startOffset: null
  endOffset: reservoirVersion 0, index 4, isStartingVersion true
  numInputRows: 10240

file0 to file4 are processed in batch 0, since maxFilesPerTrigger is set to 5.
Delta source - Stream Mapping
Second batch (BatchId 1)

Source history: version 0 (file0 … file7)
QPL for batch 1:
  startOffset: reservoirVersion 0, index 4, isStartingVersion true
  endOffset: reservoirVersion 0, index 7, isStartingVersion true
  numInputRows: 6144

file5 to file7 are processed in batch 1.
Delta source - Stream Mapping
Third batch (BatchId 2)

Source history: version 0 (file0 … file7)
QPL for batch 2:
  startOffset: reservoirVersion 0, index 7, isStartingVersion true
  endOffset: reservoirVersion 0, index 7, isStartingVersion true
  numInputRows: 0

No files are left for processing in batch 2.
Delta source - Stream Mapping
Third batch (BatchId 2, with new data)

Source history: version 0 (file0 … file7) and version 1 (file0 … file7)
QPL for batch 2:
  startOffset: reservoirVersion 0, index 7, isStartingVersion true
  endOffset: reservoirVersion 1, index 4, isStartingVersion false
  numInputRows: 10240

file0 to file4 from version 1 are processed in batch 2.
Delta source - Stream Mapping
Last batch (BatchId 4)

Source history: version 0 (file0 … file7), version 1 (file0 … file7), version 2 (the next possible version)
QPL for batch 4:
  startOffset: reservoirVersion 2, index -1, isStartingVersion false
  endOffset: reservoirVersion 2, index -1, isStartingVersion false
  numInputRows: 0

No new files are left for processing in batch 4; the offset already points at the next possible table version (2).
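The reservoirVersion in each offset corresponds to a version in the source table's Delta history, which can be inspected directly; a minimal sketch using standard Delta Lake SQL:

# Each commit to the source table creates a new version; streaming offsets
# reference those versions via reservoirVersion.
history = spark.sql("DESCRIBE HISTORY delta_keyval")
history.select("version", "operation", "operationMetrics").show(truncate=False)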
Streaming Checkpoint

Streaming Stages
(Diagram: a simple stream, the steps it goes through, and each step's status in the notebook)

Step 1: Construct Microbatch
● Fetch source offsets
● Commit offsets
Step 2: Process Microbatch
Step 3: Commit Microbatch
Streaming Checkpoint
● Tracks the streaming query
● All data is stored as JSON
● Contents (inspected in the sketch below):
  ● offsets
  ● metadata
  ● commits
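A minimal sketch for inspecting those contents in a Databricks notebook, reusing the hypothetical checkpoint path from the offset-mapping slide below:

checkpoint = "dbfs:/mnt/stream/simple_stream/checkpoint"

# List the checkpoint contents: expect offsets/, commits/ and the metadata file.
for entry in dbutils.fs.ls(checkpoint):
    print(entry.name)

# Every entry is plain JSON, so it can be read directly.
print(dbutils.fs.head(checkpoint + "/offsets/0"))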
Streaming Checkpoint - Offsets (written during Step 1: Construct Microbatch)
● One file per microbatch
● The offset file is generated when the batch starts
● The following details are stored:
  a. Batch and streaming state details
  b. Source details
Streaming Checkpoint - Offset mapping (Step 1: Construct Microbatch)

filename:
dbfs:/mnt/stream/simple_stream/checkpoint/offsets/0

content (maps to the startOffset/endOffset fields of the query progress log):
"startOffset" : null,
"endOffset" : {
  "sourceVersion" : 1,
  "reservoirId" : "483e5927-1c26-4af5-b7a2-",
  "reservoirVersion" : 0,
  "index" : 4,
  "isStartingVersion" : true
}

file0 to file4 are processed in batch 0 (the first batch).
Streaming Checkpoint - Metadata
● Metadata of the stream
● The stream id is generated when the stream starts
● It remains the same for the lifetime of the checkpoint directory
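A minimal sketch, reusing the same hypothetical checkpoint path; the id shown is the example value from the earlier QPL slide:

# The metadata file holds the stream id that the QPL reports as "id".
print(dbutils.fs.head("dbfs:/mnt/stream/simple_stream/checkpoint/metadata"))
# e.g. {"id":"f87419cf-e92c-4d8a-b801-0ac1518da5e6"}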
Streaming Checkpoint - Commits (written during Step 3: Commit Microbatch)
● One file per microbatch is generated
● The file is generated only when the batch completes
● On query restart, the number of commit files is compared with the number of offset files (see the sketch below):
  ○ Equal -> start a new batch
  ○ Not equal -> finish the previous batch first
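A minimal sketch of that restart check, not Spark's actual implementation, assuming the checkpoint directory is locally accessible:

import os

def restart_action(checkpoint_dir: str) -> str:
    # Offset files are written when a batch starts; commit files only once
    # it completes. A dangling offset file marks an unfinished batch.
    offsets = set(os.listdir(os.path.join(checkpoint_dir, "offsets")))
    commits = set(os.listdir(os.path.join(checkpoint_dir, "commits")))
    if offsets == commits:
        return "start a new batch"
    return "finish the previous batch first"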
Summary
● Query Progress Logs: the QPL is the source of truth for stream execution
● Delta Streaming: the Delta log <-> QPL mapping explains stream execution
● Stream Checkpoint: offsets and commits are key to maintaining stream state
What Next?
● Other variations of Delta streams (Trigger.Once, maxBytesPerTrigger)
● Delta table properties
● Common pitfalls & mitigation strategies
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.