Harikrishnan Kunhumveettil & Mathan Pillai
Operating and Supporting Delta Lake in Production
Who we are?
Mathan Pillai
Currently: Sr. TSE, Databricks. Areas: Spark SQL, Delta, Structured Streaming
Previously: Sr. TSE, MapR; Hadoop Tech Lead, Nielsen
Harikrishnan Kunhumveettil
Currently: Sr. TSE, Databricks. Areas: Spark SQL, Delta, Structured Streaming
Previously: Tech Lead, Intersys Consulting; Sr. Big Data Consultant, Saama Technologies
Agenda
■ Delta Lake in Production - Data
○ Optimize and Auto-Optimize - Overview
○ Choosing the right strategy - The What
○ Choosing the right strategy - The When
○ Choosing the right strategy - The Where
■ Delta Lake in Production - Metadata
○ Sneak Peek Into Delta Log
○ Delta Log Configs
○ Delta Log Misconception
○ Delta Log Exceptions
○ Tips & Tricks
Delta Lake in Production - Data
Optimize and Auto-Optimize - In a nutshell
OPTIMIZE
▪ Bin-packing/compaction
▪ Handles the small-file problem
▪ Idempotent
▪ Incremental
▪ Creates files of 1 GB or 10M records
▪ Controlled by optimize.maxFileSize
OPTIMIZE + ZORDER
▪ Helps in data skipping
▪ Uses range partitioning
▪ Hilbert curve in preview
▪ Partially incremental
▪ Supports all/new/minCubeSize
▪ Controlled by optimize.zorder.mergeStrategy.minCubeSize.threshold
Optimize Write
▪ Often incorrectly referred to as Auto-optimize
▪ Introduces an extra shuffle phase
▪ Creates row-compressed data of 512 MB (binSize)
▪ Output files ~128 MB
▪ Controlled by optimizeWrite.binSize
Auto-Compaction
▪ Mini-Optimize
▪ Creates files as big as 128 MB
▪ Post-commit action
▪ Triggered when there are more than 50 files/directory
▪ Controlled by autoCompact.minNumFiles and autoCompact.maxFileSize
Note: All configurations take the prefix "spark.databricks.delta", e.g. spark.databricks.delta.optimizeWrite.binSize
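The features above are switched on per table or per session. A sketch of both styles, using the property and config names documented for Databricks Delta (`my_delta_table` is a placeholder name):

```scala
// Per-table: future writes to this table use Optimize Write and Auto-Compaction.
spark.sql("""
  ALTER TABLE my_delta_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )""")

// Per-session: applies to all Delta writes from this cluster/session.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```

Table properties travel with the table, so downstream writers inherit them; session configs override per workload.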
Choosing the right strategy - The What?
● Optimize writes:
○ Misconception: that it does not work with streaming workloads (it does)
○ Makes life easy for OPTIMIZE and VACUUM
○ In terms of number of files, map-only writes can be very expensive; Optimize writes can do magic!
Without Optimize Write: 3.2 PB table, ~700 TB input data, ~400 TB new writes; OPTIMIZE takes ~6-8 hours, run 3 times/day
With OPTIMIZE WRITE: OPTIMIZE job takes 2-3 hours, run 4 times/day; more than 40% resource saved on OPTIMIZE
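A back-of-the-envelope check of the quoted saving, assuming the midpoints of the runtimes reported above (these averages are an assumption, not from the deck):

```scala
// Cluster-hours spent on OPTIMIZE per day, before and after Optimize Write.
val clusterHoursBefore = 7.0 * 3 // ~7h average per OPTIMIZE, 3 runs/day
val clusterHoursAfter  = 2.5 * 4 // ~2.5h average per OPTIMIZE, 4 runs/day
val savedPct = (clusterHoursBefore - clusterHoursAfter) / clusterHoursBefore * 100
// ~52%, consistent with "more than 40% resource saved"
```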
Choosing the right strategy - The What?
● Z-Order vs Partition By
○ Z-order is better than creating a large number of small files.
○ More effective use of the DBIO cache through handling less metadata
Heavily partitioned: 326 TB, 3 partitions, 25 million files
Z-ordered: 326 TB, 2 partitions, 650k files
Choosing the right strategy - The What?
// Count files per partition using Databricks' internal DeltaLog API
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
val deltaPath = "<table_path>"
val deltaLog = DeltaLog(spark, new Path(deltaPath + "/_delta_log"))
val currentFiles = deltaLog.snapshot.allFiles
display(currentFiles.groupBy("partitionValues.col").count().orderBy($"count".desc))
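Once you have the per-partition file counts, a quick way to spot problem partitions is to flag any partition holding far more files than the average. A minimal pure-Scala sketch (the threshold factor is arbitrary):

```scala
// Flag partitions whose file count exceeds `factor` times the average.
def skewedPartitions(counts: Map[String, Long], factor: Double): Set[String] = {
  val avg = counts.values.sum.toDouble / counts.size
  counts.collect { case (partition, c) if c > avg * factor => partition }.toSet
}

// Hypothetical counts collected from the snippet above:
skewedPartitions(Map("2020-01" -> 1000L, "2020-02" -> 10L, "2020-03" -> 10L), 2.0)
// → Set("2020-01")
```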
Choosing the right strategy - The When?
● Auto-Optimize runs on the same cluster during/after a write.
● Optimize - a trade-off between read performance and cost
● Delay Z-Ordering if you are continuously adding data to an active partition,
○ and active reads are not on the latest partition
○ optimize.zorder.mergeStrategy.minCubeSize.threshold is 100 GB by default
○ Reducing the value makes the Z-order run more time-efficient, but degrades read performance
● Should I always run OPTIMIZE + VACUUM?
○ VACUUM happens on the Spark driver.
○ Deletes roughly 200k files/hour in ADLS
○ Deletes roughly 300k files/hour in AWS S3
○ DRY RUN gives the estimate
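The deletion rates above give a rough way to budget VACUUM wall-clock time before running it. A sketch (the rates are rules of thumb from the slide, not guarantees):

```scala
// Rough VACUUM runtime estimate from a stale-file count and a deletion rate.
def vacuumHoursEstimate(staleFiles: Long, filesDeletedPerHour: Long): Double =
  staleFiles.toDouble / filesDeletedPerHour

vacuumHoursEstimate(1000000L, 200000L) // 1M stale files on ADLS: ~5 hours
vacuumHoursEstimate(1000000L, 300000L) // 1M stale files on S3: ~3.3 hours
```

Get the stale-file count itself from `VACUUM ... DRY RUN` before committing to the real run.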
Choosing the right strategy - The Where?
● Auto-optimize runs on the same cluster during/after a write.
● Z-ordering is CPU-intensive.
○ Involves Parquet decoding and encoding
○ Consider compute-optimized instances over general-purpose ones.
● Always have a WHERE clause in OPTIMIZE queries
● Use auto-scaling clusters for VACUUM-only workloads
Delta Lake in Production - Metadata
Delta Lake Transaction Log
■ Sneak Peek Into Delta Log
■ Delta Log Configs
■ Delta Exceptions
■ Tips & Tricks
Sneak Peek Into Delta Log
Each commit version records the Who, What, When, and Where of a change:
Version N: Who What When Where
Version N-1: Who What When Where
Version N-2: Who What When Where
The log directory contains: .json files, .crc files, .checkpoint files, _last_checkpoint
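The files under `_delta_log` follow a fixed naming convention: commit files are JSON named by version, zero-padded to 20 digits, and (single-part) checkpoints use the same padding with a `.checkpoint.parquet` suffix. A small pure-Scala sketch:

```scala
// Build _delta_log file names for a given commit version.
def commitFileName(version: Long): String = f"$version%020d.json"
def checkpointFileName(version: Long): String = f"$version%020d.checkpoint.parquet"

commitFileName(10L)     // "00000000000000000010.json"
checkpointFileName(10L) // "00000000000000000010.checkpoint.parquet"
```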
Delta Log Configs
LogRetentionDuration (JSON commit files)
How long log files are kept
▪ %sql
ALTER TABLE delta_table_name
SET TBLPROPERTIES
('delta.logRetentionDuration' = '7 days')
CheckpointRetentionDuration (PARQUET checkpoint files)
How long checkpoint files are kept
▪ %sql
ALTER TABLE delta_table_name
SET TBLPROPERTIES
('delta.checkpointRetentionDuration' = '7 days')
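A simplified sketch of what these retention windows mean: a log entry becomes a cleanup candidate once its commit timestamp is older than the retention duration (in the real implementation a newer checkpoint must also cover it; that second condition is omitted here for brevity):

```scala
import java.time.{Duration, Instant}

// True when a commit is older than the configured retention window.
def pastRetention(commitTime: Instant, now: Instant, retention: Duration): Boolean =
  Duration.between(commitTime, now).compareTo(retention) > 0

val retention = Duration.ofDays(7) // mirrors 'delta.logRetentionDuration' = '7 days'
pastRetention(Instant.parse("2021-05-01T00:00:00Z"),
              Instant.parse("2021-05-10T00:00:00Z"), retention) // 9 days old: true
```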
Delta Exceptions
ConcurrentModificationException - an analogy
You can drive in parallel on a freeway, but not in a tunnel.
Delta Exceptions
ConcurrentModificationException
Verify whether concurrent updates happened to the same partition
ConcurrentAppendException
A concurrent operation added files to the same partition that your operation reads from
ConcurrentDeleteReadException
A concurrent operation deleted a file that your operation read
ConcurrentDeleteDeleteException
A concurrent operation deleted a file that your operation deletes
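Conflict exceptions like these are usually handled by re-running the failed transaction after re-reading the table state. A generic retry sketch; the stand-in exception class keeps it self-contained (in a real job you would catch the concrete Delta exception types instead):

```scala
// Hypothetical stand-in for Delta's concurrent-conflict exceptions.
class ConcurrentConflict(msg: String) extends RuntimeException(msg)

// Re-run `op` up to maxAttempts times on conflict; rethrow on the last attempt.
def withRetries[T](maxAttempts: Int)(op: () => T): T = {
  var attempt = 1
  var result: Option[T] = None
  while (result.isEmpty) {
    try { result = Some(op()) }
    catch {
      case _: ConcurrentConflict if attempt < maxAttempts =>
        attempt += 1 // optionally back off before re-reading and re-writing
    }
  }
  result.get
}
```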
Tips & Tricks
How to find what records were added between 2 versions of a Delta table?
%sql
SELECT * FROM delta_table_name@v2
EXCEPT ALL
SELECT * FROM delta_table_name@v0
Tips & Tricks
How to find what files were added in a specific version of a Delta table?
%scala
display(spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
  .where("add is not null")
  .select("add.path"))
Tips & Tricks
How to find which delta commit removed a specific file?
val oldestVersionAvailable = 0L // fill in: oldest commit version in _delta_log
val newestVersionAvailable = 0L // fill in: newest commit version in _delta_log
val pathToDeltaTable = "" // fill in: table path
val pathToFileName = "" // fill in: file name to search for
(oldestVersionAvailable to newestVersionAvailable).foreach { version =>
  val df1 = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%020d.json")
  if (df1.columns.toSeq.contains("remove")) {
    val df2 = df1.where("remove is not null").select("remove.path")
    val df3 = df2.filter('path.contains(pathToFileName))
    if (df3.count > 0)
      print(s"Commit version $version removed the file $pathToFileName\n")
  }
}
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.