SlideShare uma empresa Scribd logo
1 de 16
Baixar para ler offline
Optimizing Merge on Delta Lake
Justin Breese
Who am I?
Justin Breese
justin.breese@databricks.com | Los Angeles
Senior Strategic Solutions Architect
Drums, guitar, soccer, and old Porsches
Agenda
▪ Merge basics
▪ Partition/File pruning
▪ OperationMetrics
▪ Large merge tips
▪ Sample configs
▪ Various ramblings and observations
Merge overview
▪ Phase 1: Find the input files in target that are touched by the rows that
satisfy the condition and verify that no two source rows match with the
same target row [innerJoin]
▪ Phase 2: Read the touched files again and write new files with updated
and/or inserted rows
▪ Phase 3: Use the Delta protocol to atomically remove the touched files and
add the new files
Merge overview: phase 2 double click
Phase 2: Read the touched files again and write new files with updated and/or
inserted rows.
The type of join can vary depending on the conditions of the merge:
▪ Insert only merge (e.g. no updates/deletes) → leftAntiJoin on the source to
find the inserts
▪ Matched only clauses (e.g. when matched) → rightOuterJoin
▪ Else (e.g. you have updates, deletes, and inserts) → fullOuterJoin
Merge is really these three phases. Now that we know that, we can figure out
how to optimize each phase.
Merge basics
▪ Smaller workers (2xlarge) perform better than larger workers:
16xlarge with 1200 cores
2xlarge with 1200 cores
Same merge, different instances
Merge basics
▪ Tale of two joins: inner join and full outer join
▪ Want to go faster? Partition pruning and file pruning
▪ Unpersist dataframes that you don’t need - clear up your memory:
df.unpersist
System.gc
▪ Change Delta file size depending on your use case (default 1GB)
spark.databricks.delta.optimize.maxFileSize sizeInBytes
Write intensive: 32MB or less
Read intensive: 1GB (default Delta size)
*We are working on changing this for you automatically
▪ Normal Spark rules apply: partitionSize, shuffle partitions, etc.
innerJoin
full outer join +
optimizeWrite (optional)
write to s3/adls
Prunes: not just delicious juice for my grandparents
▪ Partition prune: disregard specific partitions
▪ File prune: disregard specific files within a partition
▪ You have to be explicit about both of these - kb on this topic - if
you do not tell Delta to prune, then it won’t
→ We will improve this in the future to be more automagic
▪ Prune on the left (source) and the right (target)
Partition Prune Example
// get a partition prune string from the date partition
var ListofDatesList = sourceDF.select($"date").distinct.collect.map(x => x.apply(0).toString())
var partitionPruneString = "'"+ListofDatesList.mkString("','")+"'"
// use the pruning string for partitions
val source = sourceDF.filter(s"""date in (${ partitionPruneString })""") // pruned the left side
baselineTable .as("baseline")
.merge(broadcast(source.as("inputs")), "baseline.date IN (" + partitionPruneString + ")" + "AND baseline.compositePk =
inputs.compositePk ")
.whenMatched("inputs.deleted = true")
.delete()
.whenMatched("inputs.deleted = false")
.updateExpr…...
OMG partition pruning!
Matching PK
You will know it worked if PartitionCount < totalPartitions in your table for the physical plan
Broadcast if you can
File Prune Example
// get a partition prune string
var ListofDatesList = sourceDF.select($"date").distinct.collect.map(x => x.apply(0).toString())
var partitionPruneString = "'"+ListofDatesList.mkString("','")+"'"
// use the pruning string for partitions
val source = sourceDF.filter(s"""date in (${ partitionPruneString })""") // pruned the left side
baselineTable .as("baseline")
.merge(broadcast( source.as("inputs")), "baseline.date IN (" + partitionPruneString + ") AND baseline.zOrderedCol < 123 AND
baseline.compositePk = inputs.compositePk ")
.whenMatched("inputs.deleted = true")
.delete()
.whenMatched("inputs.deleted = false")
.updateExpr…...
OMG partition pruning!
Matching PK
File pruning!
Operation Metrics (%sql describe history tableName)
▪ Use DBR 6.5+ to get improved operationMetrics
▪ They are THE source of truth for a DML event
▪ Things to look at:
→ numTargetRowsCopied (this is the enemy!)
→ numOutputBytes
→ numTargetFilesAdded
→ numTargetRowsInserted/Updated/Deleted
Operation Metrics continued
If numTargetRowsCopied is insanely high (relative
to the amount of rows in the entire table)
▪ Rethink how you’re laying out your data:
→ Partition differently
→ Use a zOrder
→ Use smaller file sizes
Large merge tips
▪ s3 bucket: write at the root - s3 parallelism is defined by the prefix
Each large table should have its own s3 bucket and another bucket for
checkpointing (if it is a stream)
Good! :-)
s3:/jbreese-databricks-bucket
--year=2019
--year=2018
Bad! :-/
s3:/jbreese-databricks-bucket/data/tableA
--year=2019
--year=2018
s3:/jbreese-databricks-bucket/data/tableB
--year=2019
--year=2018
‘data’ is considered a prefix and you are subject to s3 prefix limits
▪ s3 prefix limits: 3500 reads / 5500 writes per second
* yes I know that S3 will eventually re-partition - it just depends on how long it takes or how patient you are
Large merge tips
▪ Using a huge cluster (more than 900 cores):
optimizedWrites along with Delta random prefixes and write at root
optimizedWrites ensure 1 core writes to 1 partition (via a final shuffle)
Configs
spark.hadoop.fs.s3a.multipart.threshold 204857600
spark.databricks.delta.optimizeWrite true
spark.databricks.delta.optimizeWrite.numShuffleBlocks xxxxxx
spark.databricks.delta.properties.defaults.randomizeFilePrefixes true
spark.databricks.optimizer.dynamicFilePruning true
This works: I wrote a 2.7TB changeset with 2400 cores in 17 minutes - no s3
throttling
Final recap
▪ Merge basics
▪ Partition/File pruning
▪ OperationMetrics
▪ Large merge tips
▪ Sample configs
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Mais conteúdo relacionado

Mais procurados

Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 

Mais procurados (20)

Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 

Semelhante a Delta Lake: Optimizing Merge

Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm treesTilak Patidar
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBJason Terpko
 
Db2 performance tuning for dummies
Db2 performance tuning for dummiesDb2 performance tuning for dummies
Db2 performance tuning for dummiesAngel Dueñas Neyra
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationDatabricks
 
In-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesIn-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesAleksander Alekseev
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBAntonios Giannopoulos
 
SQLDAY 2023 Chodkowski Adrian Databricks Performance Tuning
SQLDAY 2023 Chodkowski Adrian Databricks Performance TuningSQLDAY 2023 Chodkowski Adrian Databricks Performance Tuning
SQLDAY 2023 Chodkowski Adrian Databricks Performance TuningSeeQuality.net
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionDatabricks
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareDatabricks
 
MySQL innoDB split and merge pages
MySQL innoDB split and merge pagesMySQL innoDB split and merge pages
MySQL innoDB split and merge pagesMarco Tusa
 
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...Citus Data
 
Control dataset partitioning and cache to optimize performances in Spark
Control dataset partitioning and cache to optimize performances in SparkControl dataset partitioning and cache to optimize performances in Spark
Control dataset partitioning and cache to optimize performances in SparkChristophePraud2
 
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Citus Data
 
Deployment Preparedness
Deployment Preparedness Deployment Preparedness
Deployment Preparedness MongoDB
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Databricks
 
Apache Cassandra Opinion and Fact
Apache Cassandra Opinion and FactApache Cassandra Opinion and Fact
Apache Cassandra Opinion and Factmediumdata
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersDatabricks
 
Cloudera Impala technical deep dive
Cloudera Impala technical deep diveCloudera Impala technical deep dive
Cloudera Impala technical deep divehuguk
 

Semelhante a Delta Lake: Optimizing Merge (20)

Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Db2 performance tuning for dummies
Db2 performance tuning for dummiesDb2 performance tuning for dummies
Db2 performance tuning for dummies
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
 
In-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesIn-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several times
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
SQLDAY 2023 Chodkowski Adrian Databricks Performance Tuning
SQLDAY 2023 Chodkowski Adrian Databricks Performance TuningSQLDAY 2023 Chodkowski Adrian Databricks Performance Tuning
SQLDAY 2023 Chodkowski Adrian Databricks Performance Tuning
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why Care
 
MySQL innoDB split and merge pages
MySQL innoDB split and merge pagesMySQL innoDB split and merge pages
MySQL innoDB split and merge pages
 
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
 
Control dataset partitioning and cache to optimize performances in Spark
Control dataset partitioning and cache to optimize performances in SparkControl dataset partitioning and cache to optimize performances in Spark
Control dataset partitioning and cache to optimize performances in Spark
 
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi...
 
Deployment Preparedness
Deployment Preparedness Deployment Preparedness
Deployment Preparedness
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Apache Cassandra Opinion and Fact
Apache Cassandra Opinion and FactApache Cassandra Opinion and Fact
Apache Cassandra Opinion and Fact
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Cloudera Impala technical deep dive
Cloudera Impala technical deep diveCloudera Impala technical deep dive
Cloudera Impala technical deep dive
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 

Último (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Delta Lake: Optimizing Merge

  • 1. Optimizing Merge on Delta Lake Justin Breese
  • 2. Who am I? Justin Breese justin.breese@databricks.com | Los Angeles Senior Strategic Solutions Architect Drums, guitar, soccer, and old Porsches
  • 3. Agenda ▪ Merge basics ▪ Partition/File pruning ▪ OperationMetrics ▪ Large merge tips ▪ Sample configs ▪ Various ramblings and observations
  • 4. Merge overview ▪ Phase 1: Find the input files in target that are touched by the rows that satisfy the condition and verify that no two source rows match with the same target row [innerJoin] ▪ Phase 2: Read the touched files again and write new files with updated and/or inserted rows ▪ Phase 3: Use the Delta protocol to atomically remove the touched files and add the new files
  • 5. Merge overview: phase 2 double click Phase 2: Read the touched files again and write new files with updated and/or inserted rows. The type of join can vary depending on the conditions of the merge: ▪ Insert only merge (e.g. no updates/deletes) → leftAntiJoin on the source to find the inserts ▪ Matched only clauses (e.g. when matched) → rightOuterJoin ▪ Else (e.g. you have updates, deletes, and inserts) → fullOuterJoin Merge is really these three phases. Now that we know that, we can figure out how to optimize each phase.
  • 6. Merge basics ▪ Smaller workers (2xlarge) perform better than larger workers: 16xlarge with 1200 cores 2xlarge with 1200 cores Same merge, different instances
  • 7. Merge basics ▪ Tale of two joins: inner join and full outer join ▪ Want to go faster? Partition pruning and file pruning ▪ Unpersist dataframes that you don’t need - clear up your memory: df.unpersist System.gc ▪ Change Delta file size depending on your use case (default 1GB) spark.databricks.delta.optimize.maxFileSize sizeInBytes Write intensive: 32MB or less Read intensive: 1GB (default Delta size) *We are working on changing this for you automatically ▪ Normal Spark rules apply: partitionSize, shuffle partitions, etc. innerJoin full outer join + optimizeWrite (optional) write to s3/adls
  • 8. Prunes: not just delicious juice for my grandparents ▪ Partition prune: disregard specific partitions ▪ File prune: disregard specific files within a partition ▪ You have to be explicit about both of these - kb on this topic - if you do not tell Delta to prune, then it won’t → We will improve this in the future to be more automagic ▪ Prune on the left (source) and the right (target)
  • 9. Partition Prune Example // get a partition prune string from the date partition var ListofDatesList = sourceDF.select($"date").distinct.collect.map(x => x.apply(0).toString()) var partitionPruneString = "'"+ListofDatesList.mkString("','")+"'" // use the pruning string for partitions val source = sourceDF.filter(s"""date in (${ partitionPruneString })""") // pruned the left side baselineTable .as("baseline") .merge(broadcast(source.as("inputs")), "baseline.date IN (" + partitionPruneString + ")" + "AND baseline.compositePk = inputs.compositePk ") .whenMatched("inputs.deleted = true") .delete() .whenMatched("inputs.deleted = false") .updateExpr…... OMG partition pruning! Matching PK You will know it worked if PartitionCount < totalPartitions in your table for the physical plan Broadcast if you can
  • 10. File Prune Example // get a partition prune string var ListofDatesList = sourceDF.select($"date").distinct.collect.map(x => x.apply(0).toString()) var partitionPruneString = "'"+ListofDatesList.mkString("','")+"'" // use the pruning string for partitions val source = sourceDF.filter(s"""date in (${ partitionPruneString })""") // pruned the left side baselineTable .as("baseline") .merge(broadcast( source.as("inputs")), "baseline.date IN (" + partitionPruneString + ") AND baseline.zOrderedCol < 123 AND baseline.compositePk = inputs.compositePk ") .whenMatched("inputs.deleted = true") .delete() .whenMatched("inputs.deleted = false") .updateExpr…... OMG partition pruning! Matching PK File pruning!
  • 11. Operation Metrics (%sql describe history tableName) ▪ Use DBR 6.5+ to get improved operationMetrics ▪ They are THE source of truth for a DML event ▪ Things to look at: → numTargetRowsCopied (this is the enemy!) → numOutputBytes → numTargetFilesAdded → numTargetRowsInserted/Updated/Deleted
  • 12. Operation Metrics continued If numTargetRowsCopied is insanely high (relative to the amount of rows in the entire table) ▪ Rethink how you’re laying out your data: → Partition differently → Use a zOrder → Use smaller file sizes
  • 13. Large merge tips ▪ s3 bucket: write at the root - s3 parallelism is defined by the prefix Each large table should have its own s3 bucket and another bucket for checkpointing (if it is a stream) Good! :-) s3:/jbreese-databricks-bucket --year=2019 --year=2018 Bad! :-/ s3:/jbreese-databricks-bucket/data/tableA --year=2019 --year=2018 s3:/jbreese-databricks-bucket/data/tableB --year=2019 --year=2018 ‘data’ is considered a prefix and you are subject to s3 prefix limits ▪ s3 prefix limits: 3500 reads / 5500 writes per second * yes I know that S3 will eventually re-partition - it just depends on how long it takes or how patient you are
  • 14. Large merge tips ▪ Using a huge cluster (more than 900 cores): optimizedWrites along with Delta random prefixes and write at root optimizedWrites ensure 1 core writes to 1 partition (via a final shuffle) Configs spark.hadoop.fs.s3a.multipart.threshold 204857600 spark.databricks.delta.optimizeWrite true spark.databricks.delta.optimizeWrite.numShuffleBlocks xxxxxx spark.databricks.delta.properties.defaults.randomizeFilePrefixes true spark.databricks.optimizer.dynamicFilePruning true This works: I wrote a 2.7TB changeset with 2400 cores in 17 minutes - no s3 throttling
  • 15. Final recap ▪ Merge basics ▪ Partition/File pruning ▪ OperationMetrics ▪ Large merge tips ▪ Sample configs
  • 16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.