STRATEGIC PARTNER
Azure Databricks –
performance tuning
Adrian Chodkowski
About me
• Adrian Chodkowski
• Microsoft Data Platform MVP
• Data Engineer & Architect at Elitmind
• Specialization: Microsoft data platform (Azure & on-premises)
• Data Community
• seequality.net
• Adrian.Chodkowski@outlook.com
• @Twitter: Adrian_SQL
• LinkedIn: http://tinyurl.com/adrian-sql
AGENDA
• Databricks disk cache
• AutoLoader
• Static & Dynamic Partition pruning
• File pruning
• Z-Ordering
• Additional tips & summary
Level: 300
Databricks disk cache
aka Delta cache, DBIO
Disk caching
• Disk caching is a proprietary Databricks feature previously called Delta Cache & DBIO,
• It is different from the Spark cache!
• Works with Parquet and Delta,
• Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in the nodes' local storage,
• Successive reads of the same data are then performed locally,
• After it is turned on, the cache is managed automatically, or caching can be forced using the CACHE SELECT syntax,
• You benefit most by using cache-accelerated worker instance types.
Disk cache vs Spark cache
• Prefer the disk cache!
Disk caching - configuration
• spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
• Be careful with autoscaling:
• spark.databricks.io.cache.maxDiskUsage – disk space per node reserved for cached data, in bytes,
• spark.databricks.io.cache.maxMetaDataCache – disk space per node reserved for cached metadata, in bytes,
• spark.databricks.io.cache.compression.enabled – whether the cached data should be stored in a compressed format (see the sketch below).
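A minimal configuration sketch (PySpark), assuming a hypothetical sales table – the config keys are the ones listed above, the size values are only examples:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")          # cached data per node
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")       # cached metadata per node
spark.conf.set("spark.databricks.io.cache.compression.enabled", "false")

# optionally pre-warm the cache instead of waiting for the first read
spark.sql("CACHE SELECT * FROM sales WHERE year = 2023")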
Ingestion
With Autoloader
Autoloader
• Functionality based on Structured Streaming that can identify and load new files,
• Works in two modes:
• Directory Listing – internally lists the structure of files and partitions; can be done incrementally,
• File Notification – leverages Event Grid and storage queues to track the appearance of new files,
• Avoid overwriting files,
• Have a backfill strategy defined!
• In many cases consider it the default loading mechanism (a minimal sketch follows this list)!
• There is also COPY INTO!
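A minimal Auto Loader sketch, assuming hypothetical source and checkpoint paths and a bronze.events target table (only standard cloudFiles options are used):

source_path = "abfss://landing@mystorage.dfs.core.windows.net/events/"                 # assumed path
checkpoint_path = "abfss://landing@mystorage.dfs.core.windows.net/_checkpoints/events/"

(spark.readStream
    .format("cloudFiles")                                 # Auto Loader source
    .option("cloudFiles.format", "json")                  # format of the incoming files
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(source_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                           # pick up new files, then stop
    .toTable("bronze.events"))

Switching to File Notification mode is mainly a matter of setting cloudFiles.useNotifications to true (plus the required Event Grid / storage queue permissions).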
Partitioning
Dynamic Pruning & Design
Partitioning
• Division of a table into a hierarchy of folders,
• Can be beneficial when reading data (partition discovery) and when optimizing (OPTIMIZE ... WHERE on the partition key),
• Databricks recommendation:
• don't use it for tables smaller than 1 TB,
• each partition should be > 1 GB,
• When reading with a filter on the partition key, only the matching partitions are read (the rest are skipped),
• Can be created using the PARTITIONED BY / PARTITION BY keywords (a minimal example follows the layout sketch below),
• Don't use CREATE TABLE ... PARTITIONED BY ... AS SELECT – it adds Hive overhead and can take ages to finish!
Example partition folder layout:
Year=2003/
  Month=01/  *.parquet  *.parquet
  Month=02/  *.parquet  *.parquet
  Month=03/  *.parquet  *.parquet
  Month=04/  *.parquet  *.parquet
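A minimal sketch of creating such a layout for a hypothetical sales table (table and column names are assumptions):

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        id     BIGINT,
        amount DOUBLE,
        Year   INT,
        Month  INT
    )
    USING DELTA
    PARTITIONED BY (Year, Month)
""")

# the same layout via the DataFrame writer (df is an existing DataFrame with the columns above)
df.write.format("delta").mode("append").partitionBy("Year", "Month").saveAsTable("sales")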
Dynamic Partition Pruning
• Especially useful in star schemas,
• Introduced in Spark 3.0,
• Let's assume the query:
SELECT d.c1, f.c2
FROM Fact AS f
JOIN Dimension AS d
ON f.join_key = d.join_key
WHERE d.c2 = 10
• Since there is a filter on one table (d.c2 = 10), DPP can internally create a subquery:
SELECT d.join_key FROM Dimension AS d
WHERE d.c2 = 10;
• and then broadcast this subquery result, so that it can be used to prune partitions of the Fact table (a small sketch of observing this follows below).
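A small sketch of observing DPP on the query above, assuming Fact is a Delta table partitioned on join_key – the configuration flag is standard Spark, everything else is illustrative:

# DPP is enabled by default in Spark 3.x; the flag only makes that explicit
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

spark.sql("""
    SELECT d.c1, f.c2
    FROM Fact AS f
    JOIN Dimension AS d ON f.join_key = d.join_key
    WHERE d.c2 = 10
""").explain()
# in the physical plan the Fact scan should show a dynamic pruning expression
# in its partition filters instead of reading all partitions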
File pruning
By using stats
File pruning
• In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible,
• Delta Lake automatically collects metadata (min/max statistics) about data files, so files can be skipped without reading them,
• Statistics are collected for the first 32 columns (this can be changed – see the sketch after the example below).
SELECT * FROM Dimension AS d
WHERE d.category_id IN (1, 2, 5);

File   | Column      | Min | Max
File 1 | Category_id | 1   | 2
File 2 | Category_id | 1   | 10
File 3 | Category_id | 6   | 10
File 4 | Category_id | 8   | 20
File 5 | Category_id | 4   | 100
File 6 | Category_id | 1   | 1

With these statistics, Files 3 and 4 can be skipped entirely – their min/max ranges do not overlap the requested values (1, 2, 5) – while the remaining files still have to be read.
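A small sketch of tuning how many leading columns get statistics, assuming a Delta table named dimension – the table property is standard Delta, the value is only an example:

# collect data-skipping statistics only for the first 8 columns of this table
spark.sql("""
    ALTER TABLE dimension
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')
""")

# verify the property (and other table details)
spark.sql("DESCRIBE DETAIL dimension").select("properties").show(truncate=False)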
Z-Ordering
By using OPTIMIZE
OPTIMIZE Z-Order
• OPTIMIZE optimizes the file layout of a Delta Lake table,
• By default OPTIMIZE sets the max file size to 1 GB (1073741824 bytes),
• You can control the size using spark.databricks.delta.optimize.maxFileSize (see the sketch after the examples below),
• It can be used as plain bin-packing or with Z-ordering by a column:
• Bin-packing – an idempotent technique that aims to produce evenly balanced data files with respect to their size on disk (not the number of rows),
• Z-ordering – a non-idempotent technique that aims to produce evenly balanced data files with respect to the number of rows (not the size on disk).
OPTIMIZE events
OPTIMIZE events WHERE date >= '2017-01-01'
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
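A minimal sketch of steering the target file size before running OPTIMIZE – the config key is the one named above, the 128 MB value is only an example:

# target ~128 MB files instead of the default 1 GB (value in bytes)
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 128 * 1024 * 1024)

spark.sql("OPTIMIZE events ZORDER BY (eventType)")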
OPTIMIZE Z-Order
Before OPTIMIZE table ZORDER BY (category_id):
File 1 (800 MB), File 2 (2200 MB), File 3 (900 MB), File 4 (300 MB), File 5 (100 MB)
File   | Column      | Min | Max
File 1 | Category_id | 1   | 2
File 2 | Category_id | 1   | 10
File 3 | Category_id | 6   | 10
File 4 | Category_id | 8   | 20
File 5 | Category_id | 4   | 100

After OPTIMIZE table ZORDER BY (category_id):
File 1 (1 GB), File 2 (1 GB), File 3 (1 GB)
File   | Column      | Min | Max
File 1 | Category_id | 1   | 10
File 2 | Category_id | 11  | 20
File 3 | Category_id | 20  | 100
Dynamic File Pruning
• Files can be skipped based on join values, not only literal values,
• To make it happen, the following requirements must be met (a configuration sketch follows this list):
• the inner (probe) table being joined is in Delta format,
• the join type is INNER or LEFT-SEMI,
• the join strategy is BROADCAST HASH JOIN,
• the number of files in the inner table is greater than the value set in spark.databricks.optimizer.deltaTableFilesThreshold (default 1000),
• spark.databricks.optimizer.dynamicFilePruning should be true (the default),
• the size of the inner table should be more than spark.databricks.optimizer.deltaTableSizeThreshold (default 10 GB).
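A minimal sketch of inspecting and adjusting these thresholds – the key names and defaults are the ones listed above, the new values are only examples:

# dynamic file pruning is on by default
spark.conf.get("spark.databricks.optimizer.dynamicFilePruning")

# lower the thresholds so that smaller Delta tables also qualify (bytes / file count)
spark.conf.set("spark.databricks.optimizer.deltaTableSizeThreshold", 1 * 1024 * 1024 * 1024)
spark.conf.set("spark.databricks.optimizer.deltaTableFilesThreshold", 100)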
Other
• Use the newest Databricks Runtime,
• For fast UPDATE/MERGE, rewrite the smallest possible number of files: set
spark.databricks.delta.optimize.maxFileSize to 16-128 MB or turn on optimized writes,
• Don't use Python or Scala UDFs if a native function exists – transferring data between Python
and Spark requires serialization, which drastically slows down queries (see the sketch after this list),
• Move numeric columns, keys, and high-cardinality query predicates to the left; move long strings that are
not distinct enough for stats collection to the right (only the first 32 columns get statistics),
• OPTIMIZE benefits from Compute Optimized clusters (because of the heavy encoding and
decoding of Parquet files),
• Think about spot instances,
• For some operations consider Databricks Standard.
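A small illustration of the UDF point, assuming a DataFrame df with a string column name (both are hypothetical):

from pyspark.sql import functions as F

# Python UDF: every row is serialized between the JVM and a Python worker
@F.udf("string")
def to_upper_udf(value):
    return value.upper() if value is not None else None

slow_df = df.withColumn("name_upper", to_upper_udf(F.col("name")))

# built-in function: stays inside the JVM (and can use Photon), no serialization
fast_df = df.withColumn("name_upper", F.upper(F.col("name")))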
Adaptive Query Execution
• A game changer in Spark 3.x,
• For example: initially a SortMergeJoin is chosen, but once the input stages complete
the plan is updated to use a BroadcastHashJoin.
Other
• Turn on Adaptive Query Execution (spark.sql.adaptive.enabled, on by default – see the sketch below)
• Turn Coalesce Partitions on (spark.sql.adaptive.coalescePartitions.enabled)
• Turn Skew Join on (spark.sql.adaptive.skewJoin.enabled)
• Turn Local Shuffle Reader on (spark.sql.adaptive.localShuffleReader.enabled)
• Adjust the Broadcast Join threshold (spark.sql.autoBroadcastJoinThreshold)
• Don't prefer SortMergeJoin: spark.sql.join.preferSortMergeJoin -> false
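A compact sketch setting the switches above in one place – the keys are standard Spark SQL configs, the broadcast threshold value is only an example:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")   # raise if your dimensions are larger
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")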
• Turn off stats collection (dataSkippingNumIndexedCols = 0)
• OPTIMIZE ZORDER BY the merge keys between Bronze & Silver
• Turn on Optimized Writes (see the sketch after this list)
• Restructure columns for data skipping
• OPTIMIZE ZORDER BY join keys or high-cardinality columns used in WHERE
• Turn on Optimized Writes
• Enable the Databricks IO (disk) cache
• Consider using Photon
• Consider using Premium Storage
• Build pre-aggregated tables
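A minimal sketch of the table-level items above, assuming hypothetical bronze.events and silver.sales tables and a customer_id key – the Delta table properties and the cache config are standard:

# raw layer: no data-skipping stats needed
spark.sql("""
    ALTER TABLE bronze.events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '0')
""")

# curated layer: optimized writes + auto compaction
spark.sql("""
    ALTER TABLE silver.sales SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

spark.conf.set("spark.databricks.io.cache.enabled", "true")   # Databricks IO (disk) cache
spark.sql("OPTIMIZE silver.sales ZORDER BY (customer_id)")     # assumed join / filter key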
Thank you!
STRATEGIC PARTNER