Achieving Lakehouse Models with Spark 3.0

It’s very easy to be distracted by the latest and greatest approaches in technology, but sometimes there’s a reason old approaches stand the test of time. Star schemas and Kimball modelling are among those things that aren’t going anywhere, but as we move towards the “Data Lakehouse” paradigm, how appropriate is this modelling technique, and how can we harness the Delta Engine and Spark 3.0 to maximise its performance?

Achieving Lakehouse Models
with Spark 3.0
Simon Whiteley
Director of Engineering, Advancing Analytics

Agenda
Why Lakehouse?
Kimball Problems
Delta & Spark 3.0
▪ SCD & SQL Merge
▪ Dynamic Partition Pruning
▪ Adaptive Query Execution
Enabling the Lakehouse

The Lakehouse
RAW BASE ENRICHED
DELTA DELTA

When you think Warehouse…
We automatically think of Star
Schemas and Kimball warehousing
approaches.
A large central fact table with smaller
reference dimensions… some of which
aren’t so small

Literally Everyone
(All The time)
“You can’t use Kimball in a
Data Lake”

Three Historical Challenges
▪ Slowly Changing Dimensions
▪ Filtering Dimensions
▪ General SQL Performance

SCD - Enabling the Familiar
PrimaryKey  Address                 Current  EffectiveDate  EndDate
11          A new customer address  TRUE     03/08/2020     null
58          Yet another address     TRUE     03/08/2020     null
41          A different address     TRUE     03/08/2020     null

PrimaryKey  Address                 Current  EffectiveDate  EndDate
11          A new customer address  FALSE    03/08/2020     22/10/2020
11          An updated address      TRUE     22/10/2020     null
58          Yet another address     TRUE     03/08/2020     null
41          A different address     TRUE     03/08/2020     null
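
The change between the two tables above can be sketched in plain Python. This is only an illustration of the slowly changing dimension pattern shown on the slide, not the speaker's implementation: the function name `apply_scd2` and the use of dictionaries for rows are assumptions, and the dates are read as DD/MM/YYYY.

```python
from datetime import date


def apply_scd2(rows, primary_key, new_address, change_date):
    """Return a new row list with one address change applied, keeping history.

    Hypothetical helper: the current row for the key is closed off and a new
    current row is added directly after it.
    """
    result = []
    matched = False
    new_row = {
        "PrimaryKey": primary_key,
        "Address": new_address,
        "Current": True,
        "EffectiveDate": change_date,
        "EndDate": None,
    }
    for row in rows:
        if row["PrimaryKey"] == primary_key and row["Current"]:
            # Close the existing version instead of overwriting it.
            result.append({**row, "Current": False, "EndDate": change_date})
            result.append(new_row)
            matched = True
        else:
            result.append(row)
    if not matched:
        # Unknown key: treat the change as a brand new dimension member.
        result.append(new_row)
    return result


start = date(2020, 8, 3)
addresses = [
    {"PrimaryKey": 11, "Address": "A new customer address", "Current": True,
     "EffectiveDate": start, "EndDate": None},
    {"PrimaryKey": 58, "Address": "Yet another address", "Current": True,
     "EffectiveDate": start, "EndDate": None},
    {"PrimaryKey": 41, "Address": "A different address", "Current": True,
     "EffectiveDate": start, "EndDate": None},
]

updated = apply_scd2(addresses, 11, "An updated address", date(2020, 10, 22))
for row in updated:
    print(row)
```

The input list is left untouched and a new list is returned, which keeps the sketch easy to reason about.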

SCD - Merge Commands
MERGE INTO dataai.addresses AS original
USING updates
  ON original.primaryKey = updates.primaryKey
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
Available in SQL, Scala and Python APIs - the merge command has made many complex warehousing
jobs accessible to the wider Analytics community
This is enabled by the Delta file format
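
Since the slide mentions that the merge is also available from Python, the same upsert might look roughly like this with the Delta Lake Python merge builder. The function name `upsert_addresses` is an assumption, as is the idea that the caller has already loaded the target as a `DeltaTable` and the changes as a DataFrame.

```python
def upsert_addresses(target, updates):
    """Upsert `updates` into `target`, mirroring the SQL MERGE on the slide.

    Assumptions: `target` is a delta.tables.DeltaTable and `updates` is a
    Spark DataFrame with the same columns, including `primaryKey`.
    """
    (
        target.alias("original")
        .merge(updates.alias("updates"),
               "original.primaryKey = updates.primaryKey")
        .whenMatchedUpdateAll()      # WHEN MATCHED THEN UPDATE SET *
        .whenNotMatchedInsertAll()   # WHEN NOT MATCHED THEN INSERT *
        .execute()
    )


# Hypothetical usage on a cluster with Delta Lake installed:
#   from delta.tables import DeltaTable
#   target = DeltaTable.forName(spark, "dataai.addresses")
#   upsert_addresses(target, spark.table("updates"))
```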

Spark Partitioning
SELECT * FROM Sales WHERE Month = 3
SQL Query Action
Filtering performed by
selectively reading files
SALES
Month=1 Month=2
Month=3 Month=4
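
To make "selectively reading files" concrete, here is a small plain-Python sketch of static partition pruning. It does not use Spark; the file names and the `prune_partitions` helper are made up for illustration, and only the `Month=N` folder layout comes from the slide.

```python
# Hypothetical file listing for a Sales table partitioned by Month.
sales_files = {
    "Month=1": ["part-0000.parquet", "part-0001.parquet"],
    "Month=2": ["part-0000.parquet"],
    "Month=3": ["part-0000.parquet", "part-0001.parquet"],
    "Month=4": ["part-0000.parquet"],
}


def prune_partitions(files_by_partition, month):
    """Keep only the partition folder that can satisfy `WHERE Month = month`."""
    wanted = f"Month={month}"
    return {
        partition: files
        for partition, files in files_by_partition.items()
        if partition == wanted
    }


to_read = prune_partitions(sales_files, 3)
print(sorted(to_read))
```

Because the filter value is a literal in the query, the set of folders to read is known before any data is touched.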

Cross-Filter Spark 2.4
SELECT * FROM Sales JOIN Date
WHERE DateMonth = 3
SQL Query Action
SALES
Month=1 Month=2
Month=3 Month=4
DimDATE
Partition Keys not hit when
filtering on joined tables

Cross-Filter Spark 3.0
SELECT * FROM Sales JOIN Date
WHERE DateMonth = 3
SQL Query Action
SALES
Month=1 Month=2
Month=3 Month=4
DimDATE
Dynamic Partition Pruning
determines partition filters
during runtime
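
The idea on this slide can be sketched in plain Python as a simplified model, not Spark's actual implementation. It assumes, hypothetically, that Sales is joined to the Date dimension on a `Month` key matching the Sales partition column. The dimension is filtered first, and the keys that survive decide which fact partitions are scanned.

```python
def join_with_dynamic_pruning(sales_by_month, dim_date, date_month):
    """Join Sales to Date, scanning only partitions the filtered dimension needs."""
    # Step 1: apply the filter to the (small) dimension table.
    filtered_dates = [d for d in dim_date if d["DateMonth"] == date_month]

    # Step 2: at runtime, collect the join keys that survived the filter.
    partition_keys = {d["Month"] for d in filtered_dates}

    # Step 3: scan only the fact partitions matching those keys.
    scanned = sorted(k for k in sales_by_month if k in partition_keys)

    # Step 4: perform the join on the reduced input.
    dates_by_key = {}
    for d in filtered_dates:
        dates_by_key.setdefault(d["Month"], []).append(d)
    joined = [
        {**sale, **d}
        for key in scanned
        for sale in sales_by_month[key]
        for d in dates_by_key[key]
    ]
    return scanned, joined


# Made-up sample data: one dimension row per month, a few sales per partition.
dim_date = [{"Month": m, "DateMonth": m} for m in (1, 2, 3, 4)]
sales_by_month = {
    1: [{"SaleId": 1, "Amount": 10.0}],
    2: [{"SaleId": 2, "Amount": 20.0}],
    3: [{"SaleId": 3, "Amount": 30.0}, {"SaleId": 4, "Amount": 35.0}],
    4: [{"SaleId": 5, "Amount": 40.0}],
}

scanned, joined = join_with_dynamic_pruning(sales_by_month, dim_date, 3)
print(scanned, len(joined))
```

In Spark 3.0 this behaviour is controlled by `spark.sql.optimizer.dynamicPartitionPruning.enabled`, which is on by default.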

AQE in Spark 3.0
AQE will speed up common queries in a number of ways:
▪ Coalescing Shuffle Partitions
▪ Switching Join Strategies
▪ Optimizing Skew Joins
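
As a rough configuration sketch, these are the Spark SQL settings that relate to the features listed above. AQE is off by default in Spark 3.0, so it has to be switched on explicitly. The `enable_aqe` helper is an assumption for illustration; `spark` is an existing SparkSession.

```python
# Spark 3.0 settings related to Adaptive Query Execution.
AQE_SETTINGS = {
    # Master switch; join strategy switching happens once this is enabled.
    "spark.sql.adaptive.enabled": "true",
    # Coalescing shuffle partitions.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Optimizing skew joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark):
    """Apply the AQE settings to an existing SparkSession (hypothetical helper)."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
```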

Before AQE - Shuffle Coalescing
[Diagram: Read RDDs (2 tasks) -> Shuffle RDDs (200 tasks) -> Write RDDs (2 tasks)]

Using Spark 3.0 AQE - Shuffle Coalescing
[Diagram: Read RDDs (2 tasks) -> Shuffle RDDs (2 tasks) -> Write RDDs (2 tasks)]
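
The effect in the two diagrams can be approximated with a small greedy sketch: adjacent shuffle partitions are merged until a target size is reached. This is a simplified model rather than Spark's own code, and the partition sizes and the 64 MB target are invented for the example. The 200 starting partitions match Spark's default `spark.sql.shuffle.partitions`.

```python
def coalesce_partitions(sizes, target):
    """Greedily group adjacent partitions so each group stays within `target`.

    `sizes` holds per-partition sizes; the result is a list of groups, each a
    list of original partition indices. A single oversized partition still
    gets a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target:
            # Adding this partition would overshoot: close the current group.
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 200 small shuffle partitions of 640 KB each, with a 64 MB target (in KB).
shuffle_sizes_kb = [640] * 200
groups = coalesce_partitions(shuffle_sizes_kb, 64 * 1024)
print(len(shuffle_sizes_kb), "partitions coalesced into", len(groups))
```

With fewer, larger partitions the shuffle stage needs far fewer tasks, which is the difference the two diagrams show.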

The Data Lakehouse
Delta & Spark 3.0 enable the Lakehouse through:
▪ Enabling familiar (SQL) patterns
▪ Removing technical barriers
▪ Targeting performance of common
warehousing activities

Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Achieving Lakehouse Models with Spark 3.0

1. Achieving Lakehouse Models with Spark 3.0 Simon Whiteley Director of Engineering, Advancing Analytics

2. Agenda Why Lakehouse? Kimball Problems Delta & Spark 3.0 ▪ SCD & SQL Merge ▪ Dynamic Partition Pruning ▪ Adaptive Query Execution Enabling the Lakehouse

3. The Data Lakehouse

4. Analytics Evolution PARQUET Delta Lake

6. The Modern Warehouse RAW BASE PARQUET

7. The Lakehouse RAW BASE ENRICHED DELTA DELTA

8. Lakehouse Barriers

9. When you think Warehouse… We automatically think of Star Schemas and Kimball warehousing approaches. A large central fact table with smaller reference dimensions… some of which aren’t so small

10. Literally Everyone (All The time) “You can’t use Kimball in a Data Lake”

11. Three Historical Challenges ▪ Slowly Changing Dimensions ▪ Filtering Dimensions ▪ General SQL Performance

12. Slowly Changing Dimensions

13. SCD - Enabling the Familiar PrimaryKey Address Current EffectiveDate EndDate 11 A new customer address TRUE 03/08/2020 null 58 Yet another address TRUE 03/08/2020 null 41 A different address TRUE 03/08/2020 null PrimaryKey Address Current EffectiveDate EndDate 11 A new customer address FALSE 03/08/2020 22/10/2020 11 An updated address TRUE 22/10/2020 null 58 Yet another address TRUE 03/08/2020 null 41 A different address TRUE 03/08/2020 null

14. SCD - Merge Commands MERGE INTO dataai.addresses as original USING updates ON original.primaryKey = updates.primaryKey WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * Available in SQL, Scala and Python APIs - the merge command has made many complex warehousing jobs accessible to the wider Analytics community This is enabled by the Delta file format

15. Dynamic Partition Pruning

16. Spark Partitioning SELECT * FROM Sales WHERE Month = 3 SQL Query Action Filtering performed by selectively reading files SALES Month=1 Month=2 Month=3 Month=4

17. Cross-Filter Spark 2.4 SELECT * FROM Sales JOIN Date WHERE DateMonth = 3 SQL Query Action SALES Month=1 Month=2 Month=3 Month=4 DimDATE Partition Keys not hit when filtering on joined tables

18. Cross-Filter Spark 3.0 SELECT * FROM Sales JOIN Date WHERE DateMonth = 3 SQL Query Action SALES Month=1 Month=2 Month=3 Month=4 DimDATE Dynamic Partition Pruning determines partition filters during runtime

19. Adaptive Query Execution

20. AQE in Spark 3.0 AQE will speed up common queries in a number of ways: ▪ Coalescing Shuffle Partitions ▪ Switching Join Strategies ▪ Optimizing Skew Joins

21. Before AQE - Shuffle Coalescing [Diagram: Read RDDs (2 tasks) -> Shuffle RDDs (200 tasks) -> Write RDDs (2 tasks)]

22. Using Spark 3.0 AQE - Shuffle Coalescing [Diagram: Read RDDs (2 tasks) -> Shuffle RDDs (2 tasks) -> Write RDDs (2 tasks)]

23. DEMO: Let’s see it in action

24. The Data Lakehouse Delta & Spark 3.0 enable the Lakehouse through: ▪ Enabling familiar (SQL) patterns ▪ Removing technical barriers ▪ Targeting performance of common warehousing activities

25. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
