In this talk, we will present recent enhancements to the techniques previously discussed in this blog: https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html. We will start by discussing the different CDC architectures that can be deployed in concert with Databricks Delta. We will then use notebooks to demonstrate updated CDC SQL and look at performance-tuning considerations for both batch and streaming CDC pipelines into Delta.
2. Ameet Kini, Databricks
April 24, 2019
Simplifying Change Data Capture
Using Delta Lakes
#UnifiedAnalytics #SparkAISummit
3. About Me
Current: Regional Manager (Federal) of Resident Architects @ Databricks
…
A while ago
• OS Geo on Spark 0.6
• Almost spoke at Spark Summit 2014
…
A while while ago
• MapReduce @ Google (in 2006 pre Hadoop)
• Kernel Dev @ Oracle
…
4. Outline
• Journey through evolution of CDC in Databricks
– Pretty architecture diagrams
• Understand what goes behind the scenes
– “Pretty” SQL query plans 🙂
• Preview of key upcoming features
8. What worked and what did not?
Worked
• Least disruptive: adding Databricks to the existing stack
• Easy to get started with spark.read.jdbc
Did not work
• No $$$ savings or EDW compute offload
• EDW overloaded, which constrained when S3 refresh jobs could be scheduled
• Refresh rates were at best nightly due to the concurrent read/write limitations of vanilla Parquet
10. With Delta circa 2018
Oracle CDC
Tables captured
using database triggers
Every refresh period, run these two steps:
1. INSERT into staging table
2. INSERT OVERWRITE modified partitions of final table
See https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html
11. What worked and what did not?
Worked
• Delta removed dependency on EDW for CDC
• Refresh rates went from nightly to sub-hourly
• Easy to scale to multiple pipelines using features like
notebook workflows and jobs
Did not work
• Scheme relied on effective partitioning to minimize updates, which requires domain-specific knowledge
• Without effective partitioning, Step 2 effectively overwrites most of the table…S..L..O..W
See https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html
12. Efficient Upserts in Delta
MERGE INTO users
USING changes ON users.userId = changes.userId
WHEN MATCHED AND FLAG = 'D' THEN DELETE
WHEN MATCHED AND FLAG <> 'D'
THEN UPDATE SET address = changes.address
WHEN NOT MATCHED
THEN INSERT (userId, address)
VALUES (changes.userId, changes.address)
Deletes
Updates
Inserts
Source Table
Target Table
A single command
to process all
three action types
Expanded syntax of MERGE INTO introduced in Databricks Runtime 5.1
See https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
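The semantics of the three-way MERGE above can be sketched in plain Scala (no Spark; the `User` and `Change` types are illustrative, not the actual Delta implementation):

```scala
// Plain-Scala sketch of the MERGE semantics above: apply a batch of
// changes (flagged 'D' for delete, anything else for upsert) to a
// target keyed by userId.
case class User(userId: Int, address: String)
case class Change(userId: Int, address: String, flag: String)

def merge(users: Map[Int, User], changes: Seq[Change]): Map[Int, User] =
  changes.foldLeft(users) { (target, c) =>
    if (c.flag == "D") target - c.userId // MATCHED + FLAG='D' => DELETE (no-op here if absent)
    else target + (c.userId -> User(c.userId, c.address)) // MATCHED => UPDATE, NOT MATCHED => INSERT
  }
```

One batch of changes yields one new target state, mirroring how a single MERGE command handles all three action types.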
13. Works for Streaming and Batch
streamingSessionUpdatesDF.writeStream.foreachBatch { (microBatchOutputDF: DataFrame, batchId: Long) =>
  microBatchOutputDF.createOrReplaceTempView("updates")
  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO users
    USING updates ON users.userId = updates.userId
    WHEN MATCHED AND FLAG = 'D'
      THEN DELETE
    WHEN MATCHED AND FLAG <> 'D'
      THEN UPDATE SET address = updates.address
    WHEN NOT MATCHED
      THEN INSERT (userId, address) VALUES (updates.userId, updates.address)
  """)
}.start()
See https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
14. Used by MySQL Replication
Public Preview in DBR-5.3 https://docs.databricks.com/delta/mysql-delta.html#mysql-delta
What: CDC MySQL tables into Delta
From: MySQL tables in binlog format
To: Delta
15. With Delta Now
Oracle CDC
Tables captured
using database triggers
Before: every refresh period, run two steps
1. INSERT into staging table
2. INSERT OVERWRITE modified partitions of final table
Now: every refresh period, MERGE changes into the table
16. Visually
[Diagram: the Users table spans Partitions 1–3; the update batch touches files 2, 4, and 8, which are rewritten as new files alongside the untouched old files. Legend: files with "Insert", "Update", and "Delete" records.]
Delta marks the replaced old files stale and eligible for vacuum
17. Outline
• Journey through evolution of CDC in Databricks
– Pretty architecture diagrams
• Understand what goes behind the scenes
– “Pretty” SQL query plans 🙂
• Preview of key upcoming features
19. A tale of two joins
MERGE runs two joins
• Inner Join
– Between Source and Target
– Goal: find files that need to be modified (e.g., files 2, 4, 8)
• Outer Join
– Between Source and subset-of-files-identified-by-Inner-Join
– Goal: write out modified and unmodified data together (e.g., "New Files")
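The two-join scheme can be sketched in plain Scala (no Spark; the file and row types are illustrative, and only updates are modeled): the "inner join" phase finds which files hold matched keys, and only those files are rewritten.

```scala
// Sketch of MERGE's two-phase plan: (1) inner-join source keys against
// the target to find the files containing matches; (2) rewrite only
// those files, carrying unmodified rows along with the updated ones.
case class Row(key: Int, value: String)
case class DataFile(name: String, rows: Seq[Row])

def mergePlan(files: Seq[DataFile], source: Map[Int, String]): (Set[String], Seq[DataFile]) = {
  // Phase 1: which files contain at least one source key?
  val touched = files.filter(_.rows.exists(r => source.contains(r.key))).map(_.name).toSet
  // Phase 2: rewrite only the touched files; untouched files are kept as-is.
  val rewritten = files.filter(f => touched(f.name)).map { f =>
    DataFile(f.name + "-new", f.rows.map(r => source.get(r.key).fold(r)(v => Row(r.key, v))))
  }
  (touched, rewritten)
}
```

Note that a single matched row forces its whole file to be rewritten, which is why the outer-join phase dominates when matches are spread across many files.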
20. Say, if you run this…
Merging a 1,000-row source into a 100-million-row target, using TPC-DS data
21. The two joins under the hood…
Runs two joins
• Inner Join
– Between Source and Target
– Goal: find files that need to be modified (e.g., files 2, 4, 8)
• Outer Join
– Between Source and subset-of-files-identified-by-Inner-Join
– Goal: write out modified and unmodified data together (e.g., "New Files")
Inner Join takes 7s
Outer Join takes 32s
22. Let’s peek inside the inner join
Optimizer picks Broadcast Hash Join
A suitable choice when joining a small table (source) with a large one (target)
23. But what if it picks Sort Merge instead?
Same 7s inner join now takes 16s … 2x slower!
24. This is what Sort Merge looks like
25. Inner Join Summary
If |source| << |target|, nudge optimizer into picking broadcast hash join
• Ensure stats are collected on join columns
• Increase spark.sql.autoBroadcastJoinThreshold
appropriately (default: 10MB)
• Use optimizer hints (applies to explicit joins, not to MERGE)
SELECT /*+ BROADCAST(source) */ ...
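To see why broadcasting pays off when |source| << |target|, here is an illustrative plain-Scala contrast of the two strategies (not Spark internals): a broadcast hash join builds a hash map of the small side once and streams the large side, while a sort-merge join must sort both sides first.

```scala
// Core access patterns of the two join strategies the optimizer chooses from.
def broadcastHashJoin(small: Seq[(Int, String)], large: Seq[(Int, String)]): Seq[(Int, String, String)] = {
  val hash = small.toMap // "broadcast": hash the small side once
  large.flatMap { case (k, v) => hash.get(k).map(s => (k, s, v)) } // single streaming pass over the large side
}

def sortMergeJoin(a: Seq[(Int, String)], b: Seq[(Int, String)]): Seq[(Int, String, String)] = {
  // Both sides pay an O(n log n) sort before the linear merge —
  // the cost the broadcast variant avoids on the large side.
  val sa = a.sortBy(_._1).toIndexedSeq
  val sb = b.sortBy(_._1).toIndexedSeq
  val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String, String)]
  var i = 0; var j = 0
  while (i < sa.length && j < sb.length) {
    if (sa(i)._1 < sb(j)._1) i += 1
    else if (sa(i)._1 > sb(j)._1) j += 1
    else { out += ((sa(i)._1, sa(i)._2, sb(j)._2)); j += 1 } // assumes unique keys on side a
  }
  out.toSeq
}
```

Both produce the same matches; the difference is that sorting a 100-million-row side dwarfs hashing a 1,000-row side.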
26. Next: Outer Join
Runs two joins
• Inner Join
– Between Source and Target
– Goal: find files that need to be modified (e.g., files 2, 4, 8)
• Outer Join
– Between Source and subset-of-files-identified-by-Inner-Join
– Goal: write out modified and unmodified data together (e.g., "New Files")
Inner Join takes 7s
Outer Join takes 32s (dominated by S3 writes)
27. Outer Join latency tied to…
…the amount of data re-written (the gray boxes in the slide's diagram); S3 writes are slow
Here we’re writing 3 new files
28. Let’s see the numbers…
Target table store_sales_100m
• 100 million rows
• Compacted into 5 parquet files of 1G each (OPTIMIZE ameet.store_sales_100m)
Source table
• 1000 rows
• Drawn from 1, 3, and all 5 files
29. Outer Join is write-bound
Key take-aways
• Outer Join time is directly tied to the amount of data written
• Inner Join time is a small proportion of overall time and does not change as the amount of data written increases
Time (seconds) by # of files modified:
Files modified   Inner Join   Outer Join
1x1GB            7            32
3x1GB            6            66
5x1GB            7            84
31. Outer Joins are faster as files get smaller
MERGE on a smaller file takes 18 seconds instead of 39!
[Bar chart: Inner Join + Outer Join time (seconds) for 1x36MB, 1x1GB, 3x1GB, and 5x1GB files modified; compare the first two bars: 18s total for 1x36MB versus 39s for 1x1GB.]
32. But queries get slower with more small files
Query: SELECT count(*)
FROM ameet.store_sales_100m
WHERE ss_sold_time_sk = 48472
The "Scan" operator is at the heart of most queries
Same 100-million row table takes
• 1 second with 5x1GB files, versus
• 12 seconds with 1355 smaller files
33. OPTIMIZE until now…
Creates large compacted files
• Default: 1GB (controlled by spark.databricks.delta.optimize.maxFileSize)
• Large files great for queries, not for MERGE
• Small files great for MERGE, not for queries
• Complexity in controlling when and where to OPTIMIZE
34. OPTIMIZE future…is here
Auto Optimize Project
• Adaptive Shuffling controls # and size of files written out
• Automatically triggers a faster OPTIMIZE after files are written out
• Strives for 128MB files
Private Preview in DBR-5.3
https://docs.databricks.com/release-notes/runtime/5.3.html#private-preview-features
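The compaction idea behind Auto Optimize can be sketched as a simple bin-packing pass (plain Scala; the 128 MB target comes from the slide, everything else is illustrative):

```scala
// First-fit-decreasing bin packing of small files into compacted output
// files, striving for a target output size (128 MB, per Auto Optimize).
val TargetBytes: Long = 128L * 1024 * 1024

// Pack file sizes (bytes) into groups whose totals stay within the target.
def compact(fileSizes: Seq[Long]): Seq[Seq[Long]] =
  fileSizes.sorted(Ordering[Long].reverse).foldLeft(List.empty[(Long, List[Long])]) {
    case (bins, size) =>
      bins.indexWhere { case (total, _) => total + size <= TargetBytes } match {
        case -1 => (size, List(size)) :: bins // no bin fits: open a new output file
        case i  => bins.updated(i, (bins(i)._1 + size, size :: bins(i)._2))
      }
  }.map(_._2)
```

Each resulting group would become one compacted output file, keeping files large enough for fast scans without the 1 GB rewrites that slow MERGE down.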
36. Summary
Use MERGE INTO for CDC into Delta Lakes
• Unified API for Batch and Streaming
• Efficient: Broadcast joins, Partition Pruning, Compaction, Optimistic Concurrency Control
• Reliable: ACID guarantees on cloud storage, Schema Enforcement, S3 commit service
37. Summary (contd.)
If you’re diagnosing / tuning MERGE performance
• Inner Join to find files that are modified
– Tip: ensure it uses broadcast hash join wherever applicable
• Outer Join to write modified and unmodified files together
– Latency directly tied to time to write data out to cloud storage
– Tip: consider using Auto Optimize, starting with DBR 5.3
38. Related Talks
• (Wed 1:40pm)
Productizing Structured Streaming Jobs
- Burak Yavuz
• (Thurs 4:40pm)
Apache Spark Core – Deep Dive – Proper Optimization
– Daniel Tomes
• (Wed 11:00am, Thurs 4:40pm)
Building Robust Production Data Pipelines with Databricks Delta
– Joe Widen, Steven Yu, Burak Yavuz