In this talk, we will present recent enhancements to the techniques previously discussed in this blog: https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html. We will start by discussing the different CDC architectures that can be deployed in concert with Databricks Delta. We will then use notebooks to demonstrate updated CDC SQL and look at performance-tuning considerations for both batch and streaming CDC pipelines into Delta.
2. Ameet Kini, Databricks
April 24, 2019
Simplifying Change Data Capture
Using Delta Lakes
#UnifiedAnalytics #SparkAISummit
3. About Me
Current: Regional Manager (Federal) of Resident Architects @ Databricks
…
A while ago
• OS Geo on Spark 0.6
• Almost spoke at Spark Summit 2014
…
A while while ago
• MapReduce @ Google (in 2006 pre Hadoop)
• Kernel Dev @ Oracle
…
4. Outline
• Journey through evolution of CDC in Databricks
– Pretty architecture diagrams
• Understand what goes behind the scenes
– “Pretty” SQL query plans 🙂
• Preview of key upcoming features
8. What worked and what did not?
Worked
• Least disruptive: adding Databricks to the existing stack
• Easy to get started with spark.read.jdbc
Did not work
• No $$$ savings or EDW compute offload
• EDW overloaded, which constrained when S3 refresh jobs could be scheduled
• Refresh rates were at best nightly due to the concurrent read/write limitations of vanilla Parquet
10. With Delta circa 2018
Oracle CDC
Tables captured
using database triggers
Every refresh period, run these two steps:
1. INSERT into staging table
2. INSERT OVERWRITE modified partitions of final table
See https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html
11. What worked and what did not?
Worked
• Delta removed dependency on EDW for CDC
• Refresh rates went from nightly to sub-hourly
• Easy to scale to multiple pipelines using features like
notebook workflows and jobs
Did not work
• Scheme relied on effective partitioning to minimize updates, which requires domain-specific knowledge
• Without effective partitioning, Step 2 effectively overwrites most of the table…S..L..O..W
See https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html
12. Efficient Upserts in Delta
MERGE INTO users
USING changes ON users.userId = changes.userId
WHEN MATCHED AND FLAG = 'D' THEN DELETE
WHEN MATCHED AND FLAG <> 'D'
THEN UPDATE SET address = changes.address
WHEN NOT MATCHED
THEN INSERT (userId, address)
VALUES (changes.userId, changes.address)
Deletes
Updates
Inserts
Source Table
Target Table
A single command
to process all
three action types
Expanded syntax of MERGE INTO introduced in Databricks Runtime 5.1
See https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
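The semantics of the three-way MERGE above can be sketched in plain Scala (no Spark; the `User` and `Change` types are illustrative, not the actual Delta implementation):

```scala
// Plain-Scala sketch of the MERGE semantics above: apply a batch of
// changes (flagged 'D' for delete, anything else for upsert) to a
// target keyed by userId.
case class User(userId: Int, address: String)
case class Change(userId: Int, address: String, flag: String)

def merge(users: Map[Int, User], changes: Seq[Change]): Map[Int, User] =
  changes.foldLeft(users) { (target, c) =>
    if (c.flag == "D") target - c.userId // MATCHED + FLAG='D' => DELETE (no-op here if absent)
    else target + (c.userId -> User(c.userId, c.address)) // MATCHED => UPDATE, NOT MATCHED => INSERT
  }
```

One batch of changes yields one new target state, mirroring how a single MERGE command handles all three action types.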
13. Works for Streaming and Batch
streamingSessionUpdatesDF.writeStream.foreachBatch { (microBatchOutputDF: DataFrame, batchId: Long) =>
  microBatchOutputDF.createOrReplaceTempView("updates")
  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO users
    USING updates ON users.userId = updates.userId
    WHEN MATCHED AND FLAG = 'D'
      THEN DELETE
    WHEN MATCHED AND FLAG <> 'D'
      THEN UPDATE SET address = updates.address
    WHEN NOT MATCHED
      THEN INSERT (userId, address) VALUES (updates.userId, updates.address)
  """)
}.start()
See https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
14. Used by MySQL Replication
Public Preview in DBR-5.3 https://docs.databricks.com/delta/mysql-delta.html#mysql-delta
What: CDC MySQL tables into Delta
From: MySQL tables in binlog format
To: Delta
15. With Delta Now
Oracle CDC
Tables captured
using database triggers
Before: every refresh period, run two steps
1. INSERT into staging table
2. INSERT OVERWRITE modified partitions of final table
Now: every refresh period, MERGE changes into the table
16. Visually
[Diagram: the Users table spans Partitions 1–3; the update batch touches files 2, 4, and 8, which are rewritten as new files alongside the untouched old files. Legend: files with "Insert", "Update", and "Delete" records.]
Delta marks the replaced old files stale and eligible for vacuum
17. Outline
• Journey through evolution of CDC in Databricks
– Pretty architecture diagrams
• Understand what goes behind the scenes
– “Pretty” SQL query plans 🙂
• Preview of key upcoming features
19. A tale of two joins
MERGE runs two joins
• Inner Join
– Between Source and Target
– Goal: find files that need to be modified (e.g., files 2, 4, 8)
• Outer Join
– Between Source and subset-of-files-identified-by-Inner-Join
– Goal: write out modified and unmodified data together (e.g., "New Files")
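The two-join scheme can be sketched in plain Scala (no Spark; the file and row types are illustrative, and only updates are modeled): the "inner join" phase finds which files hold matched keys, and only those files are rewritten.

```scala
// Sketch of MERGE's two-phase plan: (1) inner-join source keys against
// the target to find the files containing matches; (2) rewrite only
// those files, carrying unmodified rows along with the updated ones.
case class Row(key: Int, value: String)
case class DataFile(name: String, rows: Seq[Row])

def mergePlan(files: Seq[DataFile], source: Map[Int, String]): (Set[String], Seq[DataFile]) = {
  // Phase 1: which files contain at least one source key?
  val touched = files.filter(_.rows.exists(r => source.contains(r.key))).map(_.name).toSet
  // Phase 2: rewrite only the touched files; untouched files are kept as-is.
  val rewritten = files.filter(f => touched(f.name)).map { f =>
    DataFile(f.name + "-new", f.rows.map(r => source.get(r.key).fold(r)(v => Row(r.key, v))))
  }
  (touched, rewritten)
}
```

Note that a single matched row forces its whole file to be rewritten, which is why the outer-join phase dominates when matches are spread across many files.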
20. Say, if you run this…
Merging a 1,000-row source into a 100-million-row target, using TPC-DS data
21. The two joins under the hood…
Runs two joins
• Inner Join
– Between Source and Target
– Goal: find files that need to be modified (e.g., files 2, 4, 8)
• Outer Join
– Between Source and subset-of-files-identified-by-Inner-Join
– Goal: write out modified and unmodified data together (e.g., "New Files")
Inner Join takes 7s
Outer Join takes 32s
22. Let’s peek inside the inner join
Optimizer picks Broadcast Hash Join
A suitable choice when joining a small table (source) with a large one (target)
23. But what if it picks Sort Merge instead?
Same 7s inner join now takes 16s … 2x slower!
24. This is what Sort Merge looks like
25. Inner Join Summary
If |source| << |target|, nudge optimizer into picking broadcast hash join
• Ensure stats are collected on join columns
• Increase spark.sql.autoBroadcastJoinThreshold
appropriately (default: 10MB)
• Use optimizer hints (applies to explicit joins, not to MERGE)
SELECT /*+ BROADCAST(source) */ ...
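To see why broadcasting pays off when |source| << |target|, here is an illustrative plain-Scala contrast of the two strategies (not Spark internals): a broadcast hash join builds a hash map of the small side once and streams the large side, while a sort-merge join must sort both sides first.

```scala
// Core access patterns of the two join strategies the optimizer chooses from.
def broadcastHashJoin(small: Seq[(Int, String)], large: Seq[(Int, String)]): Seq[(Int, String, String)] = {
  val hash = small.toMap // "broadcast": hash the small side once
  large.flatMap { case (k, v) => hash.get(k).map(s => (k, s, v)) } // single streaming pass over the large side
}

def sortMergeJoin(a: Seq[(Int, String)], b: Seq[(Int, String)]): Seq[(Int, String, String)] = {
  // Both sides pay an O(n log n) sort before the linear merge —
  // the cost the broadcast variant avoids on the large side.
  val sa = a.sortBy(_._1).toIndexedSeq
  val sb = b.sortBy(_._1).toIndexedSeq
  val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String, String)]
  var i = 0; var j = 0
  while (i < sa.length && j < sb.length) {
    if (sa(i)._1 < sb(j)._1) i += 1
    else if (sa(i)._1 > sb(j)._1) j += 1
    else { out += ((sa(i)._1, sa(i)._2, sb(j)._2)); j += 1 } // assumes unique keys on side a
  }
  out.toSeq
}
```

Both produce the same matches; the difference is that sorting a 100-million-row side dwarfs hashing a 1,000-row side.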
26. Next: Outer Join
Runs two joins
• Inner Join
– Between Source and Target
– Goal: find files that need to be modified (e.g., files 2, 4, 8)
• Outer Join
– Between Source and subset-of-files-identified-by-Inner-Join
– Goal: write out modified and unmodified data together (e.g., "New Files")
Inner Join takes 7s
Outer Join takes 32s (dominated by S3 writes)
27. Outer Join latency tied to…
…the amount of data re-written (the gray boxes in the slide's diagram); S3 writes are slow
Here we’re writing 3 new files
28. Let’s see the numbers…
Target table store_sales_100m
• 100 million rows
• Compacted into 5 parquet files of 1G each (OPTIMIZE ameet.store_sales_100m)
Source table
• 1000 rows
• Drawn from 1, 3, and all 5 files
29. Outer Join is write-bound
Key take-aways
• Outer Join time is directly tied to the amount of data written
• Inner Join time is a small proportion of overall time and does not change as the amount of data written increases
Time (seconds) by # of files modified:
Files modified   Inner Join   Outer Join
1x1GB            7            32
3x1GB            6            66
5x1GB            7            84
31. Outer Joins are faster as files get smaller
MERGE on a smaller file takes 18 seconds instead of 39!
[Bar chart: Inner Join + Outer Join time (seconds) for 1x36MB, 1x1GB, 3x1GB, and 5x1GB files modified; compare the first two bars: 18s total for 1x36MB versus 39s for 1x1GB.]
32. But queries get slower with more small files
Query: SELECT count(*)
FROM ameet.store_sales_100m
WHERE ss_sold_time_sk = 48472
The "Scan" operator is at the heart of most queries
Same 100-million row table takes
• 1 second with 5x1GB files, versus
• 12 seconds with 1355 smaller files
33. OPTIMIZE until now…
Creates large compacted files
• Default: 1GB (controlled by spark.databricks.delta.optimize.maxFileSize)
• Large files great for queries, not for MERGE
• Small files great for MERGE, not for queries
• Complexity in controlling when and where to OPTIMIZE
34. OPTIMIZE future…is here
Auto Optimize Project
• Adaptive Shuffling controls # and size of files written out
• Automatically triggers a faster OPTIMIZE after files are written out
• Strives for 128MB files
Private Preview in DBR-5.3
https://docs.databricks.com/release-notes/runtime/5.3.html#private-preview-features
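The compaction idea behind Auto Optimize can be sketched as a simple bin-packing pass (plain Scala; the 128 MB target comes from the slide, everything else is illustrative):

```scala
// First-fit-decreasing bin packing of small files into compacted output
// files, striving for a target output size (128 MB, per Auto Optimize).
val TargetBytes: Long = 128L * 1024 * 1024

// Pack file sizes (bytes) into groups whose totals stay within the target.
def compact(fileSizes: Seq[Long]): Seq[Seq[Long]] =
  fileSizes.sorted(Ordering[Long].reverse).foldLeft(List.empty[(Long, List[Long])]) {
    case (bins, size) =>
      bins.indexWhere { case (total, _) => total + size <= TargetBytes } match {
        case -1 => (size, List(size)) :: bins // no bin fits: open a new output file
        case i  => bins.updated(i, (bins(i)._1 + size, size :: bins(i)._2))
      }
  }.map(_._2)
```

Each resulting group would become one compacted output file, keeping files large enough for fast scans without the 1 GB rewrites that slow MERGE down.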
36. Summary
Use MERGE INTO for CDC into Delta Lakes
• Unified API for Batch and Streaming
• Efficient: Broadcast joins, Partition Pruning, Compaction, Optimistic Concurrency Control
• Reliable: ACID guarantees on cloud storage, Schema Enforcement, S3 commit service
37. Summary (contd.)
If you’re diagnosing / tuning MERGE performance
• Inner Join to find files that are modified
– Tip: ensure it uses broadcast hash join wherever applicable
• Outer Join to write modified and unmodified files together
– Latency directly tied to time to write data out to cloud storage
– Tip: consider using Auto Optimize, starting with DBR 5.3
38. Related Talks
• (Wed 1:40pm)
Productizing Structured Streaming Jobs
- Burak Yavuz
• (Thurs 4:40pm)
Apache Spark Core – Deep Dive – Proper Optimization
– Daniel Tomes
• (Wed 11:00am, Thurs 4:40pm)
Building Robust Production Data Pipelines with Databricks Delta
– Joe Widen, Steven Yu, Burak Yavuz