The Apache Pulsar community is collaborating with the Delta Lake community on a project that adds value to both ecosystems: the Pulsar - Delta Lake Connector.
In this session, we will first provide an overview of the Pulsar - Delta Lake connector and the Delta Lake Standalone Reader, then introduce the design of the Pulsar - Delta Lake CDC source connector, including how it captures data changes in Delta Lake and how it recovers from the last checkpoint with the help of the Pulsar Functions state store. We will also discuss the scalability of the Pulsar - Delta Lake CDC source connector.
4. Delta Lake
Background
• Delta Lake is an open-source project that
enables building a Lakehouse
Architecture on top of existing storage
systems such as S3, ADLS, GCS, and
HDFS.
• Key Features:
• Open Format (Apache Parquet)
• Time Travel (data versioning)
• Scalable Metadata Handling
• Schema Evolution
• Delta Sharing
• ACID Transactions
• ...
7. Why a Delta Lake CDC source connector
Background
• Change Data Capture (CDC)
• CDC is the process of recognising when data has been changed in a source
system so that a downstream process or system can act on that change.
• Benefits:
• process data incrementally
• keep systems in sync
• decouple systems
14. Capture Delta Lake Data Changes
Design
Description: example value
• The operation of the data change: ‘c’ for add, ‘r’ for delete
• The Delta table partition value: date=2019-01-01
• The timestamp when the data change is captured by the source connector: 1559208957692
• The timestamp when the data change happens in Delta Lake: 1559208951000
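The four fields above can be pictured as one change record per captured row. The following is only a minimal sketch: the slide does not give the field names, so `op`, `partition_value`, `captured_ts_ms`, and `changed_ts_ms` are hypothetical, and the JSON encoding is an assumption made for illustration rather than the connector's actual wire format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChangeRecord:
    """One captured data change. All field names are hypothetical."""
    op: str               # operation code from the slide: 'c' for add, 'r' for delete
    partition_value: str  # Delta table partition value, e.g. "date=2019-01-01"
    captured_ts_ms: int   # when the source connector captured the change
    changed_ts_ms: int    # when the change happened in Delta Lake

    def capture_lag_ms(self) -> int:
        # Delay between the change in Delta Lake and its capture by the connector.
        return self.captured_ts_ms - self.changed_ts_ms

    def to_json(self) -> str:
        # JSON is an assumed encoding, chosen only to make the record printable.
        return json.dumps(asdict(self), sort_keys=True)


# Example values taken from the slide.
record = ChangeRecord(
    op="c",
    partition_value="date=2019-01-01",
    captured_ts_ms=1559208957692,
    changed_ts_ms=1559208951000,
)
print(record.to_json())
print(record.capture_lag_ms())  # 6692
```

With the slide's example values, the change was captured about 6.7 seconds after it happened in Delta Lake.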
15. Schema transform and evolution
Design
How does Pulsar get the schema from a Pulsar IO
connector?
20. Scalability
Design
How is work distributed across source connector
instances?
Instance -> Pulsar topic partitions: 1:N
Delta Lake partitions -> Pulsar topic partition: N:1
Delta Lake partitions -> Instance: N:1
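The cardinalities above can be sketched as a two-step assignment. The slide does not say how the mapping is computed, so the hash-and-modulo scheme and the function names below are assumptions, shown only to illustrate that each Delta Lake partition ends up on exactly one topic partition and exactly one instance, while one instance may serve several topic partitions.

```python
import zlib


def topic_partition_for(delta_partition: str, num_topic_partitions: int) -> int:
    # Assumed scheme: a stable hash (CRC32) of the Delta partition value, so that
    # many Delta partitions map onto one topic partition (N:1).
    return zlib.crc32(delta_partition.encode("utf-8")) % num_topic_partitions


def instance_for(topic_partition: int, num_instances: int) -> int:
    # Assumed scheme: round-robin, so one instance owns several topic partitions (1:N).
    return topic_partition % num_instances


def assign(delta_partitions, num_topic_partitions, num_instances):
    """Return {delta_partition: (topic_partition, instance)}."""
    plan = {}
    for partition in delta_partitions:
        tp = topic_partition_for(partition, num_topic_partitions)
        plan[partition] = (tp, instance_for(tp, num_instances))
    return plan


delta_partitions = [f"date=2019-01-{day:02d}" for day in range(1, 11)]
plan = assign(delta_partitions, num_topic_partitions=4, num_instances=2)
for partition, (tp, instance) in plan.items():
    print(partition, "-> topic partition", tp, "-> instance", instance)
```

Because the instance is derived from the topic partition, all Delta Lake partitions that share a topic partition are handled by the same instance, which keeps per-partition ordering and checkpointing local to one worker.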
21. Checkpoint
Design
A checkpoint is used to recover from the last position and continue CDC.
Checkpoints are stored in the
Pulsar state store.
Every topic partition has its own checkpoint, which is
saved periodically.
<delta snapshot version, file change index, row number index, pulsar sequence number>
Checkpoint
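The checkpoint tuple and the recovery step can be sketched as follows. This is an illustration under stated assumptions, not the connector's implementation: `InMemoryStateStore` is a hypothetical stand-in for the Pulsar state store, the key format `checkpoint-<partition>` and the JSON encoding are invented for the example, and `should_emit` is one plausible way to skip rows that were already published before a restart.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Checkpoint:
    # The four components of the checkpoint tuple from the slide.
    snapshot_version: int
    file_change_index: int
    row_number_index: int
    pulsar_sequence_number: int

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")  # assumed encoding

    @classmethod
    def from_bytes(cls, raw: bytes) -> "Checkpoint":
        return cls(**json.loads(raw.decode("utf-8")))


class InMemoryStateStore:
    """Hypothetical stand-in for the Pulsar state store (a key-value store)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


def save_checkpoint(store, topic_partition: int, checkpoint: Checkpoint) -> None:
    # One checkpoint per topic partition; the key format is an assumption.
    store.put(f"checkpoint-{topic_partition}", checkpoint.to_bytes())


def load_checkpoint(store, topic_partition: int):
    raw = store.get(f"checkpoint-{topic_partition}")
    return Checkpoint.from_bytes(raw) if raw is not None else None


def should_emit(position, last):
    """position is (snapshot version, file change index, row number index)."""
    if last is None:
        return True  # no checkpoint yet: start from the beginning
    done = (last.snapshot_version, last.file_change_index, last.row_number_index)
    return position > done  # tuples compare field by field, left to right


store = InMemoryStateStore()
save_checkpoint(store, 0, Checkpoint(5, 2, 100, 1234))
restored = load_checkpoint(store, 0)
print(should_emit((5, 2, 100), restored))  # False: already published
print(should_emit((5, 2, 101), restored))  # True: next row in the same file
```

Ordering the position as version, then file, then row means a restart resumes inside the snapshot that was being processed instead of replaying it from its first file.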