Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together let you build the foundation of your data lakehouse by ensuring the reliability of your concurrent streams, from processing down to the underlying cloud object store. The Flink/Delta Connector lets you store data in Delta tables so that you harness Delta's ACID transactions and scalability while maintaining Flink's end-to-end exactly-once processing. Data from Flink is written to Delta tables idempotently, so even if the Flink pipeline is restarted from its checkpoint information, the pipeline guarantees that no data is lost or duplicated, preserving Flink's exactly-once semantics.
by Scott Sandre & Denny Lee
2. About Scott
Software Engineer, Databricks
● Part of the Delta Ecosystem team
● Leads development of the Delta Standalone project
● Co-developer of the Flink/Delta Source and Sink
● Bachelor of Software Engineering, University of Waterloo
[Photo: Scott chilling with the stereotypical SF backdrop]
3. About Denny
Sr Staff Developer Advocate, Databricks
Has worked on Spark since 0.6 and Delta Lake since its inception
Previously:
● Senior Director of Data Science Engineering at SAP Concur
● Principal Program Manager at Microsoft for Azure Cosmos DB, Project Isotope (Azure HDInsight), and SQL Server
[Photo: Denny surprised that he's still awake]
4. An open-source storage format that brings ACID transactions to big data workloads on cloud object stores: the key ingredient for building Lakehouses
6. Data Warehouses were purpose-built for BI and reporting, however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
▪ Expensive to scale out
Therefore, most data is stored in data lakes & blob stores
[Diagram: External Data and Operational Data flow through ETL into Data Warehouses, which serve BI and Reports]
7. Data Lakes could handle all your data for data science and ML, however…
▪ Poor BI support
▪ Complex to set up
▪ Poor performance
▪ Unreliable data swamps
[Diagram: structured, semi-structured and unstructured data lands in the Data Lake, then flows through Data Prep and Validation plus ETL to serve a Real-Time Database, Reports, Data Warehouses, BI, Data Science, and Machine Learning]
8. Coexistence is not a desirable strategy
[Diagram: the Data Lake stack (Data Prep and Validation, ETL, Real-Time Database, Reports, Data Warehouses, BI, Data Science, Machine Learning) shown side by side with the Data Warehouse stack (External Data and Operational Data, ETL, Data Warehouses, BI Reports), duplicating pipelines across the two systems]
9. Lakehouse – best of both workloads
[Diagram: Data Warehouse and Data Lake combined into one Lakehouse serving Streaming Analytics, BI, Data Science, and Machine Learning over structured, semi-structured and unstructured data]
12. Lakehouse
One platform for every use case: Streaming Analytics, BI, Data Science, and Machine Learning
● High-performance query engine(s)
● Scalable, open, general-purpose transactional data format
● Data Lake for all your data: structured, semi-structured and unstructured
15. Scalable storage, scalable transaction log
pathToTable/
+---- 000.parquet
+---- 001.parquet
+---- 002.parquet
+---- ...
|
+---- _delta_log/
      +---- 000.json
      +---- 001.json
      ...
Table data is stored as Parquet files on cloud storage.
The _delta_log/ directory is a sequence of metadata files that track the operations made on the files in the table; it is stored in cloud storage along with the table, and readers can read and process the metadata in parallel.
16. Transaction Log Commits
Changes to the table are stored as ordered, atomic commits.
Each commit is a JSON file in _delta_log/ containing a set of actions, e.g.:
● 000.json (INSERT actions): Add 001.parquet, Add 002.parquet
● 001.json (UPDATE actions): Remove 001.parquet, Remove 002.parquet, Add 003.parquet
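To make this concrete, here is a simplified sketch of the actions inside a commit file, one JSON action per line (abbreviated; real commits in the Delta protocol carry additional fields such as partitionValues and stats):

{"add": {"path": "003.parquet", "size": 1024, "modificationTime": 1650000000000, "dataChange": true}}
{"remove": {"path": "001.parquet", "deletionTimestamp": 1650000000000, "dataChange": true}}
{"remove": {"path": "002.parquet", "deletionTimestamp": 1650000000000, "dataChange": true}}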
17. Consistent Snapshots
Readers read the log in atomic units, and thus read consistent snapshots.
Given the two commits above (INSERT actions adding 001.parquet and 002.parquet, then UPDATE actions removing both and adding 003.parquet), readers will read either
● [001 + 002].parquet, or
● 003.parquet
and nothing in-between.
18. ACID via Mutual Exclusion on Log Commits
Concurrent writers need to agree on the order of changes (optimistic concurrency control).
New commit files must be created mutually exclusively, using storage-specific API guarantees.
[Diagram: Writer 1 and Writer 2 both race to create 002.json after commits 000.json and 001.json; only one of the writers trying to concurrently write 002.json can succeed]
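As a minimal sketch of what mutually exclusive commit creation means, assuming an HDFS-like file system where create(path, overwrite=false) is atomic (the class name and commit-writing logic here are illustrative, not the connector's actual implementation):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitSketch {
    /** Try to write the next commit file; at most one concurrent writer wins. */
    static boolean tryCommit(FileSystem fs, Path commitFile, byte[] actionsJson)
            throws IOException {
        try (FSDataOutputStream out = fs.create(commitFile, /* overwrite = */ false)) {
            // create(..., overwrite=false) fails atomically if the file exists,
            // so two writers racing on 002.json cannot both succeed.
            out.write(actionsJson);
            return true;
        } catch (FileAlreadyExistsException lostTheRace) {
            // Another writer created 002.json first: re-read the log, rebase,
            // and retry the commit as 003.json (optimistic concurrency control).
            return false;
        }
    }
}

On S3 there is no such atomic create-if-absent primitive, which is why Delta coordinates S3 commits through DynamoDB (next slide).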
19. Storage system support
Delta relies on scalable cloud storage infra for ACID guarantees => no single point of failure.
Production-ready storage systems supported:
● HDFS, Azure, GCS: mutual exclusion out-of-the-box
● S3: using DynamoDB (in Delta 1.2)
[Diagram: the same Writer 1 / Writer 2 race to write 002.json as on the previous slide]
21. Delta Lake Key Features
● ACID Transactions: protect your data with serializability, the strongest level of isolation
● Scalable Metadata: handle petabyte-scale tables with billions of partitions and files with ease
● Unified Batch/Streaming: exactly-once semantics from ingestion to backfill to interactive queries
● Schema Evolution / Enforcement: prevent bad data from causing data corruption
● Time Travel: access or revert to earlier versions of data for audits, rollbacks, or reproducing experiments
● Constraints and Generated Columns: ensure data always meets semantic requirements
● Audit History: Delta Lake logs all change details, providing a full audit trail
● DML Operations: SQL, Scala/Java and Python APIs to merge, update and delete datasets
23. Delta Standalone
A pure, non-Spark Java library to read/write Delta logs; the basis of almost all non-Spark connectors.
https://github.com/delta-io/connectors#delta-standalone
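For a flavor of the API, here is a minimal sketch that reads a table's latest snapshot with Delta Standalone (the table path is hypothetical; see the README above for the authoritative API):

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import io.delta.standalone.actions.AddFile;
import org.apache.hadoop.conf.Configuration;

public class StandaloneReadSketch {
    public static void main(String[] args) {
        // Connect to the Delta log of a table at a (hypothetical) path.
        DeltaLog log = DeltaLog.forTable(new Configuration(), "s3a://bucket/pathToTable");

        // A Snapshot is a consistent view of the table at a single log version.
        Snapshot snapshot = log.update();
        System.out.println("table version = " + snapshot.getVersion());

        // List the Parquet data files that make up this snapshot.
        for (AddFile file : snapshot.getAllFiles()) {
            System.out.println(file.getPath());
        }
    }
}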
25. Native Flink Delta Lake Connector
Key components
• Delta Writer: Manage bucket writers for partitioned tables and pass incoming events
to the correct bucket writer.
• Delta Committable: Represents either one pending file to commit or one in-progress file to clean up.
• Delta Committer: Responsible for committing the “pending” files and moving them to
a “finished” state, so they can be consumed by downstream applications or systems.
• Delta Global Committer: The Global Committer combines multiple lists of
DeltaCommittables received from multiple DeltaCommitters and commits all files to
the Delta log.
26. Flink: Delta Sink
● Available since Delta Connectors 0.4
● Writes from DataStream<RowData> in batch or streaming modes
● Supports writing by table path on ADLS, GCS and S3
● Support for S3 multi-cluster using DynamoDB coming in Connectors 0.5
● Gives exactly-once guarantees with replayable sources

DeltaSink<RowData> deltaSink = DeltaSink
    .forRowData(path, hadoopConf, rowType)
    .withPartitionColumns(...)
    .build();
datastream.sinkTo(deltaSink);
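Wiring the sink into a complete job might look like the following sketch; the row schema, table path, and source are hypothetical, and checkpointing must be enabled for the exactly-once commit cycle:

import io.delta.flink.sink.DeltaSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

public class DeltaSinkJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Delta commits happen on Flink checkpoints; enable them for exactly-once.
        env.enableCheckpointing(10_000L);

        // Row schema of the target table: (id INT, value STRING).
        RowType rowType = RowType.of(
                new LogicalType[] {new IntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                new String[] {"id", "value"});

        DataStream<RowData> events = buildReplayableSource(env); // hypothetical source

        DeltaSink<RowData> deltaSink = DeltaSink
                .forRowData(new Path("s3a://bucket/pathToTable"), new Configuration(), rowType)
                .build();
        events.sinkTo(deltaSink);

        env.execute("flink-delta-sink-sketch");
    }

    // Placeholder: plug in any replayable source (e.g. Kafka) producing RowData.
    static DataStream<RowData> buildReplayableSource(StreamExecutionEnvironment env) {
        throw new UnsupportedOperationException("provide a replayable source");
    }
}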
27. Flink: Delta Source
● Coming with Delta Connectors 0.5
● Reads as DataStream<RowData> in bounded or continuous mode
● For bounded, supports querying old table versions (aka Time Travel)
● For continuous, supports reading the full table + changes, OR only changes since a version
● Supports all file systems*
● Support for catalog tables + SQL + Table API in progress

DeltaSource
    .forBoundedRowData(path, hadoopConf)
    .build();

// Time travel
DeltaSource
    .forBoundedRowData(path, hadoopConf)
    .timestampAsOf("2022-02-24 04:55:00")
    .build();

// Streaming
DeltaSource
    .forContinuousRowData(path, hadoopConf)
    .build();
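The builders also expose version-based variants; assuming the 0.5 API as documented in the connectors repo, a bounded read pinned to a version and a continuous read of only the changes after a version look like this:

// Time travel by version (bounded)
DeltaSource
    .forBoundedRowData(path, hadoopConf)
    .versionAsOf(10)
    .build();

// Only changes committed after version 10 (continuous)
DeltaSource
    .forContinuousRowData(path, hadoopConf)
    .startingVersion(10)
    .build();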