Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together let you build the foundation of your data lakehouse by ensuring the reliability of your concurrent streams, from processing down to the underlying cloud object store. The Flink/Delta Connector lets you store data in Delta tables so that you harness Delta's ACID transactions and scalability while maintaining Flink's end-to-end exactly-once processing. Data from Flink is written to Delta tables idempotently, so even if the Flink pipeline is restarted from its checkpoint information, the pipeline guarantees that no data is lost or duplicated, preserving Flink's exactly-once semantics.
by Scott Sandre & Denny Lee
2. About Scott
Software Engineer, Databricks
● Part of the Delta Ecosystem team
● Leads development of the Delta Standalone project
● Co-developer of the Flink/Delta Source and Sink
● Bachelor of Software Engineering, University of Waterloo
[Photo: Scott chilling with the stereotypical SF backdrop]
3. About Denny
Sr Staff Developer Advocate, Databricks
Has worked on Spark since 0.6 and Delta Lake since its inception
Previously:
● Senior Director of Data Science Engineering at SAP Concur
● Principal Program Manager at Microsoft for Azure Cosmos DB, Project Isotope (Azure HDInsight), and SQL Server
[Photo: Denny surprised that he's still awake]
4. An open-source storage format that brings ACID transactions to big data workloads on cloud object stores: the key ingredient for building Lakehouses
6. Data Warehouses were purpose-built for BI and reporting, however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
▪ Expensive to scale out
Therefore, most data is stored in data lakes & blob stores
[Diagram: External Data and Operational Data flow through ETL into Data Warehouses, which serve BI and Reports]
7. Data Lakes could handle all your data for data science and ML, however…
▪ Poor BI support
▪ Complex to set up
▪ Poor performance
▪ Unreliable data swamps
[Diagram: structured, semi-structured and unstructured data lands in the Data Lake, then flows through Data Prep and Validation plus ETL to serve a Real-Time Database, Reports, Data Warehouses, BI, Data Science, and Machine Learning]
8. Coexistence is not a desirable strategy
[Diagram: the Data Lake stack (Data Prep and Validation, ETL, Real-Time Database, Reports, Data Warehouses, BI, Data Science, Machine Learning) shown side by side with the Data Warehouse stack (External Data and Operational Data, ETL, Data Warehouses, BI Reports), duplicating pipelines across the two systems]
9. Lakehouse – best of both workloads
[Diagram: Data Warehouse and Data Lake combined into one Lakehouse serving Streaming Analytics, BI, Data Science, and Machine Learning over structured, semi-structured and unstructured data]
12. Lakehouse
One platform for every use case: Streaming Analytics, BI, Data Science, and Machine Learning
● High-performance query engine(s)
● Scalable, open, general-purpose transactional data format
● Data Lake for all your data: structured, semi-structured and unstructured
15. Scalable storage, scalable transaction log
pathToTable/
+---- 000.parquet
+---- 001.parquet
+---- 002.parquet
+---- ...
|
+---- _delta_log/
      +---- 000.json
      +---- 001.json
      ...
Table data is stored as Parquet files on cloud storage.
The _delta_log/ directory is a sequence of metadata files that track the operations made on the files in the table; it is stored in cloud storage along with the table, and readers can read and process the metadata in parallel.
16. Transaction Log Commits
Changes to the table are stored as ordered, atomic commits.
Each commit is a JSON file in _delta_log/ containing a set of actions, e.g.:
● 000.json (INSERT actions): Add 001.parquet, Add 002.parquet
● 001.json (UPDATE actions): Remove 001.parquet, Remove 002.parquet, Add 003.parquet
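To make this concrete, here is a simplified sketch of the actions inside a commit file, one JSON action per line (abbreviated; real commits in the Delta protocol carry additional fields such as partitionValues and stats):

{"add": {"path": "003.parquet", "size": 1024, "modificationTime": 1650000000000, "dataChange": true}}
{"remove": {"path": "001.parquet", "deletionTimestamp": 1650000000000, "dataChange": true}}
{"remove": {"path": "002.parquet", "deletionTimestamp": 1650000000000, "dataChange": true}}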
17. Consistent Snapshots
Readers read the log in atomic units, and thus read consistent snapshots.
Given the two commits above (INSERT actions adding 001.parquet and 002.parquet, then UPDATE actions removing both and adding 003.parquet), readers will read either
● [001 + 002].parquet, or
● 003.parquet
and nothing in-between.
18. ACID via Mutual Exclusion on Log Commits
Concurrent writers need to agree on the order of changes (optimistic concurrency control).
New commit files must be created mutually exclusively, using storage-specific API guarantees.
[Diagram: Writer 1 and Writer 2 both race to create 002.json after commits 000.json and 001.json; only one of the writers trying to concurrently write 002.json can succeed]
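As a minimal sketch of what mutually exclusive commit creation means, assuming an HDFS-like file system where create(path, overwrite=false) is atomic (the class name and commit-writing logic here are illustrative, not the connector's actual implementation):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitSketch {
    /** Try to write the next commit file; at most one concurrent writer wins. */
    static boolean tryCommit(FileSystem fs, Path commitFile, byte[] actionsJson)
            throws IOException {
        try (FSDataOutputStream out = fs.create(commitFile, /* overwrite = */ false)) {
            // create(..., overwrite=false) fails atomically if the file exists,
            // so two writers racing on 002.json cannot both succeed.
            out.write(actionsJson);
            return true;
        } catch (FileAlreadyExistsException lostTheRace) {
            // Another writer created 002.json first: re-read the log, rebase,
            // and retry the commit as 003.json (optimistic concurrency control).
            return false;
        }
    }
}

On S3 there is no such atomic create-if-absent primitive, which is why Delta coordinates S3 commits through DynamoDB (next slide).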
19. Storage system support
Delta relies on scalable cloud storage infra for ACID guarantees => no single point of failure.
Production-ready storage systems supported:
● HDFS, Azure, GCS: mutual exclusion out-of-the-box
● S3: using DynamoDB (in Delta 1.2)
[Diagram: the same Writer 1 / Writer 2 race to write 002.json as on the previous slide]
21. Delta Lake Key Features
● ACID Transactions: protect your data with serializability, the strongest level of isolation
● Scalable Metadata: handle petabyte-scale tables with billions of partitions and files with ease
● Unified Batch/Streaming: exactly-once semantics from ingestion to backfill to interactive queries
● Schema Evolution / Enforcement: prevent bad data from causing data corruption
● Time Travel: access or revert to earlier versions of data for audits, rollbacks, or reproducing experiments
● Constraints and Generated Columns: ensure data always meets semantic requirements
● Audit History: Delta Lake logs all change details, providing a full audit trail
● DML Operations: SQL, Scala/Java and Python APIs to merge, update and delete datasets
23. Delta Standalone
A pure, non-Spark Java library to read/write Delta logs; the basis of almost all non-Spark connectors.
https://github.com/delta-io/connectors#delta-standalone
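For a flavor of the API, here is a minimal sketch that reads a table's latest snapshot with Delta Standalone (the table path is hypothetical; see the README above for the authoritative API):

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import io.delta.standalone.actions.AddFile;
import org.apache.hadoop.conf.Configuration;

public class StandaloneReadSketch {
    public static void main(String[] args) {
        // Connect to the Delta log of a table at a (hypothetical) path.
        DeltaLog log = DeltaLog.forTable(new Configuration(), "s3a://bucket/pathToTable");

        // A Snapshot is a consistent view of the table at a single log version.
        Snapshot snapshot = log.update();
        System.out.println("table version = " + snapshot.getVersion());

        // List the Parquet data files that make up this snapshot.
        for (AddFile file : snapshot.getAllFiles()) {
            System.out.println(file.getPath());
        }
    }
}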
25. Native Flink Delta Lake Connector
Key components
• Delta Writer: Manage bucket writers for partitioned tables and pass incoming events
to the correct bucket writer.
• Delta Committable: Represents either one pending file to commit or one in-progress file to clean up.
• Delta Committer: Responsible for committing the “pending” files and moving them to
a “finished” state, so they can be consumed by downstream applications or systems.
• Delta Global Committer: The Global Committer combines multiple lists of
DeltaCommittables received from multiple DeltaCommitters and commits all files to
the Delta log.
26. Flink: Delta Sink
● Available since Delta Connectors 0.4
● Writes from DataStream<RowData> in batch or streaming modes
● Supports writing by table path on ADLS, GCS and S3
● Support for S3 multi-cluster using DynamoDB coming in Connectors 0.5
● Gives exactly-once guarantees with replayable sources

DeltaSink<RowData> deltaSink = DeltaSink
    .forRowData(path, hadoopConf, rowType)
    .withPartitionColumns(...)
    .build();
datastream.sinkTo(deltaSink);
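Wiring the sink into a complete job might look like the following sketch; the row schema, table path, and source are hypothetical, and checkpointing must be enabled for the exactly-once commit cycle:

import io.delta.flink.sink.DeltaSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

public class DeltaSinkJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Delta commits happen on Flink checkpoints; enable them for exactly-once.
        env.enableCheckpointing(10_000L);

        // Row schema of the target table: (id INT, value STRING).
        RowType rowType = RowType.of(
                new LogicalType[] {new IntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                new String[] {"id", "value"});

        DataStream<RowData> events = buildReplayableSource(env); // hypothetical source

        DeltaSink<RowData> deltaSink = DeltaSink
                .forRowData(new Path("s3a://bucket/pathToTable"), new Configuration(), rowType)
                .build();
        events.sinkTo(deltaSink);

        env.execute("flink-delta-sink-sketch");
    }

    // Placeholder: plug in any replayable source (e.g. Kafka) producing RowData.
    static DataStream<RowData> buildReplayableSource(StreamExecutionEnvironment env) {
        throw new UnsupportedOperationException("provide a replayable source");
    }
}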
27. Flink: Delta Source
● Coming with Delta Connectors 0.5
● Reads as DataStream<RowData> in bounded or continuous mode
● For bounded, supports querying old table versions (aka Time Travel)
● For continuous, supports reading the full table + changes, OR only changes since a version
● Supports all file systems*
● Support for catalog tables + SQL + Table API in progress

DeltaSource
    .forBoundedRowData(path, hadoopConf)
    .build();

// Time travel
DeltaSource
    .forBoundedRowData(path, hadoopConf)
    .timestampAsOf("2022-02-24 04:55:00")
    .build();

// Streaming
DeltaSource
    .forContinuousRowData(path, hadoopConf)
    .build();
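The builders also expose version-based variants; assuming the 0.5 API as documented in the connectors repo, a bounded read pinned to a version and a continuous read of only the changes after a version look like this:

// Time travel by version (bounded)
DeltaSource
    .forBoundedRowData(path, hadoopConf)
    .versionAsOf(10)
    .build();

// Only changes committed after version 10 (continuous)
DeltaSource
    .forContinuousRowData(path, hadoopConf)
    .startingVersion(10)
    .build();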