SlideShare uma empresa Scribd logo
1 de 31
Building Reliable Data Lakes with
Apache Flink and Delta Lake
Scott Sandre and Denny Lee
About Scott
Software Engineer, Databricks
● Part of the Delta Ecosystem team
● Lead development of Delta Standalone project
● Co-developer of Flink/Delta Source and Sink
● Bachelors of Software Engineering at University
of Waterloo Scott chilling with the stereotypical
SF backdrop
About Denny
Sr Staff Developer Advocate, Databricks
Have worked on Spark since 0.6 and Delta
Lake since its inception
Previously:
● Senior Director of Data Science Engineering at
SAP Concur
● Principal Program Manager at at Microsoft for
Azure Cosmos DB, Project Isotope (Azure
HDInsight), SQL Server
Denny surprised that he’s still
awake
an open-source storage format that brings
ACID transactions to big data workloads on
cloud object stores
the key ingredient of building Lakehouses
Evolution of Data Management
Why do we need Lakehouses?
Data Warehouses
were purpose-built
for BI and reporting,
however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
▪ Expensive to scale out
Therefore, most data is stored in data
lakes & blob stores
ETL
External Data Operational Data
Data Warehouses
BI Reports
Data Lakes
could handle all your data
for data science and ML,
however…
▪ Poor BI support
▪ Complex to set up
▪ Poor performance
▪ Unreliable data swamps
BI
Data
Science
Machine
Learning
Structured, Semi-Structured and Unstructured
Data
Data Lake
Real-Time
Database
Reports
Data
Warehouses
Data Prep and
Validation
ETL
Coexistence is not a desirable strategy
BI
Data
Science
Machine
Learning
Structured, Semi-Structured and Unstructured
Data
Data Lake
Real-Time
Database
Reports
Data
Warehouses
Data Prep and
Validation
ETL
ETL
External Data Operational Data
Data Warehouses
BI Reports
Lakehouse – best of both workloads
Data Warehouse Data Lake
Streaming
Analytics
BI Data
Science
Machine
Learning
Structured, Semi-Structured and Unstructured
Data
Streaming
Analytics
BI Data
Science
Machine
Learning
Structured, Semi-Structured and Unstructured
Data
Data Lake for all your data
One platform for every use case
Structured transactional layer
Lakehouse
Lakehouse
Streaming
Analytics
BI Data
Science
Machine
Learning
Structured, Semi-Structured and Unstructured
Data
Data Lake for all your data
One platform for every use case
Scalable, open, general purpose
transactional data format
Lakehouse
High perf query engine(s)
One platform for every use case
Streaming
Analytics
BI Data
Science
Machine
Learning
Data Lake for all your data
Structured, Semi-Structured and Unstructured
Data
Scalable, open, general purpose
transactional data format
Architecture
Why is it scalable?
Scalable storage
Scalable transaction log
+
=
for metadata
for data
Scalable storage
Scalable transaction log
pathToTable/
+---- 000.parquet
+---- 001.parquet
+---- 002.parquet
+ ...
table data stored as Parquet files
on cloud storage
sequence of metadata files to track
operations made on files in the table
stored in cloud storage along with table
read and process metadata in parallel
|
+---- _delta_log/
+---- 000.json
+---- 001.json
...
|
+---- _delta_log/
+---- 000.json
+---- 001.json
...
Transaction Log Commits
Changes to the table are
stored as ordered,
atomic commits
Each commit is JSON
file in _delta_log with a
set of actions
Add 001.parquet
Add 002.parquet
Remove 001.parquet
Remove 002.parquet
Add 003.parquet
UPDATE actions
INSERT actions
|
+---- _delta_log/
+---- 000.json
+---- 001.json
...
Consistent Snapshots
UPDATE actions
INSERT actions
Readers read the log
in atomic units thus
reading consistent
snapshots
readers will read either
● [001+002].parquet or
● 003.parquet
and nothing in-between
Add 001.parquet
Add 002.parquet
Remove 001.parquet
Remove 002.parquet
Add 003.parquet
ACID via Mutual Exclusion on Log
Commits
Concurrent writers need to agree
on the order of changes
(optimistic concurrency control)
New commit files must be created
mutually exclusively using
storage-specific API guarantees
000.json
001.json
002.json
Writer 1 Writer 2
only one of the writers trying
to concurrently write 002.json
must succeed
Storage system support
Delta relies on scalable cloud storage
infra for ACID guarantees =>
No single point of failure
Production-ready
Storage systems supported
HDFS, Azure, GCS: has mutex out-of-the-box
S3: using DynamoDB (in Delta 1.2)
000.json
001.json
002.json
Writer 1 Writer 2
only one of the writers trying
to concurrently write 002.json
must succeed
Features
What makes it unique?
Constraints and
Generated Columns
Ensure data always meets
semantic requirements
Time Travel
Access/revert to earlier
versions of data for audits,
rollbacks, or reproduce
Delta Lake Key Features
ACID Transactions
Protect your data with
serializability, the strongest
level of isolation.
Scalable Metadata
Handle petabyte-scale
tables with billions of
partitions and files at ease
Unified
Batch/Streaming
Exactly once semantics
ingestion to backfill to
interactive queries
Schema Evolution /
Enforcement
Prevent bad data from
causing data corruption
Audit History
Delta Lake log all change
details providing a full
audit trail
DML Operations
SQL, Scala/Java and
Python APIs to merge,
update and delete
datasets
Delta’s expanding ecosystem of connectors
Delta Standalone
Basis of almost all non-Spark connectors
Delta Standalone
Pure, non-Spark Java library to read/write Delta logs
https://github.com/delta-io/connectors#delta-standalone
built
using
Native Flink Delta Lake Connector
Native Flink Delta Lake Connector
Key components
• Delta Writer: Manage bucket writers for partitioned tables and pass incoming events
to the correct bucket writer.
• Delta Committable: This committable is either for one pending file to commit or one
in-progress file to clean up.
• Delta Committer: Responsible for committing the “pending” files and moving them to
a “finished” state, so they can be consumed by downstream applications or systems.
• Delta Global Committer: The Global Committer combines multiple lists of
DeltaCommittables received from multiple DeltaCommitters and commits all files to
the Delta log.
Flink: Delta Sink
Available since Delta Connectors 0.4
Writes from DataStream<RowData>
in batch or streaming modes
Supports writing by table path on
ADLS, GCS and S3
Support for S3 multi-cluster using
DynamoDB coming in Connectors 0.5
Gives exactly once guarantees with
replayable sources
DeltaSink<RowData> deltaSink = DeltaSink
.forRowData(path, hadoopConf, rowType)
.withPartitionColumns(...)
.build();
datastream.sinkTo(deltaSink);
Flink: Delta Source
Coming with Delta Connectors 0.5
Reads as DataStream<RowData>
in bounded or continuous mode
For bounded, supports querying old
table versions (aka Time Travel)
For continuous, supports reading
full table + changes, OR only
changes since a version
Supports all file systems*
Support for catalog tables + SQL
+ Table API in progress
DeltaSource
.forBoundedRowData(path, hadoopConf)
.build();
// Time travel
DeltaSource
.forBoundedRowData(path, hadoopConf)
.timestampAsOf("2022-02-24 04:55:00")
.build();
// Streaming
DeltaSource
.forContinuousRowData(path, hadoopConf)
.build();
Delta
Community
The best part about Delta!
©2022 Databricks Inc. — All rights reserved
The most widely used lakehouse format in the world
2021 2022
7M Downloads/Month
Delta 1.2
delta.rs 0.4
delta.rs
Python 0.5
delta.rs
Python 0.5.7
Delta Connectors 0.4
Delta Connectors 0.3
kafka-delta-ingest
0.3
Delta Sharing 0.1
Delta 1.0
650K/months
Delta Sharing 0.2
Delta Sharing 0.3
Delta 1.1
Delta Sharing 0.4
©2022 Databricks Inc. — All rights reserved
We could not do this without the community!
©2022 Databricks Inc. — All rights reserved
Join the community today!
delta.io
go.delta.io/github
go.delta.io/slack
go.delta.io/twitter

Mais conteúdo relacionado

Mais procurados

Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium confluent
 

Mais procurados (20)

Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
 

Semelhante a Building Reliable Lakehouses with Apache Flink and Delta Lake

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...StreamNative
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...StreamNative
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...StreamNative
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Informatica Power Center 7.1
Informatica Power Center 7.1Informatica Power Center 7.1
Informatica Power Center 7.1ganblues
 
Speeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachSpeeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachDatabricks
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 
OPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATIONOPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATIONSUMIT KUMAR
 
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionMicrosoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionTravis Wright
 

Semelhante a Building Reliable Lakehouses with Apache Flink and Delta Lake (20)

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
What Is Delta Lake ???
What Is Delta Lake ???What Is Delta Lake ???
What Is Delta Lake ???
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Informatica Power Center 7.1
Informatica Power Center 7.1Informatica Power Center 7.1
Informatica Power Center 7.1
 
Speeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT ApproachSpeeding Time to Insight with a Modern ELT Approach
Speeding Time to Insight with a Modern ELT Approach
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
OPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATIONOPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATION
 
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionMicrosoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
 

Mais de Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!Flink Forward
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 

Mais de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 

Último

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Building Reliable Lakehouses with Apache Flink and Delta Lake

  • 1. Building Reliable Data Lakes with Apache Flink and Delta Lake Scott Sandre and Denny Lee
  • 2. About Scott Software Engineer, Databricks ● Part of the Delta Ecosystem team ● Lead development of Delta Standalone project ● Co-developer of Flink/Delta Source and Sink ● Bachelors of Software Engineering at University of Waterloo Scott chilling with the stereotypical SF backdrop
  • 3. About Denny Sr Staff Developer Advocate, Databricks Have worked on Spark since 0.6 and Delta Lake since its inception Previously: ● Senior Director of Data Science Engineering at SAP Concur ● Principal Program Manager at at Microsoft for Azure Cosmos DB, Project Isotope (Azure HDInsight), SQL Server Denny surprised that he’s still awake
  • 4. an open-source storage format that brings ACID transactions to big data workloads on cloud object stores the key ingredient of building Lakehouses
  • 5. Evolution of Data Management Why do we need Lakehouses?
  • 6. Data Warehouses were purpose-built for BI and reporting, however… ▪ No support for video, audio, text ▪ No support for data science, ML ▪ Limited support for streaming ▪ Closed & proprietary formats ▪ Expensive to scale out Therefore, most data is stored in data lakes & blob stores ETL External Data Operational Data Data Warehouses BI Reports
  • 7. Data Lakes could handle all your data for data science and ML, however… ▪ Poor BI support ▪ Complex to set up ▪ Poor performance ▪ Unreliable data swamps BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lake Real-Time Database Reports Data Warehouses Data Prep and Validation ETL
  • 8. Coexistence is not a desirable strategy BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lake Real-Time Database Reports Data Warehouses Data Prep and Validation ETL ETL External Data Operational Data Data Warehouses BI Reports
  • 9. Lakehouse – best of both workloads Data Warehouse Data Lake Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data
  • 10. Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lake for all your data One platform for every use case Structured transactional layer Lakehouse
  • 11. Lakehouse Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lake for all your data One platform for every use case Scalable, open, general purpose transactional data format
  • 12. Lakehouse High perf query engine(s) One platform for every use case Streaming Analytics BI Data Science Machine Learning Data Lake for all your data Structured, Semi-Structured and Unstructured Data Scalable, open, general purpose transactional data format
  • 14. Scalable storage Scalable transaction log + = for metadata for data
  • 15. Scalable storage Scalable transaction log pathToTable/ +---- 000.parquet +---- 001.parquet +---- 002.parquet + ... table data stored as Parquet files on cloud storage sequence of metadata files to track operations made on files in the table stored in cloud storage along with table read and process metadata in parallel | +---- _delta_log/ +---- 000.json +---- 001.json ...
  • 16. | +---- _delta_log/ +---- 000.json +---- 001.json ... Transaction Log Commits Changes to the table are stored as ordered, atomic commits Each commit is JSON file in _delta_log with a set of actions Add 001.parquet Add 002.parquet Remove 001.parquet Remove 002.parquet Add 003.parquet UPDATE actions INSERT actions
  • 17. | +---- _delta_log/ +---- 000.json +---- 001.json ... Consistent Snapshots UPDATE actions INSERT actions Readers read the log in atomic units thus reading consistent snapshots readers will read either ● [001+002].parquet or ● 003.parquet and nothing in-between Add 001.parquet Add 002.parquet Remove 001.parquet Remove 002.parquet Add 003.parquet
  • 18. ACID via Mutual Exclusion on Log Commits Concurrent writers need to agree on the order of changes (optimistic concurrency control) New commit files must be created mutually exclusively using storage-specific API guarantees 000.json 001.json 002.json Writer 1 Writer 2 only one of the writers trying to concurrently write 002.json must succeed
  • 19. Storage system support Delta relies on scalable cloud storage infra for ACID guarantees => No single point of failure Production-ready Storage systems supported HDFS, Azure, GCS: has mutex out-of-the-box S3: using DynamoDB (in Delta 1.2) 000.json 001.json 002.json Writer 1 Writer 2 only one of the writers trying to concurrently write 002.json must succeed
  • 21. Constraints and Generated Columns Ensure data always meets semantic requirements Time Travel Access/revert to earlier versions of data for audits, rollbacks, or reproduce Delta Lake Key Features ACID Transactions Protect your data with serializability, the strongest level of isolation. Scalable Metadata Handle petabyte-scale tables with billions of partitions and files at ease Unified Batch/Streaming Exactly once semantics ingestion to backfill to interactive queries Schema Evolution / Enforcement Prevent bad data from causing data corruption Audit History Delta Lake log all change details providing a full audit trail DML Operations SQL, Scala/Java and Python APIs to merge, update and delete datasets
  • 23. Delta Standalone Basis of almost all non-Spark connectors Delta Standalone Pure, non-Spark Java library to read/write Delta logs https://github.com/delta-io/connectors#delta-standalone built using
  • 24. Native Flink Delta Lake Connector
  • 25. Native Flink Delta Lake Connector Key components • Delta Writer: Manage bucket writers for partitioned tables and pass incoming events to the correct bucket writer. • Delta Committable: This committable is either for one pending file to commit or one in-progress file to clean up. • Delta Committer: Responsible for committing the “pending” files and moving them to a “finished” state, so they can be consumed by downstream applications or systems. • Delta Global Committer: The Global Committer combines multiple lists of DeltaCommittables received from multiple DeltaCommitters and commits all files to the Delta log.
  • 26. Flink: Delta Sink Available since Delta Connectors 0.4 Writes from DataStream<RowData> in batch or streaming modes Supports writing by table path on ADLS, GCS and S3 Support for S3 multi-cluster using DynamoDB coming in Connectors 0.5 Gives exactly once guarantees with replayable sources DeltaSink<RowData> deltaSink = DeltaSink .forRowData(path, hadoopConf, rowType) .withPartitionColumns(...) .build(); datastream.sinkTo(deltaSink);
  • 27. Flink: Delta Source Coming with Delta Connectors 0.5 Reads as DataStream<RowData> in bounded or continuous mode For bounded, supports querying old table versions (aka Time Travel) For continuous, supports reading full table + changes, OR only changes since a version Supports all file systems* Support for catalog tables + SQL + Table API in progress DeltaSource .forBoundedRowData(path, hadoopConf) .build(); // Time travel DeltaSource .forBoundedRowData(path, hadoopConf) .timestampAsOf("2022-02-24 04:55:00") .build(); // Streaming DeltaSource .forContinuousRowData(path, hadoopConf) .build();
  • 29. ©2022 Databricks Inc. — All rights reserved The most widely used lakehouse format in the world 2021 2022 7M Downloads/Month Delta 1.2 delta.rs 0.4 delta.rs Python 0.5 delta.rs Python 0.5.7 Delta Connectors 0.4 Delta Connectors 0.3 kafka-delta-ingest 0.3 Delta Sharing 0.1 Delta 1.0 650K/months Delta Sharing 0.2 Delta Sharing 0.3 Delta 1.1 Delta Sharing 0.4
  • 30. ©2022 Databricks Inc. — All rights reserved We could not do this without the community!
  • 31. ©2022 Databricks Inc. — All rights reserved Join the community today! delta.io go.delta.io/github go.delta.io/slack go.delta.io/twitter