Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

Paris Data Engineers !
The Delta Architecture
Quentin Ambard
quentin.ambard@databricks.com
Databricks Workspace
Collaborative Notebooks, production jobs & business insights
Managed platform
Cloud Native
Databricks: Unified Data Analytics Platform
ML Runtime
For your Big data and Machine Learning Lifecycle
...
The Delta Agenda
● A typical Data Lake Architecture
● The Delta Architecture
● Inside Delta Lake
● Demo
Enterprises have been spending millions
of dollars getting data into data lakes
Data Lake
The aspiration is to do data science and
ML on all that data using Apache Spark!
Data Lake
Data Science & ML
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Data Lake
But the data is not ready for data science & ML
The majority of these projects are failing due to
complex pipelines and unreliable data!
Data Science & ML
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
What does a typical
data lake project look like?
Evolution of a Cutting-Edge Data Lake
[Diagram labels: Events, ?, Data Lake, Streaming Analytics, AI & Reporting]
Challenge #1: Historical Queries?
[Diagram labels: Events, λ-arch (marker 1), Data Lake, Streaming Analytics, AI & Reporting]
Challenge #2: Messy Data?
[Diagram labels: Events, λ-arch, Validation, Reprocessing, Data Lake, Streaming Analytics, AI & Reporting; numbered markers 1 and 2]
Challenge #3: Mistakes and Failures?
[Diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Data Lake, Streaming Analytics, AI & Reporting; numbered markers 1 to 3]
Challenge #4: Updates?
[Diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Updates GDPR..., UPDATE & MERGE, Scheduled to Avoid Modifications, Data Lake, Streaming Analytics, AI & Reporting; numbered markers 1 to 4]
Challenge #5: Stability at scale?
[Diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Small files, Updates GDPR..., UPDATE & MERGE, Scheduled to Avoid Modifications, Data Lake, Streaming Analytics, AI & Reporting; numbered markers 1 to 5]
Data reliability challenges with data lakes
No atomicity: failed jobs leave data in a corrupt state, requiring tedious recovery
No quality enforcement: creates inconsistent and low-quality data
Lack of consistency / isolation: makes it almost impossible to mix deletes, appends and reads, batch and streaming
A New Standard for Building Data Lakes
Let’s try it instead with Delta Lake:
● Open format based on Parquet
● By the creators of Apache Spark
● With transactions
● Using Spark APIs
Is there a better architecture?
[Diagram: the full pipeline from Challenge #5, with numbered markers 1 to 5]
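[Diagram: the Delta architecture. Sources (CSV, JSON, TXT…, Kinesis) feed three table stages inside the Data Lake: Bronze (Raw Ingestion), Silver (Filtered, Cleaned, Augmented), and Gold (Business-level Aggregates), which serve Streaming Analytics and AI & Reporting]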
Quality
Delta Lake allows you to improve the quality of your
data until it is ready for consumption.
Bronze: raw data with minimal parsing.
Supports long retention (years).
Silver: intermediate data with some cleanup applied.
Schema enforcement/evolution, data expectations.
Queryable for easy debugging!
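Schema enforcement at this stage can be pictured with a small pure-Python sketch. It is only an illustration of the idea, not Delta Lake code: the schema `EXPECTED_SCHEMA` and the helpers `validate` and `enforce` are hypothetical names introduced here.

```python
# Hypothetical table schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"customerId": int, "address": str}

def validate(rows, schema):
    """Split rows into (accepted, rejected) according to a strict schema.

    A row is accepted only if it has exactly the expected columns and
    every value has the expected Python type.
    """
    accepted, rejected = [], []
    for row in rows:
        ok = row.keys() == schema.keys() and all(
            isinstance(row[col], typ) for col, typ in schema.items()
        )
        (accepted if ok else rejected).append(row)
    return accepted, rejected

def enforce(rows, schema):
    """Reject the whole batch if any row violates the schema."""
    accepted, rejected = validate(rows, schema)
    if rejected:
        raise ValueError(f"{len(rejected)} row(s) do not match the schema")
    return accepted
```

`enforce` mirrors the "insert fails" behaviour, while `validate` shows how bad rows could instead be set aside for inspection.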
Gold: clean data, ready for consumption.
Read with Spark, Presto, Glue*
*Coming Soon
• Full ACID Transactions
• Open Source (Apache License)
• Powered by Apache Spark
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
Delta Lake also supports batch jobs and standard DML while streams run
INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Retention
• Corrections
• GDPR
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
How do I use Delta Lake?
Get Started with Delta using Spark APIs
Instead of parquet...
dataframe
  .write
  .format("parquet")
  .save("/data")
… simply say delta
dataframe
  .write
  .format("delta")
  .save("/data")
Add Spark Package
pyspark --packages io.delta:delta-core_2.12:0.1.0
bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0
Maven
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>0.1.0</version>
</dependency>
How does Delta Lake work?
Delta On Disk
my_table/
  _delta_log/          Transaction Log
    00000.json         Table Versions
    00001.json
  date=2019-01-01/     (Optional) Partition Directories
    file-1.parquet     Data Files
Log Structured Storage
Changes to the table are stored as ordered, atomic units called commits
000000.json: Add 1.parquet, Add 2.parquet
000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
…
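To make the commit log concrete, here is a minimal sketch of replaying ordered commits to find the files that currently make up the table. The action format (`{"add": ...}` / `{"remove": ...}`) is a simplified assumption for illustration, not Delta Lake's actual log schema, and `replay` is a hypothetical helper.

```python
# Toy commits mirroring the slide: version -> ordered list of actions.
# The action format is a simplified assumption, not Delta Lake's real schema.
COMMITS = {
    0: [{"add": "1.parquet"}, {"add": "2.parquet"}],
    1: [{"remove": "1.parquet"}, {"remove": "2.parquet"}, {"add": "3.parquet"}],
}

def replay(commits):
    """Apply commits in version order and return the set of live data files."""
    live = set()
    for version in sorted(commits):
        for action in commits[version]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live

current_files = replay(COMMITS)
print(sorted(current_files))  # ['3.parquet']
```

Because each commit is applied as a whole and in order, a reader sees either the table before a commit or after it, never a half-applied state.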
Handling Massive Metadata
Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling!
[Diagram: the transaction log (…, 0009.json, 0010.json, checkpoint-1.parquet, 0011.json, …), where actions such as Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet are summarized into a Checkpoint]
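The role of a checkpoint can be sketched as follows, under the simplifying assumption that a checkpoint is just the set of live files as of some version. A reader then starts from the checkpoint and replays only the later commits. The names `apply_actions` and `snapshot`, and the sample file names, are hypothetical.

```python
def apply_actions(live, actions):
    """Apply one commit's (kind, path) actions to a set of live files."""
    for kind, path in actions:
        if kind == "add":
            live.add(path)
        else:  # "remove"
            live.discard(path)
    return live

def snapshot(commits, checkpoint=None):
    """Rebuild the live file set, starting from a checkpoint if one exists.

    commits:    dict version -> list of (kind, path) actions
    checkpoint: optional (version, live_files) pair summarizing every
                commit up to and including that version
    """
    if checkpoint is None:
        start, live = -1, set()
    else:
        start, live = checkpoint[0], set(checkpoint[1])
    for version in sorted(v for v in commits if v > start):
        apply_actions(live, commits[version])
    return live

commits = {
    9: [("add", "a.parquet")],
    10: [("add", "b.parquet"), ("remove", "a.parquet")],
    11: [("add", "c.parquet")],
}
# Hypothetical checkpoint written after version 10.
checkpoint = (10, {"b.parquet"})
print(sorted(snapshot(commits, checkpoint)))  # ['b.parquet', 'c.parquet']
```

With the checkpoint, only commit 11 is replayed, and the result matches a full replay of commits 9 to 11.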
Delta Lake ensures data reliability
[Diagram: Batch, Streaming, and Updates/Deletes feed Parquet Files plus a Transactional Log, giving high quality & reliable data always ready for analytics]
Key Features
● ACID Transactions / full DML
● Data quality
● Unified Batch & Streaming
● Time Travel/Data Snapshots
Support concurrent operations
Notebook/User 1:
SELECT * FROM customers WHERE firstname='xxx'
Notebook/User 2:
INSERT INTO customers (firstname, …) VALUES ('marc', …)
Notebook/User 3:
DELETE FROM customers WHERE firstname='quentin'
Isolation level: WriteSerializable
Delta resolves conflicts optimistically
Conflicting concurrent modifications on a table trigger a rollback
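Optimistic conflict handling can be illustrated with a toy log in a local directory. The sketch assumes that publishing a commit means creating the next numbered file, and that file creation fails if the file already exists. It is a simplification for illustration, not Delta Lake's actual commit protocol; `try_commit` and `commit_with_retry` are hypothetical helpers.

```python
import json
import os
import tempfile

def try_commit(log_dir, version, actions):
    """Attempt to publish commit `version`; return True on success.

    Opening with mode "x" fails if the file already exists, so only one
    writer can claim a given version number.
    """
    path = os.path.join(log_dir, f"{version:06d}.json")
    try:
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False

def commit_with_retry(log_dir, read_version, actions, max_attempts=5):
    """Optimistically commit after `read_version`, moving on after a lost race.

    A real system would first check whether the winning commit logically
    conflicts with ours; this sketch always retries at the next version.
    """
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        version += 1
    raise RuntimeError("too many concurrent commits")

log_dir = tempfile.mkdtemp()
# Two writers both read the empty table (version -1) and race for version 0.
v_a = commit_with_retry(log_dir, -1, [{"add": "a.parquet"}])
v_b = commit_with_retry(log_dir, -1, [{"add": "b.parquet"}])
print(v_a, v_b)  # 0 1
```

Writers do no locking up front: each one works against the version it read and only finds out at commit time whether someone else got there first.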
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
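The matched / not-matched semantics of the MERGE statement can be mimicked on plain Python rows. This is only a sketch of what the statement means, not how Delta Lake executes it; the `merge` helper and the sample rows are hypothetical.

```python
def merge(target, updates, key="customerId"):
    """Upsert `updates` into `target`, both given as lists of dict rows.

    Matched rows get their address updated; unmatched rows are inserted.
    Returns a new list and leaves the inputs untouched.
    """
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:   # WHEN MATCHED THEN UPDATE
            by_key[row[key]]["address"] = row["address"]
        else:                    # WHEN NOT MATCHED THEN INSERT
            by_key[row[key]] = {key: row[key], "address": row["address"]}
    return list(by_key.values())

customers = [
    {"customerId": 1, "address": "old street"},
    {"customerId": 2, "address": "main road"},
]
updates = [
    {"customerId": 1, "address": "new street"},
    {"customerId": 3, "address": "third avenue"},
]
merged = merge(customers, updates)
```

Customer 1 is updated in place, customer 2 is left alone, and customer 3 is inserted.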
Ensure Data Quality*
Enforce metadata, schema, and quality declaratively.
Inserts will fail if data doesn’t respect schema or quality
table("warehouse")
  .location(…) // Location on DBFS
  .schema(my_schema) // Optional strict schema checking
  .metastoreName(…) // Registration in Hive Metastore
  .description(…) // Human readable description for users
  .expect("validTimestamp", // Expectations on data quality*
    "timestamp > 2012-01-01 AND …",
    "fail / alert / quarantine")
*Coming Soon
Unified batch and streaming
Concurrent stream/batch with exactly-once processing guarantee
Join stream with table/stream
[Diagram: sources (CSV, JSON, TXT…, Kinesis) feed Bronze, Silver, and Gold tables in the Data Lake, serving Streaming Analytics and AI & Reporting]
Time Travel
Reproduce experiments & reports:
SELECT count(*) FROM events
TIMESTAMP AS OF timestamp
SELECT count(*) FROM events
VERSION AS OF version
spark.read.format("delta").option("timestampAsOf",
  timestamp_string).load("/events/")
Rollback accidental bad writes:
INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF
  date_sub(current_date(), 1)
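One way to picture time travel is as a replay of the transaction log that stops at the requested version. The sketch below uses a toy log with a simplified `(kind, path)` action format; `snapshot_as_of` is a hypothetical helper, not a Delta Lake API.

```python
def snapshot_as_of(commits, version):
    """Return the live file set as of `version` by replaying commits 0..version."""
    live = set()
    for v in sorted(commits):
        if v > version:
            break
        for kind, path in commits[v]:
            if kind == "add":
                live.add(path)
            else:  # "remove"
                live.discard(path)
    return live

commits = {
    0: [("add", "1.parquet"), ("add", "2.parquet")],
    1: [("remove", "1.parquet"), ("remove", "2.parquet"), ("add", "3.parquet")],
}
print(sorted(snapshot_as_of(commits, 0)))  # ['1.parquet', '2.parquet']
print(sorted(snapshot_as_of(commits, 1)))  # ['3.parquet']
```

Reading an old version this way works only as long as the data files it references are still retained on storage.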
Demo time!
Workshop Delta & MLflow
Thursday, November 7
9:00 to 12:30
https://dbricks.co/workshop-databricks
1 de 42

Recomendados

Intro to Delta Lake por
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
1.5K visualizações22 slides
A Thorough Comparison of Delta Lake, Iceberg and Hudi por
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
11.1K visualizações27 slides
Delta lake and the delta architecture por
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
1K visualizações22 slides
Building Lakehouses on Delta Lake with SQL Analytics Primer por
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
428 visualizações32 slides
Making Apache Spark Better with Delta Lake por
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
5.4K visualizações40 slides
Databricks Delta Lake and Its Benefits por
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
5.1K visualizações21 slides

Mais conteúdo relacionado

Mais procurados

Spark with Delta Lake por
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
294 visualizações21 slides
The delta architecture por
The delta architectureThe delta architecture
The delta architecturePrakash Chockalingam
538 visualizações41 slides
Achieving Lakehouse Models with Spark 3.0 por
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
621 visualizações25 slides
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga... por
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
137 visualizações23 slides
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... por
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
8.4K visualizações48 slides
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic por
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
74 visualizações42 slides

Mais procurados(20)

Spark with Delta Lake por Knoldus Inc.
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.294 visualizações
Achieving Lakehouse Models with Spark 3.0 por Databricks
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks621 visualizações
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga... por DataScienceConferenc1
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
DataScienceConferenc1137 visualizações
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... por Databricks
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks8.4K visualizações
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic por DataScienceConferenc1
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
DataScienceConferenc174 visualizações
Massive Data Processing in Adobe Using Delta Lake por Databricks
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks719 visualizações
Large Scale Lakehouse Implementation Using Structured Streaming por Databricks
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks489 visualizações
Change Data Feed in Delta por Databricks
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks1.6K visualizações
Architect’s Open-Source Guide for a Data Mesh Architecture por Databricks
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks3.1K visualizações
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 por StreamNative
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative536 visualizações
Introduction SQL Analytics on Lakehouse Architecture por Databricks
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks5.8K visualizações
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop por Databricks
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks6.3K visualizações
Data Lakehouse Symposium | Day 4 por Databricks
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks1.8K visualizações
DW Migration Webinar-March 2022.pptx por Databricks
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks4.3K visualizações
Databricks Platform.pptx por Alex Ivy
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy3.3K visualizações
Common Strategies for Improving Performance on Your Delta Lakehouse por Databricks
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks716 visualizações
Iceberg: A modern table format for big data (Strata NY 2018) por Ryan Blue
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue2K visualizações
Delta Lake with Azure Databricks por Dustin Vannoy
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy418 visualizações

Similar a Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust por
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustData Con LA
1.5K visualizações40 slides
Delta Lake: Open Source Reliability w/ Apache Spark por
Delta Lake: Open Source Reliability w/ Apache SparkDelta Lake: Open Source Reliability w/ Apache Spark
Delta Lake: Open Source Reliability w/ Apache SparkGeorge Chow
235 visualizações39 slides
Building Reliable Data Lakes at Scale with Delta Lake por
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
2K visualizações23 slides
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... por
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
1.5K visualizações50 slides
Apache CarbonData+Spark to realize data convergence and Unified high performa... por
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
418 visualizações34 slides
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud por
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
1K visualizações36 slides

Similar a Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard(20)

Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust por Data Con LA
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Data Con LA1.5K visualizações
Delta Lake: Open Source Reliability w/ Apache Spark por George Chow
Delta Lake: Open Source Reliability w/ Apache SparkDelta Lake: Open Source Reliability w/ Apache Spark
Delta Lake: Open Source Reliability w/ Apache Spark
George Chow235 visualizações
Building Reliable Data Lakes at Scale with Delta Lake por Databricks
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
Databricks2K visualizações
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... por Databricks
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks1.5K visualizações
Apache CarbonData+Spark to realize data convergence and Unified high performa... por Tech Triveni
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni418 visualizações
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud por Amazon Web Services
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services1K visualizações
Technical Deck Delta Live Tables.pdf por Ilham31574
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
Ilham31574236 visualizações
Simplify and Scale Data Engineering Pipelines with Delta Lake por Databricks
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
Databricks2.3K visualizações
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ... por Databricks
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
Databricks1.1K visualizações
Cloud Experience: Data-driven Applications Made Simple and Fast por Databricks
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
Databricks341 visualizações
Data & analytics challenges in a microservice architecture por Niels Naglé
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
Niels Naglé436 visualizações
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... por Databricks
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks444 visualizações
Continuous Intelligence - Intersecting Event-Based Business Logic and ML por Paris Carbone
Continuous Intelligence - Intersecting Event-Based Business Logic and MLContinuous Intelligence - Intersecting Event-Based Business Logic and ML
Continuous Intelligence - Intersecting Event-Based Business Logic and ML
Paris Carbone317 visualizações
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks por Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Databricks892 visualizações
What to Expect for Big Data and Apache Spark in 2017 por Databricks
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks4.2K visualizações
Cloud-based Data Lake for Analytics and AI por Torsten Steinbach
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
Torsten Steinbach170 visualizações
First in Class: Optimizing the Data Lake for Tighter Integration por Inside Analysis
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
Inside Analysis793 visualizações
Big Data Analytics Platforms by KTH and RISE SICS por Big Data Value Association
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Value Association114 visualizações
Azure Stream Analytics : Analyse Data in Motion por Ruhani Arora
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
Ruhani Arora694 visualizações
Spark + AI Summit 2020 イベント概要 por Paulo Gutierrez
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez457 visualizações

Mais de Paris Data Engineers !

Spark tools by Jonathan Winandy por
Spark tools by Jonathan WinandySpark tools by Jonathan Winandy
Spark tools by Jonathan WinandyParis Data Engineers !
215 visualizações19 slides
SCIO : Apache Beam API por
SCIO : Apache Beam APISCIO : Apache Beam API
SCIO : Apache Beam APIParis Data Engineers !
67 visualizações14 slides
Apache Beam de A à Z por
 Apache Beam de A à Z Apache Beam de A à Z
Apache Beam de A à ZParis Data Engineers !
291 visualizações47 slides
REX : pourquoi et comment développer son propre scheduler por
REX : pourquoi et comment développer son propre schedulerREX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerParis Data Engineers !
73 visualizações22 slides
Deeplearning in production por
Deeplearning in productionDeeplearning in production
Deeplearning in productionParis Data Engineers !
54 visualizações46 slides
Utilisation de MLflow pour le cycle de vie des projet Machine learning por
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningParis Data Engineers !
200 visualizações52 slides

Mais de Paris Data Engineers !(11)

REX : pourquoi et comment développer son propre scheduler por Paris Data Engineers !
REX : pourquoi et comment développer son propre schedulerREX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre scheduler
Paris Data Engineers !73 visualizações
Utilisation de MLflow pour le cycle de vie des projet Machine learning por Paris Data Engineers !
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Paris Data Engineers !200 visualizações
10 things i wish i'd known before using spark in production por Paris Data Engineers !
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
Paris Data Engineers !374 visualizações
Change Data Capture with Data Collector @OVH por Paris Data Engineers !
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
Paris Data Engineers !201 visualizações
Building highly reliable data pipeline @datadog par Quentin François por Paris Data Engineers !
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
Paris Data Engineers !554 visualizações
Scala pour le Data Engineering par Jonathan Winandy por Paris Data Engineers !
Scala pour le Data Engineering par Jonathan WinandyScala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan Winandy
Paris Data Engineers !100 visualizações

Último

Transcript: The Details of Description Techniques tips and tangents on altern... por
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...BookNet Canada
136 visualizações15 slides
Kyo - Functional Scala 2023.pdf por
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfFlavio W. Brasil
368 visualizações92 slides
The Research Portal of Catalonia: Growing more (information) & more (services) por
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
80 visualizações25 slides
SAP Automation Using Bar Code and FIORI.pdf por
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdfVirendra Rai, PMP
23 visualizações38 slides
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive por
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveNetwork Automation Forum
31 visualizações35 slides
Democratising digital commerce in India-Report por
Democratising digital commerce in India-ReportDemocratising digital commerce in India-Report
Democratising digital commerce in India-ReportKapil Khandelwal (KK)
15 visualizações161 slides

Último(20)

Transcript: The Details of Description Techniques tips and tangents on altern... por BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada136 visualizações
Kyo - Functional Scala 2023.pdf por Flavio W. Brasil
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdf
Flavio W. Brasil368 visualizações
SAP Automation Using Bar Code and FIORI.pdf por Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Virendra Rai, PMP23 visualizações
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive por Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Network Automation Forum31 visualizações
Democratising digital commerce in India-Report por Kapil Khandelwal (KK)
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard

  • 2. Databricks: Unified Data Analytics Platform, for your big data and machine learning lifecycle. Databricks Workspace: collaborative notebooks, production jobs & business insights. ML Runtime. Managed platform, cloud native ...
  • 3. The Delta Agenda: ● A typical Data Lake Architecture ● The Delta Architecture ● Inside Delta Lake ● Demo
  • 4. Enterprises have been spending millions of dollars getting data into data lakes.
  • 5. The aspiration is to do data science and ML on all that data using Apache Spark! Data Science & ML: • Recommendation Engines • Risk, Fraud Detection • IoT & Predictive Maintenance • Genomics & DNA Sequencing
  • 6. But the data is not ready for data science & ML. The majority of these projects are failing due to complex pipelines and unreliable data!
  • 7. What does a typical data lake project look like?
  • 8. Evolution of a Cutting-Edge Data Lake [Diagram: Events, ?, Data Lake, Streaming Analytics, AI & Reporting]
  • 9. Evolution of a Cutting-Edge Data Lake [Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting]
  • 10. Challenge #1: Historical Queries? [Diagram: Events, λ-arch, Data Lake, Streaming Analytics, AI & Reporting]
  • 11. Challenge #2: Messy Data? [Diagram: as slide 10, plus Validation]
  • 12. Challenge #3: Mistakes and Failures? [Diagram: as slide 11, plus Reprocessing, Partitioned]
  • 13. Challenge #4: Updates? [Diagram: as slide 12, plus Updates, GDPR..., UPDATE & MERGE, Scheduled to Avoid Modifications]
  • 14. Challenge #5: Stability at scale? [Diagram: as slide 13, plus Small files]
  • 15. Data reliability challenges with data lakes. No atomicity: failed jobs leave data in a corrupt state, requiring tedious recovery. No quality enforcement: creates inconsistent and low-quality data. Lack of consistency/isolation: makes it almost impossible to mix deletes, appends and reads, batch and streaming.
  • 16. Let’s try it instead with Delta Lake
  • 17. A New Standard for Building Data Lakes: ● Open format based on Parquet ● By the creators of Apache Spark ● With transactions ● Using Spark APIs
  • 18. Is there a better architecture? [Diagram: the full λ-architecture pipeline from slide 14, with challenges 1 to 5]
  • 19. The Delta Architecture [Diagram: CSV, JSON, TXT… and Kinesis → Raw Ingestion (Bronze) → Filtered, Cleaned, Augmented (Silver) → Business-level Aggregates (Gold) → Streaming Analytics, AI & Reporting]
  • 20. Quality: Delta Lake allows you to improve the quality of your data until it is ready for consumption.
  • 21. Bronze (raw ingestion): raw data with minimal parsing. Supports long retention (years).
  • 22. Silver (filtered, cleaned, augmented): intermediate data with some cleanup applied. Schema enforcement/evolution, data expectations. Queryable for easy debugging!
  • 23. Gold (business-level aggregates): clean data, ready for consumption. Read with Spark, Presto, Glue* (*Coming Soon)
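
The Bronze / Silver / Gold progression described on slides 21 to 23 can be sketched in plain Python. This is only an illustration of the idea, not Delta Lake code: the sample events, the field names (`user`, `country`) and the helper functions `to_silver` and `to_gold` are all assumptions made up for the example.

```python
import json
from collections import Counter

# Bronze: raw events kept as received, with minimal parsing (hypothetical data).
bronze = [
    '{"user": "a", "country": "FR"}',
    '{"user": "b", "country": "fr "}',
    'not json',
    '{"user": "c"}',
]

def to_silver(raw_rows):
    """Parse raw rows, drop malformed ones, and normalise the fields."""
    silver = []
    for raw in raw_rows:
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable input stays in Bronze only
        if not isinstance(row, dict) or "user" not in row or "country" not in row:
            continue  # rows missing required fields are filtered out
        silver.append({"user": row["user"], "country": row["country"].strip().upper()})
    return silver

def to_gold(silver_rows):
    """Business-level aggregate: number of users per country."""
    return dict(Counter(row["country"] for row in silver_rows))

silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Keeping the raw rows untouched in Bronze is what makes it possible to rebuild Silver and Gold later if the cleaning rules change.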
  • 24. Delta Lake: • Full ACID Transactions • Open Source (Apache License) • Powered by Apache Spark
  • 25. Streams move data through the Delta Lake: • Low-latency or manually triggered • Eliminates management of schedules and jobs
  • 26. Delta Lake also supports batch jobs and standard DML (INSERT, UPDATE, DELETE, MERGE, OVERWRITE) while streams run: • Retention • Corrections • GDPR
  • 27. Easy to recompute when business logic changes: • Clear tables (DELETE) • Restart streams
  • 28. How do I use Delta Lake?
  • 29. Get Started with Delta using Spark APIs
      Add the Spark package:
        pyspark --packages io.delta:delta-core_2.12:0.1.0
        bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0
      Maven:
        <dependency>
          <groupId>io.delta</groupId>
          <artifactId>delta-core_2.12</artifactId>
          <version>0.1.0</version>
        </dependency>
      Instead of parquet...
        dataframe.write.format("parquet").save("/data")
      … simply say delta
        dataframe.write.format("delta").save("/data")
  • 32. Log Structured Storage: changes to the table are stored as ordered, atomic units called commits.
      000000.json: Add 1.parquet, Add 2.parquet
      000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
      …
  • 33. Handling Massive Metadata: large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling! [Diagram: the Add/Remove actions of the transaction log (… 0009.json, 0010.json, 0011.json …) are rolled up into a checkpoint, checkpoint-1.parquet]
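
To make the log and checkpoint slides concrete, here is a minimal sketch of how a reader could rebuild the current set of data files by replaying commits, optionally starting from a checkpoint. It is a simplified model, not the real Delta Lake log format: the action encoding (`{"add": ...}` / `{"remove": ...}`), the in-memory `commits` dictionary and the functions `apply_commit` and `snapshot` are assumptions for illustration.

```python
import json

# Hypothetical stand-in for the log directory: version -> ordered JSON actions.
commits = {
    0: ['{"add": "1.parquet"}', '{"add": "2.parquet"}'],
    1: ['{"remove": "1.parquet"}', '{"remove": "2.parquet"}', '{"add": "3.parquet"}'],
}

def apply_commit(files, actions):
    """Return a new file set after applying one commit's actions in order."""
    files = set(files)
    for raw in actions:
        action = json.loads(raw)
        if "add" in action:
            files.add(action["add"])
        elif "remove" in action:
            files.discard(action["remove"])
    return files

def snapshot(commits, checkpoint=None):
    """Replay commits in version order, optionally starting from a checkpoint.

    checkpoint is a (version, files) pair: the table state as of that version.
    """
    start, files = (-1, set()) if checkpoint is None else checkpoint
    for version in sorted(v for v in commits if v > start):
        files = apply_commit(files, commits[version])
    return files

full = snapshot(commits)
# A checkpoint taken at version 0 lets a reader skip replaying commit 0.
checkpoint = (0, snapshot({0: commits[0]}))
from_checkpoint = snapshot(commits, checkpoint)
print(sorted(full), sorted(from_checkpoint))
```

Both paths end with the same file set; the checkpoint only shortens the replay, which is the point of the slide when the log holds millions of actions.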
  • 34. Delta Lake ensures data reliability [Diagram: Batch, Streaming, Updates/Deletes → Delta Lake (Parquet Files + Transactional Log) → High Quality & Reliable Data always ready for analytics]. Key Features: ● ACID Transactions / full DML ● Data quality ● Unified Batch & Streaming ● Time Travel/Data Snapshots
  • 35. Support concurrent operations
      Notebook/User 1: SELECT * FROM customers WHERE firstname='xxx'
      Notebook/User 2: INSERT INTO customers (firstname, …) VALUES ('marc', …)
      Notebook/User 3: DELETE FROM customers WHERE firstname='quentin'
  • 36. Support concurrent operations. Isolation level: WriteSerializable. Delta resolves conflicts optimistically: concurrent modifications on a table trigger a rollback.
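
The optimistic approach mentioned on slide 36 can be sketched as follows. This is a deliberately conservative toy model, not Delta Lake's actual conflict detection: the `Table` class, `try_commit` and `ConflictError` are hypothetical names, and the sketch rejects any writer whose read version is stale, whereas a real implementation may accept concurrent commits that do not actually conflict.

```python
class ConflictError(Exception):
    """Raised when another writer committed after we read the table."""

class Table:
    def __init__(self):
        self.log = []  # committed actions; the list index is the version number

    def version(self):
        return len(self.log) - 1  # -1 means the table has no commit yet

    def try_commit(self, read_version, actions):
        # The commit succeeds only if nobody has committed since we read.
        if read_version != self.version():
            raise ConflictError(
                f"read version {read_version}, table is at version {self.version()}"
            )
        self.log.append(actions)
        return self.version()

table = Table()
table.try_commit(-1, ["add a.parquet"])         # version 0
seen_by_a = seen_by_b = table.version()         # both writers read version 0
table.try_commit(seen_by_b, ["add b.parquet"])  # writer B wins: version 1
try:
    table.try_commit(seen_by_a, ["add c.parquet"])
    a_conflicted = False
except ConflictError:
    a_conflicted = True                         # writer A's attempt is rolled back
# Writer A retries against the fresh table state.
final_version = table.try_commit(table.version(), ["add c.parquet"])
print(a_conflicted, final_version)
```

No locks are held while the writers prepare their changes; the check happens only at commit time, and the losing writer re-reads and retries.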
  • 37. Upsert/Merge: Fine-grained Updates
      MERGE INTO customers -- Delta table
      USING updates
      ON customers.customerId = updates.customerId
      WHEN MATCHED THEN
        UPDATE SET address = updates.address
      WHEN NOT MATCHED THEN
        INSERT (customerId, address) VALUES (updates.customerId, updates.address)
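
The semantics of the MERGE statement on slide 37 can be illustrated with plain Python dictionaries. This is a sketch of what the statement means, not how Delta Lake executes it; the `merge` function and the sample rows are assumptions for the example.

```python
def merge(target, updates, key="customerId"):
    """Upsert: update the address of matched rows, insert unmatched ones."""
    merged = {row[key]: dict(row) for row in target}  # copy, so target is untouched
    for row in updates:
        if row[key] in merged:
            # WHEN MATCHED THEN UPDATE SET address = updates.address
            merged[row[key]]["address"] = row["address"]
        else:
            # WHEN NOT MATCHED THEN INSERT (customerId, address)
            merged[row[key]] = {key: row[key], "address": row["address"]}
    return list(merged.values())

customers = [
    {"customerId": 1, "address": "Paris"},
    {"customerId": 2, "address": "Lyon"},
]
updates = [
    {"customerId": 2, "address": "Nice"},
    {"customerId": 3, "address": "Lille"},
]
result = merge(customers, updates)
print(result)
```

Customer 2 is updated in place, customer 3 is inserted, and customer 1 is left as it was.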
  • 38. Ensure Data Quality* (*Coming Soon). Enforce metadata, schema, and quality declaratively. Inserts will fail if data doesn’t respect schema or quality.
      table("warehouse")
        .location(…)                  // Location on DBFS
        .schema(my_schema)            // Optional strict schema checking
        .metastoreName(…)             // Registration in Hive Metastore
        .description(…)               // Human readable description for users
        .expect("validTimestamp",     // Expectations on data quality*
                "timestamp > 2012-01-01 AND …",
                "fail / alert / quarantine")
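
Since the expectation API on slide 38 is announced as coming soon, the following is only a guess at what the three policies could mean, written in plain Python. The function `apply_expectation`, its return shape, and the interpretation of "fail", "alert" and "quarantine" are all assumptions, not a description of the eventual Delta Lake feature.

```python
from datetime import datetime

def apply_expectation(rows, name, predicate, on_violation="fail"):
    """Check rows against a named predicate.

    Assumed policies: "fail" rejects the whole write, "alert" keeps every row
    but reports the violation, "quarantine" sets violating rows aside.
    Returns (accepted_rows, quarantined_rows, alerts).
    """
    good = [row for row in rows if predicate(row)]
    bad = [row for row in rows if not predicate(row)]
    if on_violation == "fail":
        if bad:
            raise ValueError(f"expectation {name!r} violated by {len(bad)} row(s)")
        return good, [], []
    if on_violation == "alert":
        alerts = [f"{name}: {len(bad)} violating row(s)"] if bad else []
        return list(rows), [], alerts
    if on_violation == "quarantine":
        return good, bad, []
    raise ValueError(f"unknown policy: {on_violation!r}")

rows = [
    {"timestamp": datetime(2019, 11, 7)},
    {"timestamp": datetime(2011, 5, 1)},
]

def valid_timestamp(row):
    return row["timestamp"] > datetime(2012, 1, 1)

accepted, quarantined, _ = apply_expectation(
    rows, "validTimestamp", valid_timestamp, "quarantine"
)
_, _, alerts = apply_expectation(rows, "validTimestamp", valid_timestamp, "alert")
print(len(accepted), len(quarantined), alerts)
```

With the "fail" policy the same input would raise instead of writing anything, which matches the slide's statement that inserts fail when data does not respect the declared quality rules.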
  • 39. Unified batch and streaming: concurrent stream/batch with exactly-once processing guarantee. Join stream with table/stream.
  • 40. Time Travel
      Reproduce experiments & reports:
        SELECT count(*) FROM events TIMESTAMP AS OF timestamp
        SELECT count(*) FROM events VERSION AS OF version
        spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
      Rollback accidental bad writes:
        INSERT INTO my_table SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
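
Time travel follows naturally from keeping an ordered log: an older snapshot is the result of replaying the log only up to the requested version or timestamp. The sketch below models that idea in plain Python; the `history` list, the integer timestamps and the two helper functions are assumptions for illustration, not Delta Lake internals.

```python
from bisect import bisect_right

# Hypothetical commit history: (commit timestamp, actions); list index = version.
history = [
    (100, [("add", "1.parquet")]),                           # version 0
    (200, [("add", "2.parquet")]),                           # version 1
    (300, [("remove", "1.parquet"), ("add", "3.parquet")]),  # version 2
]

def files_as_of_version(history, version):
    """Replay the log up to and including the given version."""
    files = set()
    for _, actions in history[: version + 1]:
        for op, path in actions:
            if op == "add":
                files.add(path)
            else:
                files.discard(path)
    return files

def version_as_of_timestamp(history, ts):
    """Latest version committed at or before ts."""
    timestamps = [commit_ts for commit_ts, _ in history]
    index = bisect_right(timestamps, ts) - 1
    if index < 0:
        raise ValueError("timestamp is before the first commit")
    return index

version = version_as_of_timestamp(history, 250)
old_files = files_as_of_version(history, version)
latest_files = files_as_of_version(history, len(history) - 1)
print(version, sorted(old_files), sorted(latest_files))
```

A query "as of" timestamp 250 sees version 1, where 1.parquet is still part of the table even though the latest version has removed it.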
  • 42. Workshop Delta & MLflow: Thursday 7 November, 9:00-12:30. https://dbricks.co/workshop-databricks