A Thorough Comparison of Delta Lake, Iceberg and Hudi

Databricks
Junjie Chen
About Me
▪ Software engineer at Tencent Data Lake Team
▪ Focus on big data area for years
Agenda
Introduction to Delta Lake, Apache Iceberg and Apache Hudi
Key Features Comparison
▪ Transaction
▪ Data mutation
▪ Streaming Support
▪ Schema evolution
Maturity
▪ Tooling
▪ Integration
▪ Performance
Conclusion
What features are expected of a data lake?
▪ Data quality
▪ Transactions (ACID)
▪ Independence of engines
▪ Unified batch & streaming
▪ Pluggable storage
▪ Scalable metadata
▪ Data mutation
Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™
and big data workloads.
Apache Iceberg
A table format for huge analytic datasets that delivers high query performance for tables with tens of
petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.
[Diagram labels: DFS/Cloud Storage; Spark Batch & Streaming; AI & Reporting; Interactive Queries; Streaming; Streaming Analytics]
Apache Hudi
Apache Hudi ingests & manages storage of large analytical datasets over DFS
A Quick Comparison (as of 2020-05)

| Feature | Delta Lake (open source) | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Transaction (ACID) | Y | Y | Y |
| MVCC | Y | Y | Y |
| Time travel | Y | Y | Y |
| Schema evolution | Y | Y | Y |
| Data mutation | Y (update/delete/merge into) | N | Y (upsert) |
| Streaming | Sink and source for Spark Structured Streaming | Sink and source (WIP) for Spark Structured Streaming, Flink (WIP) | DeltaStreamer, HiveIncrementalPuller |
| File format | Parquet | Parquet, ORC, Avro | Parquet |
| Compaction/cleanup | Manual | API available (Spark Action) | Manual and auto |
| Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat |
| Multiple language support | Scala/Java/Python | Java/Python | Java/Python |
| Storage abstraction | Y | Y | N |
| API dependency | Spark-bundled | Native/engine-bundled | Spark-bundled |
| Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer |
Transaction
Delta Lake
▪ Model
  ▪ Transaction Log (DeltaLog)
  ▪ Optimistic concurrency control
  ▪ Checkpoint changes into Parquet
▪ Atomicity Guarantee
  ▪ HDFS rename
  ▪ S3 file write
  ▪ Azure rename without overwrite
▪ Time Travel
  ▪ timestamp
  ▪ version number
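
As a rough illustration of the model above, the following sketch treats the transaction log as a map from version number to a JSON list of actions. `ToyDeltaLog`, `try_commit`, and `files_at` are hypothetical names for this toy, not Delta Lake APIs, and the in-memory put-if-absent check only stands in for the atomic storage operations listed above.

```python
import json


class ToyDeltaLog:
    """Toy in-memory transaction log: one JSON entry per committed version."""

    def __init__(self):
        self._entries = {}  # version number -> JSON string of actions

    def latest_version(self):
        return max(self._entries, default=-1)

    def try_commit(self, read_version, actions):
        """Commit as read_version + 1 only if nobody else has taken that slot."""
        next_version = read_version + 1
        if next_version in self._entries:  # put-if-absent check
            return False  # conflict: the caller must re-read and retry
        self._entries[next_version] = json.dumps(actions)
        return True

    def files_at(self, version):
        """Replay the log up to `version` to get the live file set (time travel)."""
        live = set()
        for v in range(version + 1):
            for action in json.loads(self._entries[v]):
                if action["op"] == "add":
                    live.add(action["path"])
                else:
                    live.discard(action["path"])
        return live


log = ToyDeltaLog()
log.try_commit(-1, [{"op": "add", "path": "part-0.parquet"}])  # version 0

# Two writers both read version 0 and race to write version 1.
first = log.try_commit(0, [{"op": "add", "path": "part-1.parquet"}])
second = log.try_commit(0, [{"op": "add", "path": "part-2.parquet"}])  # loses

# The losing writer re-reads the latest version and retries.
retry = log.try_commit(log.latest_version(), [{"op": "add", "path": "part-2.parquet"}])
```

Replaying the log up to an older version is what makes time travel by version number possible in this toy; a real log would also checkpoint periodically so readers do not replay from version 0.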
Apache Iceberg
▪ Model
  ▪ Snapshot
  ▪ Optimistic concurrency control
▪ Atomicity Guarantee
  ▪ HDFS rename
  ▪ Hive metastore lock
▪ Time Travel
  ▪ snapshot id
  ▪ timestamp
[Figure: snapshots S1 to S4, with markers R and W]
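
The snapshot model and the two time-travel selectors can be sketched with a small in-memory table. This is a toy under stated assumptions: `ToySnapshotTable` and its methods are hypothetical, snapshot IDs are simple counters here, and the identity check in `commit` stands in for whatever atomic swap the catalog provides.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: FrozenSet[str]


class ToySnapshotTable:
    """Toy table whose state is an append-only list of immutable snapshots."""

    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = []

    def current(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def commit(self, base: Optional[Snapshot], added: set, timestamp_ms: int) -> bool:
        """Optimistic commit: succeeds only if `base` is still the current snapshot."""
        if self.current() is not base:
            return False  # another writer committed first; the caller should retry
        old_files = base.files if base else frozenset()
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(
            Snapshot(snapshot_id, timestamp_ms, old_files | frozenset(added))
        )
        return True

    def as_of_id(self, snapshot_id: int) -> Snapshot:
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

    def as_of_time(self, timestamp_ms: int) -> Snapshot:
        """Latest snapshot committed at or before the given timestamp."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return candidates[-1]


table = ToySnapshotTable()
table.commit(None, {"a.parquet"}, timestamp_ms=1000)             # first snapshot
reader_view = table.current()                                     # a reader pins it
table.commit(table.current(), {"b.parquet"}, timestamp_ms=2000)  # writer adds a second
stale = table.commit(reader_view, {"c.parquet"}, timestamp_ms=3000)  # stale base: rejected
```

Because snapshots are immutable, the pinned `reader_view` keeps seeing only `a.parquet` while writers move the table forward.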
Apache Hudi
▪ Model
  ▪ Timeline
  ▪ Optimistic concurrency control
▪ Atomicity Guarantee
  ▪ HDFS rename
▪ Time Travel
  ▪ _hoodie_commit_time
Data Mutation
Delta Lake
▪ Copy on Write mode
  ▪ Step 1: find the files to delete according to the filter expression
  ▪ Step 2: load the files as a DataFrame and update column values in the matching rows
  ▪ Step 3: save the DataFrame to new files
  ▪ Step 4: log the files to remove and add as JSON, then commit to the table
▪ Table level APIs
  ▪ update, delete (condition based)
  ▪ merge into (upsert a source into target table)
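
The four copy-on-write steps above can be sketched on plain Python data. This is a minimal model, not Delta Lake code: `copy_on_write_update` is a hypothetical helper, files are dictionaries of row lists rather than Parquet files, and the `.rewritten` suffix is an arbitrary naming choice for the example.

```python
def copy_on_write_update(files, predicate, update):
    """Apply `update` to rows matching `predicate`, rewriting whole files.

    files: dict mapping file path -> list of row dicts.
    Returns (new_files, actions), where actions is the log entry to commit.
    """
    # Step 1: find the files that contain at least one matching row.
    touched = [p for p, rows in files.items() if any(predicate(r) for r in rows)]

    # Untouched files are carried over as-is.
    new_files = {p: rows for p, rows in files.items() if p not in touched}
    actions = []
    for path in touched:
        # Step 2: load the rows and update the matching ones.
        rewritten = [update(dict(r)) if predicate(r) else dict(r) for r in files[path]]
        # Step 3: save the result as a new file.
        new_path = f"{path}.rewritten"
        new_files[new_path] = rewritten
        # Step 4: record which file is removed and which is added.
        actions.append({"remove": path})
        actions.append({"add": new_path})
    return new_files, actions


files = {
    "f1": [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}],
    "f2": [{"id": 3, "status": "open"}],
}
new_files, actions = copy_on_write_update(
    files,
    predicate=lambda r: r["id"] == 2,
    update=lambda r: {**r, "status": "closed"},
)
```

Note that row 1 is copied into the new file even though it did not change; that write amplification is the cost of copy on write.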
Apache Hudi
▪ Copy on Write table
  ▪ Step 1: read the existing records from Parquet
  ▪ Step 2: merge the existing records with the incoming update records
  ▪ Step 3: write the merged records to files
  ▪ Step 4: commit to the table (CommitActionExecutor)
▪ Merge on Read table
  ▪ Store delta records in an Avro-format log file
  ▪ Scheduled compaction
▪ Indexing
  ▪ Maps the Hudi record key (in a metadata column) to file group and file id
  ▪ In-memory, bloom filter and HBase
▪ Table level APIs
  ▪ upsert
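
To make the Merge on Read idea concrete, here is a small sketch of one file group. It is a toy, not Hudi code: `ToyMergeOnReadFileGroup` is a hypothetical class, a dict stands in for the Parquet base file, and a list stands in for the Avro log file.

```python
class ToyMergeOnReadFileGroup:
    """Toy file group: a base 'file' plus an append-only log of delta records."""

    def __init__(self, base_records):
        self.base = dict(base_records)  # record key -> row (stand-in for the base file)
        self.log = []                   # appended upserts (stand-in for the log file)

    def upsert(self, key, row):
        # Writes are cheap: append to the log, never rewrite the base.
        self.log.append((key, row))

    def read(self):
        # Reads pay the merge cost: replay the log over the base, in order.
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        # Compaction folds the log into a new base and empties the log.
        self.base = self.read()
        self.log = []


group = ToyMergeOnReadFileGroup({"k1": {"v": 1}, "k2": {"v": 2}})
group.upsert("k2", {"v": 20})   # update an existing key
group.upsert("k3", {"v": 3})    # insert a new key
before = group.read()
pending = len(group.log)
group.compact()
after = group.read()
```

The read result is the same before and after compaction; only where the merge work happens changes.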
Apache Iceberg
▪ Copy on Write mode
  ▪ File-level overwrite APIs available
▪ Merge on Read mode
  ▪ Position-based delete files and equality-based delete files
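
The difference between the two kinds of delete files can be shown with a short sketch. `apply_deletes` is a hypothetical helper, and the model is simplified: it applies every equality delete to every data file, whereas a real table format also has to decide which deletes apply to which files.

```python
def apply_deletes(data_files, position_deletes, equality_deletes):
    """Return the live rows after applying both kinds of deletes at read time.

    data_files: dict mapping file path -> list of row dicts.
    position_deletes: set of (file path, row position) pairs.
    equality_deletes: list of dicts; a row is deleted if it matches
        every column/value pair of any one dict.
    """
    live = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            # Position-based delete: identified by file and row offset.
            if (path, pos) in position_deletes:
                continue
            # Equality-based delete: identified by column values.
            if any(all(row.get(c) == v for c, v in eq.items()) for eq in equality_deletes):
                continue
            live.append(row)
    return live


data_files = {
    "d1": [{"id": 1}, {"id": 2}, {"id": 3}],
    "d2": [{"id": 4}, {"id": 5}],
}
live = apply_deletes(
    data_files,
    position_deletes={("d1", 0)},   # drop the first row of d1
    equality_deletes=[{"id": 5}],   # drop any row where id == 5
)
```

A writer that already knows where a row lives can emit a position delete; a writer that only knows the key (for example, a streaming upsert) can emit an equality delete without first locating the row.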
Streaming Support
Delta Lake
▪ Deeply integrated with Spark Structured Streaming
  ▪ As a streaming source
    ▪ Streaming control: maxBytesPerTrigger, maxFilesPerTrigger
    ▪ Does not handle non-append changes unless ignoreDeletes or ignoreChanges is set
  ▪ As a streaming sink
    ▪ Append mode
    ▪ Complete mode
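
The effect of the two streaming controls can be approximated with a toy micro-batch planner. This is only a sketch of the idea: `plan_micro_batches` is a hypothetical function, not how Delta Lake plans batches, and treating the byte limit as a soft cap (a batch closes once the limit has been reached) is an assumption of this example.

```python
def plan_micro_batches(files, max_files_per_trigger, max_bytes_per_trigger=None):
    """Group (path, size_in_bytes) pairs into micro-batches.

    A batch is closed when it already holds max_files_per_trigger files,
    or when its accumulated size has reached max_bytes_per_trigger.
    """
    batches = []
    current, current_bytes = [], 0
    for path, size in files:
        over_files = len(current) >= max_files_per_trigger
        over_bytes = (
            max_bytes_per_trigger is not None
            and bool(current)
            and current_bytes >= max_bytes_per_trigger
        )
        if over_files or over_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


new_files = [("a", 10), ("b", 10), ("c", 10), ("d", 10), ("e", 10)]
by_count = plan_micro_batches(new_files, max_files_per_trigger=2)
by_bytes = plan_micro_batches(new_files, max_files_per_trigger=10, max_bytes_per_trigger=25)
```

With the file limit of 2 the five files split into three batches; with the 25-byte soft cap the first batch takes three files because the cap is only checked before adding the next file.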
Apache Hudi
▪ DeltaStreamer
  ▪ Exactly-once ingestion of new events from Kafka
  ▪ Supports JSON, Avro or custom record types
  ▪ Manages checkpoints, rollback & recovery
  ▪ Supports plugging in transformations
▪ Incremental Queries
  ▪ HiveIncrementalPuller
  ▪ As Spark data source (beginInstantTime)
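
An incremental query can be thought of as a filter on the commit-time metadata column. The sketch below is a toy on plain dictionaries: `incremental_pull` is a hypothetical function, the instant times are made-up timestamp strings, and treating the begin instant as exclusive is an assumption of this example.

```python
def incremental_pull(rows, begin_instant_time, end_instant_time=None):
    """Return rows committed after begin_instant_time (exclusive) and,
    if given, at or before end_instant_time (inclusive).

    Instant times are fixed-width timestamp strings, so plain string
    comparison orders them chronologically.
    """
    return [
        r
        for r in rows
        if r["_hoodie_commit_time"] > begin_instant_time
        and (end_instant_time is None or r["_hoodie_commit_time"] <= end_instant_time)
    ]


rows = [
    {"_hoodie_commit_time": "20200501090000", "key": "a"},
    {"_hoodie_commit_time": "20200502090000", "key": "b"},
    {"_hoodie_commit_time": "20200503090000", "key": "c"},
]

# A downstream job that last processed the first commit pulls only newer rows.
changed = incremental_pull(rows, begin_instant_time="20200501090000")
bounded = incremental_pull(rows, "20200501090000", end_instant_time="20200502090000")
```

The consumer stores the last instant it processed and passes it as the next begin time, which is what turns a table into a change stream.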
Apache Iceberg
▪ Supports Spark Structured Streaming
  ▪ As streaming source (WIP)
    ▪ Rate limit: max-files-per-batch
    ▪ Offset range
  ▪ As streaming sink
    ▪ Append mode
    ▪ Complete mode
▪ Supports Flink (WIP)
Table Schema Evolution
▪ Delta Lake
  ▪ Uses the Spark schema
  ▪ Allows schema merge and overwrite
▪ Apache Hudi
  ▪ Uses the Spark schema
  ▪ Supports adding new fields in a stream; deleting columns is not allowed
▪ Apache Iceberg
  ▪ Independent ID-based schema abstraction
  ▪ Full schema evolution and partition evolution
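
Why an ID-based schema helps can be seen in a few lines. The sketch below is a toy, not Iceberg code: `ToySchema` is a hypothetical class, and rows are stored as `{field id: value}` dictionaries to mimic data files that reference columns by ID rather than by name.

```python
class ToySchema:
    """Toy schema that tracks columns by a never-reused field ID."""

    def __init__(self):
        self.fields = {}   # field id -> current column name
        self._next_id = 1

    def add(self, name):
        field_id = self._next_id
        self._next_id += 1
        self.fields[field_id] = name
        return field_id

    def _id_of(self, name):
        return next(fid for fid, n in self.fields.items() if n == name)

    def rename(self, old, new):
        # Only metadata changes; stored data is untouched.
        self.fields[self._id_of(old)] = new

    def drop(self, name):
        del self.fields[self._id_of(name)]

    def project(self, stored_row):
        """Read a row stored as {field id: value} under the current schema."""
        return {name: stored_row.get(fid) for fid, name in self.fields.items()}


schema = ToySchema()
schema.add("id")
schema.add("name")
stored_row = {1: 7, 2: "alice"}   # written before any schema change

schema.rename("name", "full_name")
after_rename = schema.project(stored_row)

schema.drop("full_name")
schema.add("name")                 # gets a fresh ID (3), not the old ID 2
after_readd = schema.project(stored_row)
```

The rename keeps the old data readable under the new name, and the re-added `name` column reads as empty instead of resurrecting the dropped column's values, which is the failure mode of purely name-based schemas.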
Maturity
Integrations
▪ Delta Lake
  ▪ DSv1
  ▪ Delta.io connectors enable Apache Hive and Presto
▪ Apache Iceberg
  ▪ DSv2, InputFormat, Hive StorageHandler (WIP)
  ▪ Flink sink (WIP)
▪ Apache Hudi
  ▪ InputFormat, DSv1
  ▪ DeltaStreamer for data ingestion
Query Performance Optimization
▪ Delta Lake
  ▪ Vectorization from Spark
  ▪ Data skipping via statistics from Parquet
  ▪ Vacuum, optimize
▪ Apache Hudi
  ▪ Vectorization from Spark
  ▪ Data skipping via statistics from Parquet
  ▪ Auto compaction
▪ Apache Iceberg
  ▪ Predicate pushdown
  ▪ Native vectorized reader (WIP)
  ▪ Statistics from Iceberg manifest files
  ▪ Hidden partitioning
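
Data skipping, whether the statistics come from Parquet footers or from manifest files, boils down to a range-overlap test per file. The sketch below assumes a hypothetical manifest layout (a list of dicts with per-column `min` and `max`) and a hypothetical `prune_files` helper.

```python
def prune_files(manifest, column, lower, upper):
    """Keep only files whose [min, max] range for `column` overlaps [lower, upper].

    Files outside the range cannot contain matching rows, so they are
    skipped without being opened.
    """
    return [
        entry["path"]
        for entry in manifest
        if entry["stats"][column]["max"] >= lower
        and entry["stats"][column]["min"] <= upper
    ]


manifest = [
    {"path": "f1.parquet", "stats": {"ts": {"min": 0, "max": 99}}},
    {"path": "f2.parquet", "stats": {"ts": {"min": 100, "max": 199}}},
    {"path": "f3.parquet", "stats": {"ts": {"min": 200, "max": 299}}},
]

# Query: WHERE ts BETWEEN 150 AND 220
to_scan = prune_files(manifest, "ts", lower=150, upper=220)
```

Keeping these statistics in table metadata lets the planner prune files before touching any data file, rather than opening each file footer to find out.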
Tooling
▪ Delta Lake
  ▪ CLI: VACUUM, HISTORY, GENERATE, CONVERT TO
▪ Apache Iceberg
  ▪ Metadata visible as tables
  ▪ Built-in catalog service, enabling DDL and DML support in Spark 3.0
▪ Apache Hudi
  ▪ CLI with auxiliary commands (inspecting, viewing, statistics, compaction, etc.)
  ▪ DeltaStreamer, HiveIncrementalPuller, HoodieDeltaStreamer
Conclusion
▪ Delta Lake has the best integration with the Spark ecosystem and can
be used out of the box.
▪ Apache Iceberg has a strong design and abstractions that open up
more potential.
▪ Apache Hudi provides the most conveniences for stream processing.
Thank You & Questions