Migrating to Apache Iceberg
Presented by Alex Merced - Dremio Developer Advocate - Follow @amdatalakehouse
What will we learn?
- Iceberg’s Structure
- What Catalog to use?
- The migrate procedure
- The add_files procedure
- Migrate with CTAS Statements
- Summary
Apache Iceberg
Apache Iceberg’s approach is to define the table through three layers of metadata. These layers are:
- metadata files that define the table
- manifest lists that define a snapshot of the table, with a list of the manifests that make up the snapshot and metadata about their data
- manifests, which list the data files along with metadata on those data files for file pruning
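Each of these layers can be inspected directly from an engine. A minimal Spark SQL sketch, assuming an Iceberg catalog named catalog_name and a table db.sample (both illustrative names); Iceberg exposes each layer as a queryable metadata table:

-- Snapshots of the table (each points to a manifest list)
SELECT * FROM catalog_name.db.sample.snapshots;
-- Manifests that make up the current snapshot
SELECT * FROM catalog_name.db.sample.manifests;
-- Data files, with the per-column stats used for file pruning
SELECT * FROM catalog_name.db.sample.files;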
Iceberg table format
● Overview of the components
● Summary of the read path (SELECT)
[Diagram: read path through the metadata layer and data layer]
Iceberg components: Catalog
● A store that houses the current metadata pointer for Iceberg tables
● Mapping of table name to the location of the current metadata file
● Must support atomic operations for updating the current metadata pointer (e.g. HDFS, HMS, Nessie)
[Diagram: catalog holding table1’s current metadata pointer, above the metadata layer and data layer]
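How an engine reaches a catalog is configuration. A minimal sketch of Spark configuration properties for a Hive Metastore-backed Iceberg catalog; the catalog name my_catalog, the metastore host, and the warehouse path are illustrative assumptions:

spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type = hive
spark.sql.catalog.my_catalog.uri = thrift://metastore-host:9083
spark.sql.catalog.my_catalog.warehouse = s3://bucket/warehouse

Tables in that catalog are then addressed as my_catalog.db.table in Spark SQL.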
Iceberg components: Metadata File
Metadata file - stores metadata about a table at a certain point in time
{
  "table-uuid" : "<uuid>",
  "location" : "/path/to/table/dir",
  "schema": {...},
  "partition-spec": [ {<partition-details>}, ...],
  "current-snapshot-id": <snapshot-id>,
  "snapshots": [ {
    "snapshot-id": <snapshot-id>,
    "manifest-list": "/path/to/manifest/list.avro"
  }, ...],
  ...
}
[Diagram: metadata layer and data layer]
Iceberg components: Manifest List
Manifest list file - a list of manifest files
{
"manifest-path" : "/path/to/manifest/file.avro",
"added-snapshot-id": <snapshot-id>,
"partition-spec-id": <partition-spec-id>,
"partitions": [ {partition-info}, ...],
...
}
{
"manifest-path" : "/path/to/manifest/file2.avro",
"added-snapshot-id": <snapshot-id>,
"partition-spec-id": <partition-spec-id>,
"partitions": [ {partition-info}, ...],
...
}
[Diagram: metadata layer and data layer]
Iceberg components: Manifest file
Manifest file - a list of data files, along with details and stats about each data file
{
  "data-file": {
    "file-path": "/path/to/data/file.parquet",
    "file-format": "PARQUET",
    "partition": {"<part-field>":{"<data-type>":<value>}},
    "record-count": <num-records>,
    "null-value-counts": [{
      "column-index": "1", "value": 4
    }, ...],
    "lower-bounds": [{
      "column-index": "1", "value": "aaa"
    }, ...],
    "upper-bounds": [{
      "column-index": "1", "value": "eee"
    }, ...],
  }
  ...
}
{
  ...
}
[Diagram: metadata layer and data layer]
Iceberg Catalogs
What can be used as an Iceberg catalog?
Catalogs help track Iceberg tables and provide locking mechanisms for ACID guarantees. One thing to keep in mind is that while many engines may support Iceberg tables, they may not support connections to all catalogs.

Type of Catalog | Pros | Cons
Project Nessie | Git-like functionality; cloud managed service (Arctic); support from engines beyond Spark & Dremio | If not using Arctic, you must deploy and maintain a Nessie server
Hive Metastore | Can use an existing Hive Metastore | You have to deploy and maintain a Hive Metastore
AWS Glue | Interop with AWS services | Limited support outside of AWS, Spark and Dremio

- HDFS, JDBC and REST catalogs are also available
- With 0.14 there will be a registerTable method to migrate tables between catalogs without losing snapshot history (see the sketch below)
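In Spark, registering an existing table's metadata in another catalog surfaces as a call procedure. A minimal sketch, assuming Iceberg 0.14+ with a configured target catalog; the catalog name, table name, and metadata path are illustrative:

-- Register an existing Iceberg table's current metadata file in another catalog;
-- the snapshot history travels with the metadata file.
CALL target_catalog.system.register_table(
  table => 'db.tbl',
  metadata_file => 's3://bucket/path/to/metadata/v3.metadata.json'
)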
Migrating to Iceberg
Existing Parquet files that are part of a Hive Table
Apache Iceberg has a call procedure named migrate which can be used to create a new Iceberg table from an existing Hive table. This uses the existing data files in the table (they must be Parquet). It REPLACES the Hive table with an Iceberg table, so the Hive table will no longer exist after this operation:

CALL catalog_name.system.migrate('db.sample')

To test the results of migrate before running it, you can use the snapshot call procedure to create a temporary Iceberg table based on a Hive table without replacing the Hive table (temporary tables have limitations):

CALL catalog_name.system.snapshot('db.sample', 'db.snap')
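Put together, a cautious migration might look like the following. A hedged sketch, assuming a Spark session with an Iceberg catalog named catalog_name; the row-count check is only a basic sanity test:

-- 1. Non-destructive trial: snapshot the Hive table as a temporary Iceberg table
CALL catalog_name.system.snapshot('db.sample', 'db.snap');
-- 2. Validate the trial table (e.g. compare row counts, run sample queries)
SELECT COUNT(*) FROM catalog_name.db.snap;
-- 3. Drop the trial table (snapshot tables do not own the source data files)
DROP TABLE catalog_name.db.snap;
-- 4. Run the real, destructive migration
CALL catalog_name.system.migrate('db.sample');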
Parquet files not part of a Hive table
If you have Parquet files representing a dataset that are not yet part of an Iceberg table, you can use the add_files procedure to add those data files to an existing Iceberg table with a matching schema.
CALL spark_catalog.system.add_files(
table => 'db.tbl',
source_table => '`parquet`.`path/to/table`'
)
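Note that add_files does not create the target table; it must already exist. A minimal sketch, assuming Spark SQL; the table db.tbl, its columns, and the optional partition_filter value are illustrative:

-- The target Iceberg table must already exist with a matching schema
CREATE TABLE spark_catalog.db.tbl (id bigint, data string, part string)
USING iceberg
PARTITIONED BY (part);

-- Import only one partition of the source Parquet dataset
CALL spark_catalog.system.add_files(
  table => 'db.tbl',
  source_table => '`parquet`.`path/to/table`',
  partition_filter => map('part', 'A')
);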
Migration by Restating the Data
You can use CREATE TABLE … AS SELECT (CTAS) statements to create new Iceberg tables from pre-existing tables. This creates new data files, but it allows you to modify partitioning and schema, and to do validations at the time of migration if you choose to. Since the data is being rewritten/restated, this will be more time consuming than migrate or add_files. (This can be done with Spark or Dremio.)
CREATE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), truncate(last_name, 2))
AS SELECT ...
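Because CTAS restates the data, it pays to validate the result. A minimal sketch, assuming Spark SQL and a source table source_db.sample (an illustrative name); comparing row counts is a basic first check:

-- Basic validation: the restated table should match the source row count
SELECT (SELECT COUNT(*) FROM source_db.sample) AS source_rows,
       (SELECT COUNT(*) FROM prod.db.sample)   AS iceberg_rows;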
When to use which approach?

Approach | Data files rewritten? | Can I update schema and partitioning while migrating? | When to use
migrate | No | No | Moving an existing Hive table to a NEW Iceberg table
add_files | No | No | Moving files from an existing Parquet table into an existing Iceberg table
CTAS | Yes | Yes | Moving an existing table from anywhere to a NEW Iceberg table, with schema or partitioning changes
registerTable (to be released with 0.14) | No | No | Moving an Iceberg table from one catalog to another
Best Practices
1. As the process begins, the new Iceberg table is not yet created or in sync with the source. User-facing read and write operations continue to operate on the source table.
2. The table is created but not fully in sync. Read operations are applied to the source table, and write operations are applied to both the source and the new table.
3. Once the new table is in sync, you can switch read operations to the new table. Until you are certain the migration is successful, continue to apply write operations to both the source and the new table.
4. When everything is tested, synced, and working properly, you can apply all read and write operations to the new Iceberg table and retire the source table.
Q&A