Getting Started with Delta Lake on Databricks

Knoldus Inc.
Presented By:
Raviyanshu Singh
Software Consultant
Getting Started With
Delta Lake
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode. Feel free to step out of the
session if you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chitchat during
the session.
Our Agenda
01 Why Delta Lake?
02 Data Warehouse
03 Data Lake
04 Possible Solution
05 Delta Lake
06 Demo
Why Delta Lake?
Streaming Systems
Data arrives through systems
such as Apache Kafka or Amazon Kinesis.
Data Lakes
Data is stored for long periods in a
data lake, which is optimized for large
scale and low cost.
Data Warehouse
High-value data is stored here, in a system
optimized for high concurrency and reliability.
A modern data architecture uses a
blend of at least these three
types of systems.
Data
Architecture
Data Warehouse
● A data management system that stores current and
historical data from multiple sources in a business-friendly
manner for easier insights and reporting.
● Data warehouses are typically used for business
intelligence (BI), reporting and data analysis.
Limitations
➔ No support for video, audio, text
➔ No support for data science, ML
➔ Limited support for streaming
➔ Closed & proprietary formats
ETL
(Extract Transform Load)
Data Source
Data Lake
● A central location that holds a large amount of data in its
native, raw format.
● Holds unstructured and semi-structured data such as photos, video,
audio, and documents, which are essential for today’s machine
learning and advanced analytics use cases.
Limitations
➔ Poor BI support
➔ Complex to set up
➔ Poor performance
➔ Lack of security features
➔ Reliability issues
What’s the Solution?
A combination of DW & DL
Structured &
Unstructured Data
Data Lake
ETL
Metadata, Caching &
Indexing Layer
Data Validation
Data Warehousing
Reports, BI & Data
Science
Data Lakehouse
A system which merges the flexibility, low cost, and scale of
a data lake with the data management and ACID
transactions of data warehouses, addressing the limitations
of both.
Benefits
➔ No need to keep one copy of the data in a data lake and another
copy in a data warehouse
➔ Cost savings in infrastructure, staff, and
consulting overhead
➔ Scalability through the underlying cloud storage
➔ Reliability through ACID transactions
What is Delta Lake?
● Delta Lake is a file-based, open-source metadata layer
that enables building a Lakehouse architecture on top of
data lakes.
● It can run on existing data lakes and is fully compatible
with processing engines like Apache Spark.
With Delta Lake:
➔ Scalable metadata handling
➔ ACID transactions
➔ Streaming and batch unification
➔ Time travel (query an older snapshot of a Delta table)
➔ Schema enforcement
The Medallion Architecture
Ingestion Tables
● No business rules or transformations of any kind
● Should be fast and easy to get new data to this layer
Refined Tables
● Prioritize speed to market and write performance, with just
enough transformations
● Quality data expected
Feature/Agg Data Store
● Prioritize business use cases and user experience
● Precalculated, business-specific transformations
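The three layers can be illustrated with a small, plain-Python sketch. It is only a toy: the names bronze, silver and gold are the conventional medallion labels (the slide calls them ingestion, refined and feature/aggregate tables), and the sample records and the helper functions to_silver and to_gold are hypothetical.

```python
from collections import defaultdict

# Ingestion layer: raw records stored exactly as they arrived.
bronze = [
    {"user": " alice ", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},
    {"user": "alice", "amount": "4.5"},
]


def to_silver(rows):
    """Refined layer: just enough cleaning to make the data trustworthy."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop records that fail the quality check
        cleaned.append({"user": row["user"].strip(), "amount": amount})
    return cleaned


def to_gold(rows):
    """Feature/aggregate layer: precalculated, business-specific totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Each layer reads only from the one before it, so the raw data stays untouched and can be reprocessed if a cleaning rule changes.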
Features of Delta Lake
ACID Transactions
Transactions on the data lake, made through a processing
engine, are committed durably and
exposed to other readers atomically.
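One way to picture atomic commits is a log directory where each version is a separate file that can only be created once. The sketch below is a toy model under that assumption, not Delta Lake's implementation; the functions commit and latest_version and the file names are hypothetical. It borrows only the idea of zero-padded, numbered JSON commit files.

```python
import json
import os
import tempfile


def commit(log_dir: str, version: int, actions: list) -> str:
    """Write one commit file; fail if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" creates the file exclusively, so a second writer racing for
    # the same version gets FileExistsError instead of overwriting it.
    with open(path, "x", encoding="utf-8") as handle:
        for action in actions:
            handle.write(json.dumps(action) + "\n")
    return path


def latest_version(log_dir: str) -> int:
    """Return the highest committed version, or -1 for an empty log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-0000.parquet"}])
commit(log_dir, 1, [{"add": "part-0001.parquet"}])

try:
    commit(log_dir, 1, [{"add": "part-conflict.parquet"}])
    conflict_detected = False
except FileExistsError:
    conflict_detected = True

print(latest_version(log_dir), conflict_detected)
```

A reader that only trusts files present in the log sees either all of a commit or none of it. The toy ignores partially written files, which a real system also has to handle.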
Audit History
The transaction log enables a full audit trail
of every change made to the data.
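The audit trail and the time travel feature both follow from keeping an ordered log of changes. A minimal sketch, assuming a log made of "add" and "remove" file actions; the log contents and the functions snapshot and history are hypothetical and much simpler than a real transaction log:

```python
def snapshot(log: list, version: int) -> set:
    """Replay commits 0..version and return the set of live data files."""
    files = set()
    for entry in log[: version + 1]:
        for action in entry["actions"]:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files


def history(log: list) -> list:
    """List (version, operation) pairs, oldest first."""
    return [(version, entry["operation"]) for version, entry in enumerate(log)]


log = [
    {"operation": "WRITE", "actions": [{"add": "a.parquet"}]},
    {"operation": "WRITE", "actions": [{"add": "b.parquet"}]},
    {"operation": "DELETE", "actions": [{"remove": "a.parquet"}, {"add": "c.parquet"}]},
]

print(history(log))
print(sorted(snapshot(log, 1)))  # the table as it looked at version 1
print(sorted(snapshot(log, 2)))  # the table after the delete
```

On a real Delta table the same ideas are exposed through the DESCRIBE HISTORY command and the VERSION AS OF clause in SQL.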
Schema Enforcement
Automatically enforces the schema
when writing data to and reading data
from the lake.
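Schema enforcement on write can be sketched as a check that runs before any row is appended. This is a toy model; SchemaMismatch, validate and append are hypothetical names, and the schema is a plain dictionary of column name to Python type.

```python
class SchemaMismatch(Exception):
    """Raised when a row does not match the table schema."""


def validate(row: dict, schema: dict) -> None:
    extra = set(row) - set(schema)
    if extra:
        raise SchemaMismatch(f"unexpected columns: {sorted(extra)}")
    for column, expected_type in schema.items():
        value = row.get(column)
        if value is not None and not isinstance(value, expected_type):
            raise SchemaMismatch(
                f"column {column!r} expects {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )


def append(table: list, rows: list, schema: dict) -> None:
    # Validate the whole batch first so a bad batch is rejected as a unit.
    for row in rows:
        validate(row, schema)
    table.extend(rows)


schema = {"id": int, "name": str}
table = []
append(table, [{"id": 1, "name": "a"}], schema)

try:
    append(table, [{"id": 2, "name": "b"}, {"id": "3", "name": "c"}], schema)
    rejected = False
except SchemaMismatch:
    rejected = True

print(len(table), rejected)
```

Because validation happens before the write, the second batch leaves the table unchanged even though its first row was valid.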
Unification of Batch and Streaming
A table in Delta Lake is a batch table as well
as a streaming source and sink.
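The idea that one table serves both batch and streaming readers can be sketched with a versioned, append-only table: a batch reader scans every commit, while a streaming reader remembers the last version it saw and asks only for newer commits. ToyTable and its methods are hypothetical names for illustration.

```python
class ToyTable:
    """An append-only table stored as a list of commits."""

    def __init__(self):
        self.commits = []  # each commit is a list of rows

    def append(self, rows):
        self.commits.append(list(rows))

    def read_batch(self):
        """Batch view: every row committed so far."""
        return [row for commit in self.commits for row in commit]

    def read_stream(self, start_version):
        """Streaming view: rows committed at or after start_version,
        plus the version to resume from next time."""
        new_rows = [row for commit in self.commits[start_version:] for row in commit]
        return new_rows, len(self.commits)


events = ToyTable()
events.append([1, 2])
first_rows, offset = events.read_stream(0)

events.append([3])
second_rows, offset = events.read_stream(offset)

print(first_rows, second_rows, events.read_batch())
```

Both readers work from the same commits, so there is no separate copy of the data for the streaming path.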
Full DML Support
Supports DML operations such as deletes and updates,
as well as complex merge (upsert)
scenarios.
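The semantics of a merge (upsert) can be shown with dictionaries keyed by the join column: matching rows are updated, and non-matching rows are inserted. The function merge and the sample data are hypothetical; this sketches the behaviour, not how Delta Lake executes it.

```python
def merge(target: dict, source: list, key: str):
    """Upsert source rows into target; return the result and the counts."""
    merged = dict(target)
    updated = inserted = 0
    for row in source:
        row_key = row[key]
        if row_key in merged:
            merged[row_key] = {**merged[row_key], **row}  # matched: update
            updated += 1
        else:
            merged[row_key] = dict(row)  # not matched: insert
            inserted += 1
    return merged, updated, inserted


target = {
    1: {"id": 1, "city": "Pune"},
    2: {"id": 2, "city": "Noida"},
}
source = [
    {"id": 2, "city": "Delhi"},
    {"id": 3, "city": "Mumbai"},
]

result, updated, inserted = merge(target, source, key="id")
print(result)
print(updated, inserted)
```

On a Delta table the equivalent operation is expressed with the SQL MERGE INTO statement and its WHEN MATCHED and WHEN NOT MATCHED clauses.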
Metadata Support & Scaling
Leverages Spark's distributed processing
power to handle all the metadata for
petabyte-scale tables with billions of files
with ease.
Getting Started With
1 Delta Lake with Spark-Shell
2 Delta Lake in PySpark
3 Delta Lake on Databricks
4 Hello Delta Lake
Demo
Delta Lake
Best Practices
Choose the right partition column:
If the cardinality of a column will be very high, do
not use that column for partitioning.
Amount of data in each partition: partition by a column only if
each partition is expected to hold at least 1 GB of data.
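A rough check for a candidate partition column can be sketched as follows. It assumes the 1 GB figure is a minimum size per partition; the row counts, the average row size and the helper names are hypothetical values chosen only for illustration.

```python
GIB = 1024 ** 3  # bytes in one gibibyte, used here as the "1 GB" threshold


def partition_sizes(rows_per_value: dict, avg_row_bytes: int) -> dict:
    """Estimate the bytes each partition would hold for a candidate column."""
    return {value: count * avg_row_bytes for value, count in rows_per_value.items()}


def is_reasonable_partition_column(
    rows_per_value: dict, avg_row_bytes: int, min_bytes: int = GIB
) -> bool:
    """True when every estimated partition reaches the minimum size."""
    sizes = partition_sizes(rows_per_value, avg_row_bytes)
    return bool(sizes) and min(sizes.values()) >= min_bytes


# Hypothetical row counts: a low-cardinality date column vs. a user id column.
by_date = {"2023-01-01": 20_000_000, "2023-01-02": 22_000_000}
by_user = {"u1": 60, "u2": 45, "u3": 80}

date_ok = is_reasonable_partition_column(by_date, avg_row_bytes=100)
user_ok = is_reasonable_partition_column(by_user, avg_row_bytes=100)
print(date_ok, user_ok)
```

A high-cardinality column such as a user id produces many tiny partitions, which is the situation the guideline warns against.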
Improve performance of Delta Lake merge
Compact Files
A large number of small files should be rewritten
into a smaller number of larger files on a regular
basis. This is known as compaction.
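Planning a compaction amounts to grouping small files so that each rewritten file approaches a target size. The sketch below uses a simple greedy grouping; plan_compaction, the file sizes and the 64 MB target are hypothetical and not taken from Delta Lake.

```python
def plan_compaction(file_sizes_mb: list, target_mb: int) -> list:
    """Greedily group files so each group's total stays at or under target_mb.

    A single file larger than target_mb ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [10, 40, 25, 5, 60, 20]
plan = plan_compaction(small_files, target_mb=64)
print(len(small_files), "files ->", len(plan), "files")
print(plan)
```

The grouping is not optimal (a smarter packer could do better), but it shows the effect: fewer, larger files for readers to open. On Delta tables, compaction is available through the OPTIMIZE command.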
Enhanced checkpoints for low-latency
queries
Replace the content or schema of the
table.
Sometimes you may want to replace a Delta table.
Spark Caching
Difference between Delta Lake and
Parquet on Apache Spark
Thank You!