2. Lack of etiquette and manners is a huge turn-off.
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Keep your mobile devices in silent
mode. Feel free to step out of the
session in case you need to attend
an urgent call.
3. Our Agenda
01 Why Delta Lake?
02 Data Warehouse
03 Data Lake
04 Possible Solution
4. Why Delta Lake?
Data sources arrive through streaming systems
like Apache Kafka or Amazon Kinesis.
Data is stored for long periods of time in a
data lake, where it is optimized for large
scale and low cost.
Valuable data is then stored in data warehouses, which are
optimized for high concurrency & reliability.
The modern data architecture uses a
blend of at least these three different
types of systems.
5. Data Warehouse
● A data management system that stores current and
historical data from multiple sources in a
business-friendly manner for easier insights and reporting.
● Data warehouses are typically used for business
intelligence (BI), reporting and data analysis.
➔ No support for video, audio, text
➔ No support for data science, ML
➔ Limited support for streaming
➔ Closed & proprietary
(Extract Transform Load)
6. Data Lake
● A central location that holds a large amount of data in its
native, raw format.
● Stores unstructured and semi-structured data like photos, video,
audio, and documents, which is essential for today’s machine
learning and advanced analytics use cases.
➔ Poor BI support
➔ Complex to set up
➔ Lack of security features
7. What’s the Solution?
A combination of DW & DL
Metadata, Caching &
Reports, BI & Data
8. Data Lakehouse
A system which merges the flexibility, low cost, and scale of
a data lake with the data management and ACID
transactions of data warehouses, addressing the limitations
➔ No need to keep one copy of the data in a data lake and another
copy in a data warehouse
➔ Cost savings, both in infrastructure and staff
➔ Scalability through the underlying cloud storage
➔ Reliability through ACID transactions
9. What is Delta Lake?
● Delta Lake is a file-based, open-source metadata layer
that enables building a Lakehouse architecture on top of
● It can run on existing data lakes and is fully compatible
with processing engines like Apache Spark
With Delta Lake:
➔ Scalable metadata handling
➔ Streaming and batch unification
➔ Time travel (query an older snapshot of a Delta table)
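Time travel can be pictured as replaying a transaction log up to a chosen version. The sketch below is a toy in-memory model, not Delta Lake's API: `ToyDeltaLog`, `commit`, and `snapshot` are hypothetical names, and it assumes each commit simply records which data files were added and removed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaLog:
    """Toy stand-in for a table's transaction log (hypothetical, not the real API)."""
    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Append one commit and return its version number."""
        self.commits.append({"add": set(add), "remove": set(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version to get the files visible at that version."""
        if version is None:
            version = len(self.commits) - 1
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        files = set()
        for entry in self.commits[: version + 1]:
            files -= entry["remove"]
            files |= entry["add"]
        return files


log = ToyDeltaLog()
v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

latest_files = log.snapshot()    # current state of the table
older_files = log.snapshot(v0)   # "time travel" back to the first version
```

With Delta Lake on Spark, an older snapshot is typically requested through reader options such as `versionAsOf` rather than by replaying the log by hand.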
10. The Medallion Architecture
Ingestion Tables
● No business rules or transformations of any kind
● Should be fast and easy to get new data to this layer
Refined Tables
● Prioritize speed to market and write performance - just
● Quality data expected
Feature/Agg Data Store
● Prioritize business use cases and user experience
● Precalculated, business-
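The flow through the three layers (ingestion, refined, feature/aggregate) can be sketched in plain Python. This is only an illustration under stated assumptions: the sample records, the `refine` and `aggregate` helpers, and the quality rule (amounts must parse as numbers) are all hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Ingestion layer: data lands as-is, with no business rules applied.
ingested = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "oops"},   # malformed value is still kept at this layer
    {"user": "a", "amount": "4.5"},
]


def refine(rows):
    """Refined layer: cast types and drop rows that fail a basic quality check."""
    refined_rows = []
    for row in rows:
        try:
            refined_rows.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            continue  # a real pipeline might quarantine the row instead
    return refined_rows


def aggregate(rows):
    """Feature/aggregate layer: precalculated totals per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


refined = refine(ingested)
totals = aggregate(refined)
```

Keeping the raw records untouched in the first layer means the later layers can be rebuilt whenever the cleaning or aggregation rules change.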
11. Features of Delta Lake
Data lake transactions done using the processing
engine are committed for durability and
exposed to other readers in an atomic fashion.
The transaction log enables a full audit trail
of any changes made to the data.
Automatically enforces schema
when writing and reading data
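Schema enforcement on write can be illustrated with a small check that rejects a whole batch when any row does not match the table schema. This is a toy sketch, not Delta Lake code: `EXPECTED_SCHEMA` and `validate_batch` are hypothetical names, and the all-or-nothing rejection is an assumption chosen to mirror atomic writes.

```python
EXPECTED_SCHEMA = {"id": int, "name": str}  # hypothetical table schema


def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Return the batch unchanged, or raise if any row violates the schema."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise ValueError(f"row {i}: columns {sorted(row)} do not match schema")
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                raise ValueError(
                    f"row {i}: column {column!r} is not {expected_type.__name__}"
                )
    return rows


table = []
table.extend(validate_batch([{"id": 1, "name": "alice"}]))  # accepted

try:
    # "2" is a string, so the batch is rejected before anything is appended.
    table.extend(validate_batch([{"id": "2", "name": "bob"}]))
    rejected = False
except ValueError:
    rejected = True
```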
Unification of batch and streaming
A table in Delta Lake is a batch table as well
as a streaming source and sink.
Full DML Support
Supports DML operations like deletes and updates,
as well as complex data merges (upserts).
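The semantics of an upsert (update rows that match on a key, insert the rest) can be shown with a minimal in-memory sketch. The `upsert` helper and the sample rows are hypothetical. In Delta Lake itself this kind of operation is normally expressed as a `MERGE` statement rather than hand-written code.

```python
def upsert(target, updates, key="id"):
    """Merge `updates` into `target`: update matching keys, insert the rest."""
    merged = {row[key]: dict(row) for row in target}  # copy so target is untouched
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)   # matched: update the existing row
        else:
            merged[row[key]] = dict(row)   # not matched: insert a new row
    return list(merged.values())


target = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]
updates = [{"id": 2, "city": "Noida"}, {"id": 3, "city": "Mumbai"}]
result = upsert(target, updates)
```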
Leverages Spark's distributed processing
power to handle all the metadata for
petabyte-scale tables with billions of files.
14. Delta Lake
Choose the right partition column:
If the cardinality of a column will be very high, do
not use that column for partitioning.
Amount of data in each partition: partition by a column only if you
expect each partition to hold at least 1 GB of data.
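A rough way to test a candidate partition column is to estimate how much data each partition would hold. The sketch below assumes the rule of thumb is that every partition should hold at least about 1 GB. The helper names, the sample rows, and the `bytes_per_row` estimate are hypothetical.

```python
from collections import Counter

ONE_GB = 1024 ** 3


def partition_sizes(rows, column, bytes_per_row):
    """Estimate bytes per partition if the table were partitioned by `column`."""
    counts = Counter(row[column] for row in rows)
    return {value: n * bytes_per_row for value, n in counts.items()}


def is_reasonable_partition_column(rows, column, bytes_per_row, min_bytes=ONE_GB):
    """True if every resulting partition is expected to hold at least `min_bytes`."""
    sizes = partition_sizes(rows, column, bytes_per_row)
    return all(size >= min_bytes for size in sizes.values())


# Tiny sample where each row stands in for roughly half a gigabyte of data.
rows = [
    {"country": "IN", "user_id": 1},
    {"country": "IN", "user_id": 2},
    {"country": "US", "user_id": 3},
    {"country": "US", "user_id": 4},
]
half_gb = ONE_GB // 2

country_ok = is_reasonable_partition_column(rows, "country", half_gb)
user_id_ok = is_reasonable_partition_column(rows, "user_id", half_gb)
```

The high-cardinality `user_id` column produces many small partitions and fails the check, while the low-cardinality `country` column passes.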
Improve performance on Delta Lake
A large number of small files should be rewritten
into a smaller number of larger files on a regular
basis. This is known as compaction.
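The idea behind compaction can be sketched as a simple grouping problem: pack small files into groups, then rewrite each group as one larger file. This is a toy planner under stated assumptions, not how Delta Lake implements it. `plan_compaction`, the file names, and the 100 MB target are hypothetical.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group files so each group stays within target_size where possible."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Start a new group when adding this file would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = {"f1": 40, "f2": 30, "f3": 50, "f4": 20, "f5": 60}  # sizes in MB
plan = plan_compaction(small_files, target_size=100)
```

Each group in `plan` would be rewritten as a single file, so five small files become three larger ones. Recent Delta Lake releases expose compaction through an `OPTIMIZE` command, so a hand-rolled planner like this is only for illustration.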
Enhanced checkpoints for low latency
Replace the content or schema of the
Sometimes you may want to replace a Delta table.
Difference between Delta Lake and
Parquet on Apache Spark