You’ve heard the marketing buzz, and maybe you’ve been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need help putting all the pieces together. Join us as we review common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
Building Lakehouses on Delta Lake with SQL Analytics Primer
1. Building Lakehouses on Delta Lake and SQL Analytics - A Primer
Franco Patano
Senior Solutions Architect, Databricks
@fpatano
linkedin.com/in/francopatano/
2. Wayne Dyer
If you believe it will work out, you’ll see opportunities. If you believe it won’t, you’ll see obstacles.
3. Agenda
▪ What is a Lakehouse
▪ Delta Lake Architecture
▪ Delta Engine Optimizations
▪ SQL Analytics
▪ Implementation Example
▪ Frictionless Loading
5. One platform to unify all of your data, analytics, and AI workloads
(Diagram: streaming and batch ingestion of structured, semi-structured, and unstructured data flows through Bronze (raw ingestion and history), Silver (filtered, cleaned, augmented), and Gold (business-level aggregates).)
6. Implementing Lakehouse Architecture with Delta Lake
Data usability increases from Bronze to Silver to Gold.
Ingestion into Bronze: Auto Loader, Structured Streaming, batch, COPY INTO, and partner integrations.
Bronze (raw ingestion and history):
● Land data as it is received
● Provenance to source
Silver (filtered, cleaned, augmented):
● Handle NULLs
● Fix bad dates (1970-01-01)
● Clean text fields
● Demux nested objects
● Friendly field names
Gold (business-level aggregates):
● Analytics engineering
● Business models
● Aggregates for visible dimensions
● Business-friendly field names
● Common logical views
7. Table Structure
Stats are only collected on the first 32 ordinal fields, including fields in nested structures.
• You can change this with the dataSkippingNumIndexedCols table property
Restructure data accordingly:
• Move numericals, keys, and high-cardinality query predicates to the left; move long strings that are not distinct enough for stats collection, and date/time stamps, to the right, past dataSkippingNumIndexedCols
• Long strings are kryptonite to stats collection; move these past the 32nd position, or past dataSkippingNumIndexedCols
Column layout: numericals, keys, and high-cardinality fields within the first 32 columns (or dataSkippingNumIndexedCols); long strings and date/time stamps after that.
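As a sketch of how this might look in practice (the table and column names here are hypothetical), set the Delta table property and lay out columns so the fields you filter and join on fall within the indexed range:

  -- Hypothetical example: collect stats on the first 8 columns only
  ALTER TABLE bronze.trades
  SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8');

  -- Keep keys and high-cardinality predicates in the leading positions;
  -- push long free-text and timestamp columns past the indexed range
  CREATE OR REPLACE TABLE silver.trades (
    trade_id      BIGINT,
    account_id    BIGINT,
    symbol        STRING,
    quantity      INT,
    price         DOUBLE,
    status        STRING,
    exchange      STRING,
    settlement_id BIGINT,
    comments      STRING,    -- long string, past the indexed columns
    trade_ts      TIMESTAMP  -- date/time stamp, past the indexed columns
  ) USING DELTA;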
8. Optimize and Z-Order
Optimize will bin-pack our files for better read performance.
Z-Order will organize our data for better data skipping.
What fields should you Z-Order by? Fields that are being joined on, or included in a predicate:
• Primary keys and foreign keys on dim and fact tables
• ID fields that are joined to other tables
• High-cardinality fields used in query predicates
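A minimal sketch in Databricks SQL (the table and column names are placeholders):

  -- Bin-pack small files and co-locate data on the join and filter keys
  OPTIMIZE silver.trades
  ZORDER BY (account_id, symbol);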
9. Partitioning and Z-Order effectiveness
High Cardinality: very uncommon or unique datum
● User or device IDs
● Email addresses
● Phone numbers
Regular Cardinality: common, repeatable data
● People or object names
● Street addresses
● Categories
Low Cardinality: repeatable, limited distinct data
● Gender
● Status flags
● Boolean values
Check cardinality with SELECT COUNT(DISTINCT(x)).
Partitioning is most effective toward the low-cardinality end; Z-Ordering is most effective toward the high-cardinality end.
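As a quick sketch (the table and column names are placeholders), you can gauge cardinality before choosing between a partition column and a Z-Order column:

  -- Low distinct counts suit partitioning; high distinct counts suit Z-Ordering
  SELECT COUNT(DISTINCT status)     AS distinct_statuses,
         COUNT(DISTINCT account_id) AS distinct_accounts
  FROM silver.trades;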
10. Tips for each layer
Bronze (raw ingestion and history):
✓ When loading files, land them raw
✓ When streaming, land in Delta raw
✓ Turn off stats collection (dataSkippingNumIndexedCols = 0)
✓ Optimize and Z-Order by the merge join keys between Bronze and Silver
Silver (filtered, cleaned, augmented):
✓ Restructure columns to account for the data skipping index columns
✓ Use Delta Cache enabled clusters (or enable it for other instance types, YMMV)
✓ Optimize and Z-Order by join keys or common high-cardinality query predicates
Gold (business-level aggregates):
✓ Turn up the staleness limit to align with your orchestration
✓ Use SQL Analytics for analysts
Bronze → Silver and Silver → Gold are populated with MERGE INTO on the join keys, as sketched below.
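A minimal sketch of that MERGE INTO pattern, assuming hypothetical bronze and silver trade tables keyed on trade_id:

  -- Upsert cleaned Bronze records into Silver on the merge join key
  MERGE INTO silver.trades AS s
  USING (
    SELECT trade_id, account_id, symbol, quantity, price, trade_ts
    FROM bronze.trades
    WHERE trade_ts IS NOT NULL  -- basic cleanup on the way into Silver
  ) AS b
  ON s.trade_id = b.trade_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;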
11. Databricks SQL Analytics
Delivering analytics on the freshest data
with data warehouse performance and
data lake economics
• Query your lakehouse with better price / performance
• Simplify discovery and sharing of new insights
• Connect to familiar BI tools, like Tableau or Power BI
• Simplify administration and governance
12. Why did Databricks Create SQL Analytics?
➔ Customers have standardized on data lakes
as a foundation for modern data analytics
➔ ~41% of queries on Databricks are SQL
➔ SQL Analytics was created to provide these
users with a familiar SQL editor experience
13. Easy to use SQL experience
Enable data analysts to quickly perform ad-hoc and exploratory data analysis with a new, easy-to-use SQL query editor, built-in visualizations, and dashboards. Automatic alerts can be triggered for critical changes, allowing teams to respond to business needs faster.
14. Simple administration and governance
Quickly set up SQL/BI-optimized compute with SQL endpoints. Databricks automatically determines instance types and configuration for the best price/performance. Then easily manage usage, and perform auditing and troubleshooting with query history.
15. Curated Data
Delta Lake helps build curated data lakes, so
you can store and manage all your data in
one place and standardize your big data
storage with an open format accessible from
various tools.
(Diagram: structured, semi-structured, and unstructured data is curated through Bronze (raw ingestion and history), Silver (filtered, cleaned, augmented), and Gold (business-level aggregates).)
SQL Editor: Based on Redash, the SQL native interface provides a simple and familiar experience for data analysts to explore data, query their data lakes, visualize results, share dashboards, and set up automatic alerts.
Query Execution: SQL Analytics compute clusters are powered by the Photon engine, a 100% Apache Spark-compatible vectorized query engine designed to take advantage of modern CPU architecture for extremely fast parallel processing of data.
Optimized ODBC/JDBC Drivers: Re-engineered drivers provide lower latency and less overhead, reducing round trips by 0.25 seconds. Data transfer rates are improved by 50%, and metadata retrieval operations execute up to 10x faster.
Improved Queuing and Load Balancing
SQL Analytics Endpoints extend Delta Lake’s
capabilities to better handle peaks in query
traffic and high cluster utilization.
Additionally, execution is improved for both
short and long queries.
Spot Instance Pricing
By using spot instances, SQL Analytics
Endpoints provide optimal pricing and
reliability with minimal administration.
“Unified Catalog”
The one version of the truth for your
organization. “Unity Catalog” is the data
catalog, governance, and query monitoring
solution that unifies your data in the cloud.
“Delta Sharing” offers Open Data Sharing, for
securely sharing data in the cloud.
(Architecture diagram: Databricks SQL Analytics layers the analyst experience over SQL endpoints with workload management and queuing, auto-scaling, and results caching; a vectorized execution engine with compiler, high-performance async IO, and IO caching; an admin console; and the “Unified Catalog”.)
16. Performance - The Databricks BI Stack
(Diagram: BI and SQL client connectors go through optimized ODBC/JDBC drivers to a SQL endpoint, which performs query planning and query execution against Delta Lake.)
17. Better price / performance
Run SQL queries on your lakehouse
and analyze your freshest data
with up to 4x better
price/performance than
traditional cloud data warehouses.
Source: Performance Benchmark with Barcelona Supercomputing Center
18. Common Use Cases
Collaborative exploratory data analysis on your data lake: Respond to business needs faster with a self-serve experience designed for every analyst in your organization. Databricks SQL Analytics provides simple and secure access to data, the ability to create or reuse SQL queries to analyze the data that sits directly on your data lake, and a quick way to mock up and iterate on visualizations and dashboards that best fit the business.
Data-enhanced applications: Build rich, custom data-enhanced applications for your own organization or your customers. Simplify development and leverage the price/performance and scale of Databricks SQL Analytics, all served from your data lake.
Connect existing BI tools and use one source of truth for all your data: Maximize existing investments by connecting your preferred BI tools such as Tableau or Power BI to your data lake with SQL Analytics Endpoints. Re-engineered and optimized connectors ensure fast performance, low latency, and high user concurrency to your data lake. Now analysts can use the best tool for the job on one single source of truth for your data: your data lake.
19. Common Governance Models
Enterprise Data Sources (IT or governed datasets):
GRANT USAGE, SELECT ON DATABASE CrownJewels TO users
GRANT USAGE, MODIFY ON DATABASE CrownJewels TO admin-read-write
GRANT ALL PRIVILEGES ON CATALOG TO superuser
Department/Business Unit Data (business unit datasets):
GRANT USAGE, SELECT ON DATABASE DepartmentJewels TO users
GRANT USAGE, MODIFY ON DATABASE DepartmentJewels TO department-read-write
User Level Data (self-service datasets):
GRANT USAGE, SELECT ON DATABASE MyJewels TO users
GRANT USAGE, MODIFY ON DATABASE MyJewels TO `franco@databricks.com`
20. Data Security Governance
Object hierarchy: Catalog → Database → Table/View/Function
SQL objects in Databricks are hierarchical and
privileges are inherited. This means that
granting or denying a privilege on the
CATALOG automatically grants or denies the
privilege to all databases in the catalog.
Similarly, privileges granted on a DATABASE
object are inherited by all objects in that
database.
To perform an action on a database object, a
user must have the USAGE privilege on that
database in addition to the privilege to
perform that action.
GRANT USAGE, SELECT ON Catalog TO users
GRANT USAGE, SELECT ON Database D TO users
GRANT USAGE ON DATABASE D TO users; GRANT SELECT ON TABLE T TO users;
21. Grant Access in SQL via Roles
Sync users and groups with SCIM, then apply role-based access control:
● finance-read-only (Finance users): Jill, Jon
● finance-read-write (Finance admins): Jane, Jack
● cs-read-only (Customer Service users): Fred, James, Will
● cs-read-write (Customer Service admin): Wilbur
● admin-all (Architect/Admin): Jake
GRANT USAGE, SELECT ON DATABASE F TO finance-read-only
GRANT USAGE, MODIFY ON DATABASE F TO finance-read-write
GRANT USAGE, SELECT ON DATABASE C TO cs-read-only
GRANT USAGE, MODIFY ON DATABASE C TO cs-read-write
GRANT ALL PRIVILEGES ON CATALOG TO admin-all
22. Databricks Managed Catalog (Announced in Keynote!)
Centralized metastore with integrated fine-grained security across all your workspaces.
● Define once, secure everywhere
● Centralized governance and a unified security model: passthrough and ACLs supported in the same workspaces for all your data sources
● Use exclusively or in tandem with your existing Hive Metastore
(Diagram: clusters or SQL endpoints access managed data sources and cross-workspace external tables through the managed catalog, with SQL access controls, audit logging, defined credentials, and user identity passthrough, alongside other existing data sources.)
24. TPC-DI
Data Integration (DI), also known as ETL, is the analysis, combination, and transformation of data from a variety of sources and formats into a unified data model representation. Data integration is a key element of data warehousing (and lakehousing), application integration, and business analytics.
http://www.tpc.org/tpcdi/default5.asp
25. Main Concepts of TPC-DI
TPC-DI uses the data integration of a fictitious retail brokerage firm as its model:
● Main Trading System
● Internal Human Resource System
● Internal Customer Relationship Management System
● Externally acquired data
Operations measured use the above model, but are not limited to those of a brokerage firm
They capture the variety and complexity of typical DI tasks:
● Loading of large volumes of historical data
● Loading of incremental updates
● Execution of a variety of transformation types using various input types and various target types with inter-table
relationships
● Assuring consistency of loaded data
Benchmark is technology agnostic
26. Why TPC-DI?
Data Generator
• Produces files at scales from GBs to TBs
• Produces CSV, CDC, XML, and text files
• Has historical and incremental loads
Data Model
• Transformations documented
• Dimensional Model for Analytics
27. Implementation Reference Architecture
Bronze → Silver → Gold
● OLTP CDC extract: frictionless load into Bronze
● HR DB (CSV): frictionless load into Bronze
● Prospect list (CSV): frictionless load into Bronze
● Financial newswire (multi-format): frictionless load into Bronze
● Customers (XML): frictionless load into Bronze
MERGE INTO promotes data from Bronze into Silver and Gold.
28. What is Frictionless Loading?
Autoloader
• Load files from cloud object storage
• With notifications
• Structured Streaming
• Trigger Once for Batch equivalency
• Schema Inference
• Schema Hints
• Schema Drift Handling
Delta Lake
• Streaming Source and Sink
• Checkpoint + Transaction Log = Ultimate State Store
• Schema Enforcement and Evolution
• Time Travel
• Optimize for Analytics
• Change Data Feed
Batch mode for migrating existing orchestration, and a simple code change gets you to near real-time streaming! (A SQL sketch of the batch path follows below.)
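Auto Loader itself is configured through Structured Streaming in Python or Scala; as a SQL-only sketch of the batch path shown on the architecture slide (the path and table name are hypothetical), COPY INTO gives idempotent, incremental file loading into Bronze:

  -- Idempotently load newly arrived CSV files into the Bronze table
  COPY INTO bronze.trades
  FROM '/mnt/landing/trades/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');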
29. Delta Live Tables (formerly “Delta Pipelines”) - Announced in Keynote
A simple way to build and operate ETL data flows to deliver fresh, high-quality data:
● Enable everyone to declaratively build robust data pipelines with testing baked in, with first-class support for SQL and Python (a sketch follows below)
● Databricks auto-tunes infrastructure and takes care of orchestration, failures, and retries
● Define data quality checks and data documentation in the pipeline
● Cost and latency conscious, with support for streaming vs. batch and full vs. incremental workloads
Watch the Delta Live Tables breakout from Awez!
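As a rough sketch of the declarative style (the table names and the expectation are hypothetical, and the exact syntax may differ by release), a Delta Live Tables pipeline step in SQL can combine a transformation with a data quality expectation:

  -- Declare a live table with a data quality expectation baked in
  CREATE LIVE TABLE silver_trades (
    CONSTRAINT valid_trade EXPECT (trade_id IS NOT NULL) ON VIOLATION DROP ROW
  )
  COMMENT 'Cleaned and validated trades'
  AS SELECT * FROM LIVE.bronze_trades;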
31. Related Talks
WEDNESDAY
03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn &
Alex Behm, Databricks
04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya,
Plume
THURSDAY
11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano,
Databricks
FRIDAY
10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast
& Molly Nagamuthu, Databricks
32. How to get started
On June 1
databricks.com/try