Building Lakehouses on Delta Lake and SQL Analytics: A Primer
Franco Patano
Senior Solutions Architect, Databricks
@fpatano
linkedin.com/in/francopatano/
If you believe it will work out, you’ll see opportunities. If you believe it won’t, you’ll see obstacles.
Wayne Dyer
Agenda
▪ What is Lakehouse
▪ Delta Lake Architecture
▪ Delta Engine Optimizations
▪ SQL Analytics
▪ Implementation Example
▪ Frictionless Loading
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
One platform to unify all of your data, analytics, and AI workloads
Sources: Streaming and Batch; Structured, Semi-structured, and Unstructured data
Bronze: Raw Ingestion and History
Silver: Filtered, Cleaned, Augmented
Gold: Business-level Aggregates
Implementing Lakehouse Architecture with Delta Lake
Ingestion into Bronze: AutoLoader, Structured Streaming, Batch, COPY INTO, Partners
Data usability: Bronze → Silver → Gold
Bronze: Raw Ingestion and History
• Land data as it is received
• Provenance to source
Silver: Filtered, Cleaned, Augmented
• Handle NULLS
• Fix bad dates (1970-01-01)
• Clean text fields
• Demux nested objects
• Friendly field names
Gold: Business-level Aggregates
• Analytics Engineering
• Business Models
• Aggregates for visible dimensions
• Business friendly field names
• Common logical views
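The Silver-layer cleanup steps listed above (handle NULLs, fix bad dates, clean text fields, friendly field names) can be sketched in plain Python. This is only an illustration on a single record: the field names and the `FRIENDLY_NAMES` mapping are hypothetical, and a real pipeline would apply the same rules with Spark or SQL.

```python
from datetime import date

# Hypothetical mapping from raw source names to friendly names (an assumption).
FRIENDLY_NAMES = {"cust_nm": "customer_name", "ord_dt": "order_date"}

EPOCH = date(1970, 1, 1)  # the "bad date" sentinel called out on the slide


def clean_record(raw: dict) -> dict:
    """Apply Silver-layer cleanup rules to one Bronze record."""
    cleaned = {}
    for field, value in raw.items():
        # Clean text fields: trim whitespace and treat empty strings as NULL.
        if isinstance(value, str):
            value = value.strip() or None
        # Fix bad dates: treat the epoch sentinel as an unknown date (NULL).
        if value == EPOCH:
            value = None
        # Friendly field names: rename when a mapping exists.
        cleaned[FRIENDLY_NAMES.get(field, field)] = value
    return cleaned


bronze_row = {"cust_nm": "  Ada Lovelace ", "ord_dt": date(1970, 1, 1), "qty": 3}
silver_row = clean_record(bronze_row)
print(silver_row)
```

Keeping the rules in one function makes it easy to apply the same cleanup to every Bronze record before it is merged into Silver.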
Table Structure
Stats are only collected on the first 32 ordinal fields, including fields in nested structures
• You can change this with the table property dataSkippingNumIndexedCols
Restructure data accordingly
• Move numericals, keys, and high cardinality query predicates to the left
• Move long strings that are not distinct enough for stats collection, and date/time stamps, to the right, past dataSkippingNumIndexedCols
• Long strings are kryptonite to stats collection: move these past the 32nd position, or past dataSkippingNumIndexedCols
Column order: Numerical, Keys, High Cardinality | 32 columns or dataSkippingNumIndexedCols | Long Strings, Date/Time
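A minimal sketch of the column restructuring described above, in plain Python. The `reorder_for_stats` helper and its "kind" labels are invented for this illustration; it follows the slide's rule of thumb and also suggests a value for dataSkippingNumIndexedCols that stops stats collection right after the stats-friendly columns.

```python
def reorder_for_stats(columns, indexed_cols=32):
    """Order columns so stats-friendly ones fall inside the indexed range.

    `columns` is a list of (name, kind) pairs, where kind is one of
    "numeric", "key", "predicate", "datetime", or "long_string"
    (labels invented for this sketch).
    """
    left_kinds = {"numeric", "key", "predicate"}
    # Numericals, keys, and high cardinality predicates go to the left.
    left = [name for name, kind in columns if kind in left_kinds]
    # Long strings and date/time stamps go to the right.
    right = [name for name, kind in columns if kind not in left_kinds]
    # Suggested property value: index only the stats-friendly columns,
    # never more than the configured limit.
    suggested = min(len(left), indexed_cols)
    return left + right, suggested


columns = [
    ("comment", "long_string"),
    ("customer_id", "key"),
    ("created_at", "datetime"),
    ("amount", "numeric"),
]
ordered, suggested = reorder_for_stats(columns)
print(ordered, suggested)
```

The returned order can drive the SELECT list used when writing the table, so the physical column order matches the stats window.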
Optimize and Z-Order
Optimize will bin pack our files for better read performance
Z-Order will organize our data for better data skipping
What fields should you Z-Order by?
Fields that are being joined on, or included in a predicate
• Primary Key, Foreign Keys on dim and fact tables
• ID fields that are joined to other tables
• High Cardinality fields used in query predicates
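As a small illustration of the guidance above, the helper below builds an `OPTIMIZE ... ZORDER BY` statement from a list of join or predicate columns. The function name and the table and column names are hypothetical; the generated text would be run on Databricks, for example through `spark.sql`.

```python
def optimize_statement(table, zorder_cols):
    """Build an OPTIMIZE statement, adding ZORDER BY when columns are given."""
    if not zorder_cols:
        # Bin-packing only: compact small files without reordering data.
        return f"OPTIMIZE {table}"
    cols = ", ".join(zorder_cols)
    return f"OPTIMIZE {table} ZORDER BY ({cols})"


# Hypothetical fact table joined to dimensions on two ID fields.
statement = optimize_statement("sales.fact_trade", ["customer_id", "security_id"])
print(statement)
```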
Partitioning and Z-Order effectiveness
High Cardinality: Very Uncommon or Unique Datum
● User or Device ID
● Email Address
● Phone Number
Regular Cardinality: Common Repeatable Data
● People or Object Names
● Street Addresses
● Categories
Low Cardinality: Repeatable, limited distinct data
● Gender
● Status Flags
● Boolean Values
Measure cardinality with SELECT COUNT(DISTINCT(x))
Z-Order is most effective toward the high cardinality end; partitioning is most effective toward the low cardinality end
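The `SELECT COUNT(DISTINCT(x))` check can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a lakehouse table; the `events` table, its columns, and the classification thresholds are assumptions chosen for illustration, not values from the talk.

```python
import sqlite3


def cardinality_ratio(conn, table, column):
    """Return distinct values divided by total rows for one column."""
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT({column})), COUNT(*) FROM {table}"
    ).fetchone()
    return distinct / total if total else 0.0


def classify(ratio, high=0.5, low=0.01):
    """Bucket a ratio; the thresholds are illustrative assumptions."""
    if ratio >= high:
        return "high"      # candidate for Z-Order
    if ratio <= low:
        return "low"       # candidate for partitioning
    return "regular"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, city TEXT, is_active INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, f"city_{i % 100}", i % 2) for i in range(1000)],
)

for column in ("user_id", "city", "is_active"):
    print(column, classify(cardinality_ratio(conn, "events", column)))
```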
Tips for each layer
✓ When ingesting files, land them raw
✓ When streaming, land raw in Delta
✓ Turn off stats collection
○ dataSkippingNumIndexedCols 0
✓ Optimize and Z-Order by merge
join keys between Bronze and
Silver
✓ Restructure columns to account
for data skipping index columns
✓ Use Delta Cache Enabled clusters
○ or enable it for other types YMMV
✓ Optimize and Z-Order by join keys
or common High Cardinality query
predicates
✓ Turn up Staleness Limit to align
with your orchestration
✓ Use SQL Analytics for Analysts
Bronze (Raw Ingestion and History) → MERGE INTO on join keys → Silver (Filtered, Cleaned, Augmented) → MERGE INTO on join keys → Gold (Business-level Aggregates)
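To make the role of the join keys concrete, here is a plain-Python sketch of the upsert behavior a MERGE INTO with "update when matched, insert when not matched" clauses provides. It models rows as dictionaries and is not how Delta Lake implements the operation; the `merge_into` helper and the sample rows are hypothetical.

```python
def merge_into(target, source, key):
    """Upsert `source` rows into `target`, matching on the `key` column."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(merged.values())


silver = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
bronze_batch = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]
result = merge_into(silver, bronze_batch, key="id")
print(result)
```

Every source row needs a lookup by key in the target, which is why the tips recommend Z-Ordering by the merge join keys.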
Databricks SQL Analytics
Delivering analytics on the freshest data
with data warehouse performance and
data lake economics
• Query your lakehouse with better price / performance
• Simplify discovery and sharing of new insights
• Connect to familiar BI tools, like Tableau or Power BI
• Simplify administration and governance
Why did Databricks Create SQL Analytics?
➔ Customers have standardized on data lakes
as a foundation for modern data analytics
➔ ~41% of queries on Databricks are SQL
➔ SQL Analytics was created to provide these
users with a familiar SQL editor experience
Easy to use SQL experience
Enable data analysts to quickly
perform ad-hoc and exploratory
data analysis, with a new and easy
to use SQL query editor, built-in
visualizations and dashboards.
Automatic alerts can be triggered
for critical changes, allowing you to
respond to business needs faster.
Simple administration and governance
Quickly set up SQL / BI
optimized compute with SQL
endpoints. Databricks automatically
determines instance types and
configuration for the best
price/performance. Then, easily
manage usage, perform quick
auditing, and troubleshooting with
query history.
Curated Data
Delta Lake helps build curated data lakes, so
you can store and manage all your data in
one place and standardize your big data
storage with an open format accessible from
various tools.
Structured, Semi-Structured, and Unstructured Data → Bronze (Raw Ingestion and History) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates)
SQL Editor: Based on Redash, the SQL
Native Interface provides a simple and
familiar experience for data analysts to
explore data, query their data lakes,
visualize results, share dashboards,
and set up automatic alerts.
Query Execution: SQL Analytics’s compute
clusters are powered by Photon Engine, a
100% Apache Spark-compatible vectorized
query engine designed to take advantage of
modern CPU architecture for extremely fast
parallel processing of data.
Optimized ODBC/JDBC Drivers
Re-engineered drivers provide lower latency
and less overhead to reduce round trips by
0.25 seconds. Data transfer rate is improved
50%, and metadata retrieval operations
execute up to 10x faster.
Improved Queuing and Load Balancing
SQL Analytics Endpoints extend Delta Lake’s
capabilities to better handle peaks in query
traffic and high cluster utilization.
Additionally, execution is improved for both
short and long queries.
Spot Instance Pricing
By using spot instances, SQL Analytics
Endpoints provide optimal pricing and
reliability with minimal administration.
“Unified Catalog”
The one version of the truth for your
organization. “Unity Catalog” is the data
catalog, governance, and query monitoring
solution that unifies your data in the cloud.
“Delta Sharing” offers Open Data Sharing, for
securely sharing data in the cloud.
Databricks SQL Analytics
• Analyst Experience
• Admin Console
• "Unified Catalog"
• SQL Endpoints: Workload Mgt & Queuing | Auto-scaling | Results Caching
• Vectorized Execution Engine: Compiler | High Perf. Async IO | IO Caching
• Delta Lake
Performance - The Databricks BI Stack
BI & SQL Client Connectors → ODBC/JDBC Drivers → SQL Endpoint → Query Planning → Query Execution
Better price / performance
Run SQL queries on your lakehouse
and analyze your freshest data
with up to 4x better
price/performance than
traditional cloud data warehouses.
Source: Performance Benchmark with Barcelona Supercomputing Center
Common Use Cases
Collaborative exploratory data analysis on your data lake
Respond to business needs faster with a self-service experience designed for every analyst in your organization. Databricks SQL Analytics provides simple and secure access to data, the ability to create or reuse SQL queries to analyze the data that sits directly on your data lake, and a way to quickly mock up and iterate on visualizations and dashboards that best fit the business.
Data-enhanced applications
Build rich, custom data-enhanced applications for your own organization or your customers. Simplify development and leverage the price / performance and scale of Databricks SQL Analytics, all served from your data lake.
Connect existing BI tools and use one source of truth for all your data
Maximize existing investments by connecting your preferred BI tools, such as Tableau or Power BI, to your data lake with SQL Analytics Endpoints. Re-engineered and optimized connectors ensure fast performance, low latency, and high user concurrency to your data lake. Now analysts can use the best tool for the job on a single source of truth for your data: your data lake.
Common Governance Models
Enterprise Data Sources
IT or Governed Datasets
GRANT USAGE, SELECT ON DATABASE CrownJewels TO users
GRANT USAGE, MODIFY ON DATABASE CrownJewels TO `admin-read-write`
GRANT ALL PRIVILEGES ON CATALOG TO superuser
Department/Business Unit Data
Business Unit Datasets
GRANT USAGE, SELECT ON DATABASE DepartmentJewels TO users
GRANT USAGE, MODIFY ON DATABASE DepartmentJewels TO `department-read-write`
User Level Data
Self-Service Datasets
GRANT USAGE, SELECT ON DATABASE MyJewels TO users
GRANT USAGE, MODIFY ON DATABASE MyJewels TO `franco@databricks.com`
Data Security Governance
Catalog
Database
Table/View/Function
SQL objects in Databricks are hierarchical and
privileges are inherited. This means that
granting or denying a privilege on the
CATALOG automatically grants or denies the
privilege to all databases in the catalog.
Similarly, privileges granted on a DATABASE
object are inherited by all objects in that
database.
To perform an action on a database object, a
user must have the USAGE privilege on that
database in addition to the privilege to
perform that action.
GRANT USAGE, SELECT ON CATALOG TO users
GRANT USAGE, SELECT ON DATABASE D TO users
GRANT USAGE ON DATABASE D TO users;
GRANT SELECT ON TABLE T TO users;
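The inheritance and USAGE rules described above can be modeled in a few lines of Python. This is a simplified sketch, not the Databricks implementation: it ignores DENY, represents each securable as a tuple path (an invented convention), and uses hypothetical principals.

```python
# A grant is (principal, privilege, object_path). The path is () for the
# catalog, ("D",) for database D, and ("D", "T") for table T in D.


def has_privilege(grants, principal, privilege, obj):
    """True if the privilege is granted on obj or on any ancestor of obj."""
    return any(
        (principal, privilege, obj[:depth]) in grants
        for depth in range(len(obj) + 1)
    )


def can_perform(grants, principal, privilege, obj):
    """An action needs the privilege itself plus USAGE on the database."""
    database = obj[:1]
    return has_privilege(grants, principal, privilege, obj) and has_privilege(
        grants, principal, "USAGE", database
    )


grants = {
    ("users", "USAGE", ("D",)),        # GRANT USAGE ON DATABASE D TO users
    ("users", "SELECT", ("D", "T")),   # GRANT SELECT ON TABLE T TO users
}
print(can_perform(grants, "users", "SELECT", ("D", "T")))      # table with both grants
print(can_perform(grants, "users", "SELECT", ("D", "other")))  # table without SELECT
```

A grant at the catalog level, path `()`, satisfies the check for every database and table beneath it, which mirrors the inheritance described in the text.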
Grant Access in SQL via Roles
Sync Users + Groups with SCIM
finance-read-only
finance-read-write
cs-read-only
cs-read-write
admin-all
Jill, Jon
Jane, Jack
Fred, James, Will
Wilbur
Finance Users
Finance Admin
Jake
Customer Service
Users
Customer Service
Admin
Architect/Admin
GRANT USAGE, SELECT ON DATABASE F TO `finance-read-only`
GRANT USAGE, MODIFY ON DATABASE F TO `finance-read-write`
GRANT USAGE, SELECT ON DATABASE C TO `cs-read-only`
GRANT ALL PRIVILEGES ON CATALOG TO `admin-all`
GRANT USAGE, MODIFY ON DATABASE C TO `cs-read-write`
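When groups are synced from an identity provider, grants like the ones above can be generated from a role map instead of being written by hand. The sketch below is one possible approach: the `ROLE_GRANTS` structure is hypothetical, and wrapping group names in backticks assumes that names containing hyphens must be quoted.

```python
# Hypothetical role map: group name -> (privileges, database).
ROLE_GRANTS = {
    "finance-read-only": (["USAGE", "SELECT"], "F"),
    "finance-read-write": (["USAGE", "MODIFY"], "F"),
    "cs-read-only": (["USAGE", "SELECT"], "C"),
    "cs-read-write": (["USAGE", "MODIFY"], "C"),
}


def grant_statement(group, privileges, database):
    """Build one GRANT statement, quoting the group name with backticks."""
    privs = ", ".join(privileges)
    return f"GRANT {privs} ON DATABASE {database} TO `{group}`"


statements = [
    grant_statement(group, privileges, database)
    for group, (privileges, database) in ROLE_GRANTS.items()
]
for statement in statements:
    print(statement)
```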
Role Based Access Control
Managed
Data Source
Managed
Data Source
Cluster or SQL
Endpoint Managed Catalog
Cross-Workspace
External Tables
SQL access
controls
Audit
log
Defined
Credentials
Other Existing
Data Sources
User Identity
Passthrough
Define Once, Secure Everywhere
Centralized Governance/Unified Security Model: passthrough and ACLs supported in the same workspaces for all your data sources
Use exclusively or in tandem with your existing Hive
Metastore
Centralized metastore with integrated fine-grained security across all your workspaces
Databricks Managed Catalog
Announced in Keynote!
Implementation Example with Frictionless ETL
TPC-DI
Data Integration (DI), also known as ETL, is the analysis,
combination, and transformation of data from a variety of
sources and formats into a unified data model
representation. Data Integration is a key element of data
warehousing (here, lakehousing), application integration, and
business analytics.
http://www.tpc.org/tpcdi/default5.asp
Main Concepts of TPC-DI
TPC-DI uses data integration of a fictitious Retail Brokerage Firm as a model:
● Main Trading System
● Internal Human Resource System
● Internal Customer Relationship Management System
● Externally acquired data
Operations measured use the above model, but are not limited to those of a brokerage firm
They capture the variety and complexity of typical DI tasks:
● Loading of large volumes of historical data
● Loading of incremental updates
● Execution of a variety of transformation types using various input types and various target types with inter-table
relationships
● Assuring consistency of loaded data
Benchmark is technology agnostic
Why TPC-DI?
Data Generator
• Produces scales of files from GBs to TBs
• Produces CSV, CDC, XML, and Text files
• Has historical and incremental data
Data Model
• Transformations documented
• Dimensional Model for Analytics
Implementation Reference Architecture
Bronze → Silver → Gold, with MERGE INTO moving data between layers
Sources loaded into Bronze:
• OLTP: CDC Extract, Frictionless Load
• HR DB: CSV, Frictionless Load
• Prospect List: CSV, Frictionless Load
• Financial Newswire: Multi Format, Frictionless Load
• Customers: XML, Frictionless Load
What is Frictionless Loading?
Autoloader
• Load files from cloud object storage
• With notifications
• Structured Streaming
• Trigger Once for Batch equivalency
• Schema Inference
• Schema Hints
• Schema Drift Handling
Delta Lake
• Streaming Source and Sink
• Checkpoint + Transaction Log = Ultimate State Store
• Schema Enforcement and Evolution
• Time Travel
• Optimize for Analytics
• Change Data Feed
Batch mode for migrating existing orchestration, simple code change for near real-time streaming!
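A rough sketch of how the Auto Loader and Delta Lake pieces above might fit together in PySpark. The helper names and all paths are hypothetical, and `load_to_bronze` needs a Databricks `spark` session, so only the option-building part runs outside a cluster. Setting `batch=False` is the "simple code change" that leaves the stream running for near real-time loading.

```python
def autoloader_options(file_format, schema_location, schema_hints=None,
                       use_notifications=False):
    """Collect Auto Loader (cloudFiles) reader options in one place."""
    options = {
        "cloudFiles.format": file_format,
        # Where the inferred schema is tracked across runs.
        "cloudFiles.schemaLocation": schema_location,
    }
    if schema_hints:
        options["cloudFiles.schemaHints"] = schema_hints
    if use_notifications:
        options["cloudFiles.useNotifications"] = "true"
    return options


def load_to_bronze(spark, source_path, target_path, checkpoint_path, options,
                   batch=True):
    """Stream files into a Delta table; requires a Databricks `spark` session."""
    reader = spark.readStream.format("cloudFiles").options(**options)
    writer = (
        reader.load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")  # let new columns reach the Delta table
    )
    if batch:
        # Trigger Once: process what is available, then stop (batch equivalency).
        writer = writer.trigger(once=True)
    return writer.start(target_path)


# Hypothetical locations and hint, for illustration only.
opts = autoloader_options("csv", "/tmp/schemas/hr", schema_hints="employee_id INT")
print(opts)
```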
Delta Live Tables (formerly “Delta Pipelines”)
Announced in Keynote
A simple way to build and operate ETL data flows to deliver fresh, high quality data
• Enable everyone to declaratively build robust data pipelines with testing baked-in. First-class support for SQL and Python.
• Databricks auto-tunes infrastructure and takes care of orchestration, failures and retries
• Define data quality checks and data documentation in the pipeline
• Cost and latency conscious, with support for streaming vs. batch and full vs. incremental workloads
Watch the Breakout from Awez, Delta Live Tables!
Demo Time
Related Talks
WEDNESDAY
03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn &
Alex Behm, Databricks
04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya,
Plume
THURSDAY
11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano,
Databricks
FRIDAY
10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast
& Molly Nagamuthu, Databricks
How to get started
On June 1
databricks.com/try

Mais conteúdo relacionado

Mais procurados

Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motionconfluent
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...HostedbyConfluent
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceDenodo
 

Mais procurados (20)

Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and Governance
 

Semelhante a Building Lakehouses on Delta Lake with SQL Analytics Primer

SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
 
AnalysisServices
AnalysisServicesAnalysisServices
AnalysisServiceswebuploader
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Trivadis
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsKellyn Pot'Vin-Gorman
 
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICSBIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICSTIBCO Spotfire
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekMark Kromer
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 

Semelhante a Building Lakehouses on Delta Lake with SQL Analytics Primer (20)

SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
AnalysisServices
AnalysisServicesAnalysisServices
AnalysisServices
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI Options
 
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICSBIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 

Mais de Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Building Lakehouses on Delta Lake with SQL Analytics Primer

  • 1. Building Lakehouses on Delta Lake and SQL Analytics- A Primer Franco Patano Senior Solutions Architect, Databricks @fpatano linkedin.com/in/francopatano/
  • 2. Wayne Dyer If you believe it will work out, you’ll see opportunities. If you believe it won’t you’ll see obstacles.
  • 3. Agenda ▪ What is Lakehouse ▪ Delta Lake Architecture ▪ Delta Engine Optimizations ▪ SQL Analytics ▪ Implementation Example ▪ Frictionless Loading
  • 4. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 5. Streaming Batch One platform to unify all of your data, analytics, and AI workloads Filtered, Cleaned, Augmented Silver Business-level Aggregates Gold Semi-structured Unstructured Structured Raw Ingestion and History Bronze
  • 6. Implementing Lakehouse Architecture with Delta Lake Bronze Silver Gold Data usability Raw Ingestion, and History Filtered, Cleaned, Augmented Business-level Aggregates AutoLoader Structured Streaming Batch COPY INTO Partners Land data as it is received Provenance to source Handle NULLS Fix bad dates (1970-01-01) Clean text fields Demux nested objects Friendly field names Analytics Engineering Business Models Aggregates for visible dimensions Business friendly field names Common logical views
  • 7. Table Structure Stats are only collected on the first 32 ordinal fields, including fields in nested structures • You can change this with this property: dataSkippingNumIndexedCols Restructure data accordingly • Move numericals, keys, high cardinality query predicates to the left, long strings that are not distinct enough for stats collection, and date/time stamps to the right past the dataSkippingNumIndexedCols • Long strings are kryptonite to stats collection, move these to past the 32nd position, or past dataSkippingNumIndexedCols Numerical, Keys, High Cardinality Long Strings, Date/Time 32 columns or dataSkippingNumIndexedCols
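The column-ordering advice on slide 7 can be sketched as a small helper. This is a minimal illustration, not the speaker's code: the `Column` class, the `kind` labels, and the table name are hypothetical, and the only Delta-specific piece is the table property `delta.dataSkippingNumIndexedCols`.

```python
from dataclasses import dataclass

# Hypothetical labels for this sketch; the slide only distinguishes
# stats-friendly fields from long strings and date/time stamps.
STATS_FRIENDLY = {"numeric", "key", "predicate"}


@dataclass(frozen=True)
class Column:
    name: str
    kind: str  # e.g. "numeric", "key", "predicate", "long_string", "timestamp"


def reorder_for_data_skipping(columns):
    """Move stats-friendly columns to the left, everything else to the right.

    Returns the reordered list and the number of leading columns that
    are worth collecting statistics on.
    """
    left = [c for c in columns if c.kind in STATS_FRIENDLY]
    right = [c for c in columns if c.kind not in STATS_FRIENDLY]
    return left + right, len(left)


def set_indexed_cols_sql(table, num_cols):
    """Build the statement that limits stats collection to the first num_cols columns."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.dataSkippingNumIndexedCols' = '{num_cols}')"
    )


columns = [
    Column("notes", "long_string"),
    Column("id", "key"),
    Column("amount", "numeric"),
    Column("created_at", "timestamp"),
]
ordered, indexed = reorder_for_data_skipping(columns)
print([c.name for c in ordered])
print(set_indexed_cols_sql("sales.orders", indexed))  # table name is made up
```

The relative order inside each group is preserved, so the helper only moves columns across the stats boundary.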
  • 8. Optimize and Z-Order Optimize will bin pack our files for better read performance Z-Order will organize our data for better data skipping What fields should you Z-Order by? Fields that are being joined on, or included in a predicate • Primary Keys, Foreign Keys on dim and fact tables • ID fields that are joined to other tables • High Cardinality fields used in query predicates
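As a rough sketch of slide 8, the helper below assembles an `OPTIMIZE ... ZORDER BY` statement for a set of join or predicate columns. The function name, table name, and column names are hypothetical; the statement would be run through whatever SQL interface is in use.

```python
def optimize_zorder_sql(table, zorder_columns):
    """Build an OPTIMIZE statement that bin-packs files and Z-Orders by the given columns."""
    if not zorder_columns:
        raise ValueError("Z-Order needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_columns)})"


# Hypothetical fact table joined on customer_id and filtered by order_id.
print(optimize_zorder_sql("gold.fact_orders", ["customer_id", "order_id"]))
```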
  • 9. Partitioning and Z-Order effectiveness High Cardinality Regular Cardinality Low Cardinality Very Uncommon or Unique Datum ● User or Device ID ● Email Address ● Phone Number Common Repeatable Data ● People or Object Names ● Street Addresses ● Categories Repeatable, limited distinct data ● Gender ● Status Flags ● Boolean Values SELECT COUNT(DISTINCT(x)) Partitioning effectiveness Z-Order effectiveness
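Slide 9 ties the choice between partitioning and Z-Ordering to how many distinct values a field has. The sketch below turns that into a rule of thumb over a sample of values. The two thresholds are assumptions chosen for illustration, not figures from the talk, and should be tuned against real data.

```python
def suggest_layout(values, low_max_distinct=50, high_min_ratio=0.5):
    """Suggest a layout strategy from a sample of a column's values.

    low_max_distinct and high_min_ratio are illustrative thresholds only.
    Returns "partition", "zorder", or "evaluate".
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    distinct = len(set(values))      # same idea as SELECT COUNT(DISTINCT(x))
    ratio = distinct / len(values)
    if ratio >= high_min_ratio:
        return "zorder"              # unique-ish data such as user or device IDs
    if distinct <= low_max_distinct:
        return "partition"           # few repeated values such as status flags
    return "evaluate"                # middle ground: measure before deciding


print(suggest_layout(["active", "inactive"] * 500))
print(suggest_layout(range(1000)))
print(suggest_layout(f"name{i % 200}" for i in range(1000)))
```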
  • 10. Tips for each layer ✓ When files, land raw ✓ When streaming, land in delta raw ✓ Turn off stats collection ○ dataSkippingNumIndexedCols 0 ✓ Optimize and Z-Order by merge join keys between Bronze and Silver ✓ Restructure columns to account for data skipping index columns ✓ Use Delta Cache Enabled clusters ○ or enable it for other types YMMV ✓ Optimize and Z-Order by join keys or common High Cardinality query predicates ✓ Turn up Staleness Limit to align with your orchestration ✓ Use SQL Analytics for Analysts Business-level Aggregates Filtered, Cleaned, Augmented Raw Ingestion, and History Bronze Silver Gold MERGE INTO JOIN KEYS MERGE INTO JOIN KEYS
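The "MERGE INTO JOIN KEYS" arrows on slide 10 can be illustrated with a statement builder. This is a minimal sketch under assumptions: the table and key names are hypothetical, and `UPDATE SET *` / `INSERT *` presumes the source and target share the same columns. The same keys are the ones the slide suggests Z-Ordering by.

```python
def merge_into_sql(target, source, join_keys):
    """Build an upsert from source into target, matching rows on join_keys."""
    if not join_keys:
        raise ValueError("MERGE needs at least one join key")
    condition = " AND ".join(f"t.{key} = s.{key}" for key in join_keys)
    return " ".join([
        f"MERGE INTO {target} AS t",
        f"USING {source} AS s",
        f"ON {condition}",
        "WHEN MATCHED THEN UPDATE SET *",
        "WHEN NOT MATCHED THEN INSERT *",
    ])


# Hypothetical Bronze-to-Silver upsert keyed on customer_id.
print(merge_into_sql("silver.customers", "bronze.customer_updates", ["customer_id"]))
```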
  • 11. Databricks SQL Analytics Delivering analytics on the freshest data with data warehouse performance and data lake economics • Query your lakehouse with better price / performance • Simplify discovery and sharing of new insights • Connect to familiar BI tools, like Tableau or Power BI • Simplify administration and governance
  • 12. Why did Databricks Create SQL Analytics? ➔ Customers have standardized on data lakes as a foundation for modern data analytics ➔ ~41% of queries on Databricks are SQL ➔ SQL Analytics was created to provide these users with a familiar SQL editor experience
  • 13. Easy to use SQL experience Enable data analysts to quickly perform ad-hoc and exploratory data analysis, with a new and easy to use SQL query editor, built-in visualizations and dashboards. Automatic alerts can be triggered for critical changes, allowing you to respond to business needs faster.
  • 14. Simple administration and governance Quickly set up SQL / BI optimized compute with SQL endpoints. Databricks automatically determines instance types and configuration for the best price/performance. Then, easily manage usage, audit quickly, and troubleshoot with query history.
  • 15. Curated Data Delta Lake helps build curated data lakes, so you can store and manage all your data in one place and standardize your big data storage with an open format accessible from various tools. Curated Data Structured, Semi-Structured, and Unstructured Data Filtered, Cleaned, Augmented Silver Raw Ingestion and History Bronze Business-level Aggregates Gold SQL Editor: Based on Redash, the SQL Native Interface provides a simple and familiar experience for data analysts to explore data, query their data lakes, visualize results, share dashboards, and set up automatic alerts. Query Execution: SQL Analytics compute clusters are powered by Photon Engine, a 100% Apache Spark-compatible vectorized query engine designed to take advantage of modern CPU architecture for extremely fast parallel processing of data. Optimized ODBC/JDBC Drivers: Re-engineered drivers provide lower latency and less overhead to reduce round trips by 0.25 seconds. Data transfer rate is improved 50%, and metadata retrieval operations execute up to 10x faster. Improved Queuing and Load Balancing: SQL Analytics Endpoints extend Delta Lake’s capabilities to better handle peaks in query traffic and high cluster utilization. Additionally, execution is improved for both short and long queries. Spot Instance Pricing: By using spot instances, SQL Analytics Endpoints provide optimal pricing and reliability with minimal administration. “Unified Catalog”: The one version of the truth for your organization. “Unity Catalog” is the data catalog, governance, and query monitoring solution that unifies your data in the cloud. “Delta Sharing” offers Open Data Sharing, for securely sharing data in the cloud. Vectorized Execution Engine Compiler | High Perf. Async IO | IO Caching Admin Console SQL Endpoints Workload Mgt & Queuing | Auto-scaling | Results Caching Analyst Experience "Unified Catalog" Databricks SQL Analytics
  • 16. DELTA Lake ODBC/JDBC Drivers BI & SQL Client Connectors SQL End Point Query Planning Query Execution Performance - The Databricks BI Stack
  • 17. Better price / performance Run SQL queries on your lakehouse and analyze your freshest data with up to 4x better price/performance than traditional cloud data warehouses. Source: Performance Benchmark with Barcelona Supercomputing Center
  • 18. Common Use Cases Collaborative exploratory data analysis on your data lake Data-enhanced applications Connect existing BI tools and use one source of truth for all your data Respond to business needs faster with a self-service experience designed for every analyst in your organization. Databricks SQL Analytics provides simple and secure access to data, the ability to create or reuse SQL queries to analyze the data that sits directly on your data lake, and a quick way to mock up and iterate on visualizations and dashboards that best fit the business. Build rich and custom data-enhanced applications for your own organization or your customers. Simplify development and leverage the price / performance and scale of Databricks SQL Analytics, all served from your data lake. Maximize existing investments by connecting your preferred BI tools such as Tableau or PowerBI to your data lake with SQL Analytics Endpoints. Re-engineered and optimized connectors ensure fast performance, low latency, and high user concurrency to your data lake. Now analysts can use the best tool for the job on one single source of truth for your data: your data lake.
  • 19. Common Governance Models Enterprise Data Sources IT or Governed Datasets GRANT USAGE, SELECT ON DATABASE CrownJewels TO users GRANT USAGE, MODIFY ON DATABASE CrownJewels TO admin-read-write GRANT ALL PRIVILEGES ON CATALOG TO superuser Department/Business Unit Data Business Unit Datasets GRANT USAGE, SELECT ON DATABASE DepartmentJewels TO users GRANT USAGE, MODIFY ON DATABASE DepartmentJewels TO department-read-write User Level Data Self-Service Datasets GRANT USAGE, SELECT ON DATABASE MyJewels TO users GRANT USAGE, MODIFY ON DATABASE MyJewels TO `franco@databricks.com`
  • 20. Data Security Governance Catalog Database Table/View/Function SQL objects in Databricks are hierarchical and privileges are inherited. This means that granting or denying a privilege on the CATALOG automatically grants or denies the privilege to all databases in the catalog. Similarly, privileges granted on a DATABASE object are inherited by all objects in that database. To perform an action on a database object, a user must have the USAGE privilege on that database in addition to the privilege to perform that action. GRANT USAGE, SELECT ON Catalog TO users GRANT USAGE, SELECT ON Database D TO users GRANT USAGE ON DATABASE D to users; GRANT SELECT ON TABLE T TO users;
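The inheritance rule on slide 20 can be modelled in a few lines. This is a toy model for intuition only, not how Databricks evaluates privileges internally: the class and method names are invented, the single catalog is represented by the literal `"catalog"`, and DENY is left out.

```python
class PrivilegeModel:
    """Toy model of hierarchical grants: catalog -> database -> table."""

    def __init__(self):
        self._grants = set()  # (principal, privilege, securable path)

    def grant(self, principal, privilege, path):
        """Record a grant on a securable, e.g. ("catalog", "D", "T")."""
        self._grants.add((principal, privilege, tuple(path)))

    def _has(self, principal, privilege, path):
        # A grant on the object itself or on any ancestor is inherited.
        return any(
            (principal, privilege, path[:depth]) in self._grants
            for depth in range(1, len(path) + 1)
        )

    def can_select(self, principal, database, table):
        """SELECT on a table also requires USAGE on its database."""
        db_path = ("catalog", database)
        return (
            self._has(principal, "USAGE", db_path)
            and self._has(principal, "SELECT", db_path + (table,))
        )


model = PrivilegeModel()
model.grant("users", "SELECT", ("catalog", "D", "T"))
print(model.can_select("users", "D", "T"))  # SELECT alone, no USAGE yet
model.grant("users", "USAGE", ("catalog", "D"))
print(model.can_select("users", "D", "T"))
```

Granting USAGE and SELECT once at the catalog level makes `can_select` true for every table, which mirrors the first GRANT on the slide.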
  • 21. Grant Access in SQL via Roles Sync Users + Groups with SCIM finance read-only finance-read-write cs-read-only cs-read-write admin-all Jill, Jon Jane, Jack Fred, James, Will Wilbur Finance Users Finance Admin Jake Customer Service Users Customer Service Admin Architect/Admin GRANT USAGE, SELECT ON DATABASE F TO finance-read-only GRANT USAGE, MODIFY ON DATABASE F TO finance-read-write GRANT USAGE, SELECT ON DATABASE C TO cs-read-only GRANT ALL PRIVILEGES ON CATALOG TO admin-all GRANT USAGE, MODIFY ON DATABASE C TO cs-read-write Role Based Access Control
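The role-based grants on slide 21 lend themselves to being generated from a mapping rather than typed by hand. The sketch below assumes the group names and databases shown on the slide. The dictionary layout and function name are hypothetical, and wrapping the hyphenated group names in backticks is an assumption about quoting, so check it against your workspace.

```python
# Group name -> (database, privileges), as shown on the slide.
ROLE_GRANTS = {
    "finance-read-only": ("F", ("USAGE", "SELECT")),
    "finance-read-write": ("F", ("USAGE", "MODIFY")),
    "cs-read-only": ("C", ("USAGE", "SELECT")),
    "cs-read-write": ("C", ("USAGE", "MODIFY")),
}


def grant_statements(role_grants):
    """Render one GRANT statement per group, in mapping order."""
    return [
        f"GRANT {', '.join(privileges)} ON DATABASE {database} TO `{group}`"
        for group, (database, privileges) in role_grants.items()
    ]


for statement in grant_statements(ROLE_GRANTS):
    print(statement)
```

Keeping the mapping in one place means a new group synced through SCIM only needs a single new entry.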
  • 22. Managed Data Source Managed Data Source Cluster or SQL Endpoint Managed Catalog Cross-Workspace External Tables SQL access controls Audit log Defined Credentials Other Existing Data Sources User Identity Passthrough Define Once, Secure Everywhere Centralized Governance/Unified Security Model- passthrough and ACLs supported in the same workspaces for all your data sources Use exclusively or in tandem with your existing Hive Metastore Centralized metastore with integrated fine-grained security across all your workspaces Databricks Managed Catalog Announced in Keynote!
  • 23. Implementation Example with Frictionless ETL
  • 24. TPC-DI Data Integration (DI), also known as ETL, is the analysis, combination, and transformation of data from a variety of sources and formats into a unified data model representation. Data Integration is a key element of data warehousing and lakehousing, application integration, and business analytics. http://www.tpc.org/tpcdi/default5.asp
  • 25. Main Concepts of TPC-DI TPC-DI uses data integration of a fictitious Retail Brokerage Firm as its model: ● Main Trading System ● Internal Human Resource System ● Internal Customer Relationship Management System ● Externally acquired data Operations measured use the above model, but are not limited to those of a brokerage firm They capture the variety and complexity of typical DI tasks: ● Loading of large volumes of historical data ● Loading of incremental updates ● Execution of a variety of transformation types using various input types and various target types with inter-table relationships ● Assuring consistency of loaded data Benchmark is technology agnostic
  • 26. Why TPC-DI? Data Generator • Produces scales of files from GBs to TBs • Produces CSV, CDC, XML, and Text files • Has historical and incremental Data Model • Transformations documented • Dimensional Model for Analytics
  • 27. Implementation Reference Architecture Bronze Silver Gold OLTP CDC Extract Frictionless Load HR DB CSV Frictionless Load Prospect List CSV Frictionless Load Financial Newswire Multi Format Frictionless Load Customers XML Frictionless Load MERGE INTO
  • 28. What is Frictionless Loading? Autoloader • Load files from cloud object storage • With notifications • Structured Streaming • Trigger Once for Batch equivalency • Schema Inference • Schema Hints • Schema Drift Handling Delta Lake • Streaming Source and Sink • Checkpoint + Transaction Log = Ultimate State Store • Schema Enforcement and Evolution • Time Travel • Optimize for Analytics • Change Data Feed Batch mode for migrating existing orchestration, simple code change for near real-time streaming!
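A frictionless load as described on slide 28 might look roughly like the sketch below, which reads files with Auto Loader and writes them to a Delta table. It assumes a Databricks runtime where the `cloudFiles` source is available and a `spark` session is passed in. The function names, paths, and table name are hypothetical, and only the options builder can run outside a cluster.

```python
def autoloader_options(file_format, schema_location, use_notifications=True, schema_hints=None):
    """Collect Auto Loader options: format, schema tracking, notifications, hints."""
    options = {
        "cloudFiles.format": file_format,
        # Where the inferred schema is stored and tracked as it drifts.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }
    if schema_hints:
        options["cloudFiles.schemaHints"] = schema_hints
    return options


def start_bronze_load(spark, source_path, target_table, checkpoint_path, options):
    """Load new files once, batch style, into a Bronze Delta table."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)  # drop this line for continuous, near real-time loading
        .toTable(target_table)
    )


# Hypothetical paths; only the options are built here, no cluster needed.
opts = autoloader_options("csv", "/mnt/schemas/prospects", schema_hints="prospect_id BIGINT")
print(opts)
```

The `trigger(once=True)` line is the "simple code change" the slide refers to: with it the stream behaves like a scheduled batch job, and without it the same code runs as a continuous stream.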
  • 29. Enable everyone to declaratively build robust data pipelines with testing baked-in. First-class support for SQL and Python. Databricks auto-tunes infrastructure and takes care of orchestration, failures and retries Define data quality checks and data documentation in the pipeline Cost and latency conscious, with support for streaming vs. batch and full vs. incremental workloads A simple way to build and operate ETL data flows to deliver fresh, high quality data Delta Live Tables (formerly “Delta Pipelines”) Announced in Keynote Watch the Breakout from Awez, Delta Live Tables!
  • 31. Related Talks WEDNESDAY 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm, Databricks 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume THURSDAY 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks FRIDAY 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast & Molly Nagamuthu, Databricks
  • 32. How to get started On June 1 databricks.com/try