SlideShare uma empresa Scribd logo
1 de 46
The Databricks
Platform Introduction
All your data, analytics and
AI on one platform
Alex Ivanichev
March 2022
DataBricks is a unified & open Data
and Analytics Platform
What is DataBricks ?
Modern Data
Teams
5
Data Engineers Data Scientists
Data Analysts
How the data management looks
like today ?
Data management complexity
Siloed stacks increase data architecture complexity
Data Warehousing Data Engineering
Streaming
decrease productivity
Data Science and ML
Data Analysts Data Engineers Data Engineers
Disconnected systems and proprietary data formats make integration difficult
Data Scientists
Amazon Redshift
Azure Synapse
Snowflake
SAP
Teradata
Google BigQuery
IBM Db2
Oracle Autonomous
Data Warehouse
Hadoop Apache Airflow Apache Kafka Apache Spark Jupyter Amazon SageMaker
Amazon EMR Apache Spark Apache Flink Amazon Kinesis Azure ML Studio MatLAB
Google Dataproc Cloudera Azure Stream Analytics Google Dataflow Domino Data Labs SAS
Tibco Spotfire Confluent TensorFlow PyTorch
Extract Load
Transform Real-time Database
Analytics and BI
Data marts Data prep
Machine
Learning
Data
Science
Streaming Data Engine
Data Lake Data Lake
Data warehouse
Structured, semi-
structured
and unstructured data
Structured, semi-
structured
and unstructured data
Structured data
Streaming data sources
5
Data Warehouse Data Lake
vs.
Warehouses and lakes create complexity
Two separate copies of the data
Warehouses
Proprietary
Lakes
Open
Incompatible interfaces
Warehouses
SQL
Lakes
Python
Incompatible security and governance models
Warehouses
Tables
Lakes
Files
Data Warehouse Data Lake
Streaming
Analytics
B
I
Data
Science
Machine
Learning
Structured, Semi-Structured and Unstructured Data
Data Lakehouse
One platform to unify all of
your data, analytics, and AI workloads
Why choose Databricks ?
The data lakehouse offers a better path
Data processing and management built on open source and open
standards
Common security, governance, and administration
Modern Data
Engineering
Analytics and Data
Warehousing
Data Science
and ML
Integrated and collaborative role-based experiences with
open API’s
Cloud Data Lake
Structured, semi-structured, and unstructured data
Lake-first approach that builds upon where
the freshest, most complete data resides
AI/ML from the ground up
High reliability and performance Single
approach to managing data
Support for all use cases on a single
platform:
• Data engineering
• Data warehousing
• Real time streaming
• Data science and ML
Built on open source and open standards
Multi-cloud, work with your cloud of choice
The Data Lakehouse
Foundation
©2021 Databricks Inc. — All rights r eserved
An open approach to bringing data
management and governance to data
lakes
Better reliability with transactions
48x faster data processing with indexing
Data governance at scale with fine-
grained access control lists
Data
Warehouse
Data
Lake
What is Delta Lake?
● A open source project that enables building a Lakehouse architecture on top of data lakes.
● An storage layer that brings scalable, ACID transactions to Apache Spark and other big-data
engines.
● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and
batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Delta Lake change data feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Delete
● 100% Compatible with Apache Spark API
● Data Clean-up
Key Features:
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
Delta Lake solves challenges with data lakes
RELIABILITY &
QUALITY
PERFORMANCE &
LATENCY
GOVERNANCE
ACID transactions
Advanced indexing & caching
Governance with Data Catalogs
Delta Lake key feature - ACID transaction
● Add File: It adds the data file
● Remove File: It removes the data file
● Update Metadata: It updates the table metadata.
● Set Transaction: It records that a structure streaming job created a micro-batch with ID
● Change Protocol: Makes more secure by transferring Delta Lakes to the latest securing
protocol.
● Commit Info: It contains the information about the Commits.
State Recomputing With Checkpoint Files
● Delta Lake automatically generates checkpoint files every 10 commits
● Delta Lake saves a checkpoint file in Parquet format in the same _delta_log subdirectory.
Building the foundation of a Lakehouse
Filtered, Cleaned,
Augmented
Business-level
Aggregates
Greatly improve the quality of your data for end users
BRONZE SILVER
GOLD
Raw Ingestion
and History
Kinesis
CSV,
JSON, TXT…
Data Lake
Quality
BI &
Reporting
Streaming
Analytics
Data Science &
ML
But the reality is not so simple
Maintaining data quality and reliability at scale is complex and brittle
CSV,
JSON, TXT…
Data
Lake
Kinesis
BI &
Reporting
Streaming
Analytics
Data Science &
ML
Modern data engineering on the lakehouse
Data Engineering on the Databricks Lakehouse Platform
Open format storage
Data transformation
Scheduling &
orchestration
Automatic deployment & operations
BI / Reporting
Dashboarding
Machine
Learning / Data
Science
Data & ML
Sharing
Data Products
Databases
Streaming
Sources
Cloud Object
Stores
SaaS
Applications
NoSQL
On-premises
Systems
Data Sources Data Consumers
Observability, lineage, and end-to-end pipeline visibility
Data quality management
Data
ingestion
Data Science & Engineering
Workspace
Databricks Workspaces: Clusters
It is a set of computation resources where a developer can run Data Analytics,
Data Science, or Data Engineering workloads.
The workloads can be executed in the form of a set of commands written in a notebook
Databricks Workspaces: Notebooks
It is a Web Interface where a developer can write and execute codes. Notebook contains
a sequence of runnable cells that helps a developer to work with files, manipulate
tables, create visualizations, and add narrative texts
Databricks Workspaces: AutoLoader
Auto Loader incrementally and efficiently processes new data files as they arrive in
cloud storage. Auto Loader can load data files from Google Cloud Storage (GCS, gs://) in
addition to Databricks File System (DBFS, dbfs:/)
** Supports: JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats. **
val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
val write_path = "/tmp/delta/population_data"
// Set up the stream to begin reading incoming files from the
// upload_path location.
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", "true")
.schema("city string, year int, population long")
.load(upload_path)
// Start the stream.
// Use the checkpoint_path location to keep a record of all files that
// have already been uploaded to the upload_path location.
// For those that have been uploaded since the last check,
// write the newly-uploaded files' data to the write_path location.
df.writeStream.format("delta")
.option("checkpointLocation", checkpoint_path)
.start(write_path)
https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html
Databricks Workspaces: Jobs
Jobs allow a user to run notebooks on a scheduled basis. It is a method of executing or
automating specific tasks like ETL, Model Building, and more.
The pipeline of the ML workflow can be organized
into jobs so that it sequentially runs the series of
steps one after another
Databricks Workspaces:Delta Live Tables
Delta Live Tables is a framework designed to enable declaratively define, deploy, test &
upgrade data pipelines and eliminate operational burdens associated with the
management of such pipelines.
Databricks Workspaces: Repos
To empower the process of ML application development, repo’s provide repository-
level integration with Git-based hosting providers such as GitHub, GitLab, bitBucket,
and Azure DevOps
Developers can write code in a Notebook
and Sync it with the hosting provider,
allowing developers to clone, manage
branches, push changes, pull changes,
etc.
Databricks Workspaces: Models
It refers to a Developer’s ML Workflow Model registered in the MLflow Model Registry,
a centralized model store that manages the entire life cycle of MLflow models.
MLflow Model Registry provides all
the information about modern
lineage, model versioning, present
condition, workflow, and stage
transition (whether promoted to
production or archived).
Governance requirements for
data are quickly evolving
Governance is hard to enforce on data lakes
42
Cloud 2
Cloud 3
Structured
Semi-structured
Unstructured
Streaming
Cloud 1
The problem is getting bigger
Enterprises need a way to share and govern a wide variety of data products
Files Dashboards Models Tables
Unity Catalog for Lakehouse Governance
• Centrally catalog, Search, and discover
data and AI assets
• Simplify governance with a unified Cross- cloud
governance model
• Easily integrate with your existing
Enterprise Data Catalogs
• Securely share live data across platforms
with delta sharing
Delta Sharing on Databricks
Delta Lake
Table
Delta Sharing
Server
Delta Sharing
Protocol
Data
Provider
Data Recipient
Any Sharing Client
Access
permissions
Machine Learning
Workspace
ML Architecture: Data Warehouse VS Data Lakehouse
Data Warehouse Data Lakehouse
Open Multi-Cloud Data Lakehouse and Feature Store
Collaborative Multi-Language Notebooks
← Full ML Lifecycle →
Model Tracking
and Registry
Model Training
and Tuning
Model Serving
and Monitoring
Automation and
Governance
Data Science and Machine Learning
A data-native and collaborative solution for the full ML lifecycle
What Does ML Need from a Lakehouse?
58
Access to Unstructured Data
• Images, text, audio, custom formats
• Libraries understand files, not tables
• Must scale to petabytes
Open Source Libraries
• OSS dominates ML tooling (Tensorflow, scikit-
learn, xgboost, R, etc)
• Must be able to apply these in Python, R
Specialized Hardware, Distributed Compute
• Scalability of algorithms
• GPUs, for deep learning
• Cloud elasticity to manage that cost!
Model Lifecycle Management
• Outputs are model artifacts
• Artifact lineage
• Productionization of model
Three Data Users
• SQL and BI tools
• Prepare and run reports
• Summarize data
• Visualize data
• (Sometimes) Big Data
• Data Warehouse data store
• R, SAS, some Python
• Statistical analysis
• Explain data
• Visualize data
• Often small data sets
• Database, data warehouse
data store; local files
Business Intelligence Data Science
• Python
• Deep learning and
specialized GPU hardware
• Create predictive models
• Deploy models to prod
• Often big data sets
• Unstructured data in files
Machine Learning
How Is ML Different?
• Operates on unstructured data like text
and images
• Can require learning from massive
data sets, not just analysis of a sample
• Uses open source tooling to
manipulate data as “DataFrames”
rather than with SQL
• Outputs are models rather than data or
reports
• Sometimes needs special hardware
MLOps and the Lakehouse
• Applying open tools in-place to data in
the lakehouse is a win for training
• Applying them for operating models is
important too!
• "Models are data too"
• Need to apply models to data
• MLFlow for MLOps on the lakehouse
• Track and manage model data,
lineage, inputs
• Deploy models as lakehouse "services"
Feature Stores for Model Inputs
• Tables are OK for managing model input
• Input often structured
• Well understood, easy to access
• … but not quite enough
• Upstream lineage: how were
features computed?
• Downstream lineage: where is the
feature used?
• Model caller has to read, feed inputs
• How to do (also) access in real
time?
SQL Analytics Workspace
Query data lake data using familiar ANSI SQL, and find and share new insights faster
with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
Databricks Workspaces: Queries
Provides a simplified control (which is SQL only) to query the data
Databricks Workspaces: Dashboards
A Databricks SQL dashboard lets you combine visualizations and text boxes that provide
context with your data.
Databricks Workspaces: Alerts
Alerts notify you when a field returned by a scheduled query meets a threshold.
Alerts complement scheduled queries, but their criteria are checked after every
execution.
Databricks Workspaces: Query History
The query history shows SQL queries performed using SQL endpoints.
Thank you

Mais conteúdo relacionado

Mais procurados

Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoDimko Zhluktenko
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseDatabricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceDATAVERSITY
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?confluent
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 

Mais procurados (20)

Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOps
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 

Semelhante a Databricks Platform.pptx

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikHostedbyConfluent
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Trivadis
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionMicrosoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionTravis Wright
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 

Semelhante a Databricks Platform.pptx (20)

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionMicrosoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
What Is Delta Lake ???
What Is Delta Lake ???What Is Delta Lake ???
What Is Delta Lake ???
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 

Último

Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 

Último (20)

Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 

Databricks Platform.pptx

  • 1. The Databricks Platform Introduction All your data, analytics and AI on one platform Alex Ivanichev March 2022
  • 2. DataBricks is a unified & open Data and Analytics Platform What is DataBricks ?
  • 3. Modern Data Teams 5 Data Engineers Data Scientists Data Analysts
  • 4. How the data management looks like today ?
  • 5. Data management complexity Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming decrease productivity Data Science and ML Data Analysts Data Engineers Data Engineers Disconnected systems and proprietary data formats make integration difficult Data Scientists Amazon Redshift Azure Synapse Snowflake SAP Teradata Google BigQuery IBM Db2 Oracle Autonomous Data Warehouse Hadoop Apache Airflow Apache Kafka Apache Spark Jupyter Amazon SageMaker Amazon EMR Apache Spark Apache Flink Amazon Kinesis Azure ML Studio MatLAB Google Dataproc Cloudera Azure Stream Analytics Google Dataflow Domino Data Labs SAS Tibco Spotfire Confluent TensorFlow PyTorch Extract Load Transform Real-time Database Analytics and BI Data marts Data prep Machine Learning Data Science Streaming Data Engine Data Lake Data Lake Data warehouse Structured, semi- structured and unstructured data Structured, semi- structured and unstructured data Structured data Streaming data sources 5
  • 7. Warehouses and lakes create complexity Two separate copies of the data Warehouses Proprietary Lakes Open Incompatible interfaces Warehouses SQL Lakes Python Incompatible security and governance models Warehouses Tables Lakes Files
  • 8. Data Warehouse Data Lake Streaming Analytics B I Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lakehouse One platform to unify all of your data, analytics, and AI workloads
  • 10. The data lakehouse offers a better path Data processing and management built on open source and open standards Common security, governance, and administration Modern Data Engineering Analytics and Data Warehousing Data Science and ML Integrated and collaborative role-based experiences with open API’s Cloud Data Lake Structured, semi-structured, and unstructured data Lake-first approach that builds upon where the freshest, most complete data resides AI/ML from the ground up High reliability and performance Single approach to managing data Support for all use cases on a single platform: • Data engineering • Data warehousing • Real time streaming • Data science and ML Built on open source and open standards Multi-cloud, work with your cloud of choice
  • 12. ©2021 Databricks Inc. — All rights r eserved An open approach to bringing data management and governance to data lakes Better reliability with transactions 48x faster data processing with indexing Data governance at scale with fine- grained access control lists Data Warehouse Data Lake
  • 13. What is Delta Lake? ● A open source project that enables building a Lakehouse architecture on top of data lakes. ● An storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines. ● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS. ● ACID Transactions ● Scalable Metadata Handling ● Time Travel (data versioning) ● Open Format ● Delta Lake change data feed ● Unified Batch and Streaming Source and Sink ● Schema Enforcement ● Schema Evolution ● Audit History ● Updates and Delete ● 100% Compatible with Apache Spark API ● Data Clean-up Key Features: https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
  • 14. Delta Lake solves challenges with data lakes RELIABILITY & QUALITY PERFORMANCE & LATENCY GOVERNANCE ACID transactions Advanced indexing & caching Governance with Data Catalogs
  • 15. Delta Lake key feature - ACID transaction ● Add File: It adds the data file ● Remove File: It removes the data file ● Update Metadata: It updates the table metadata. ● Set Transaction: It records that a structure streaming job created a micro-batch with ID ● Change Protocol: Makes more secure by transferring Delta Lakes to the latest securing protocol. ● Commit Info: It contains the information about the Commits.
  • 16. State Recomputing With Checkpoint Files ● Delta Lake automatically generates checkpoint files every 10 commits ● Delta Lake saves a checkpoint file in Parquet format in the same _delta_log subdirectory.
  • 17. Building the foundation of a Lakehouse Filtered, Cleaned, Augmented Business-level Aggregates Greatly improve the quality of your data for end users BRONZE SILVER GOLD Raw Ingestion and History Kinesis CSV, JSON, TXT… Data Lake Quality BI & Reporting Streaming Analytics Data Science & ML
  • 18. But the reality is not so simple Maintaining data quality and reliability at scale is complex and brittle CSV, JSON, TXT… Data Lake Kinesis BI & Reporting Streaming Analytics Data Science & ML
  • 19. Modern data engineering on the lakehouse Data Engineering on the Databricks Lakehouse Platform Open format storage Data transformation Scheduling & orchestration Automatic deployment & operations BI / Reporting Dashboarding Machine Learning / Data Science Data & ML Sharing Data Products Databases Streaming Sources Cloud Object Stores SaaS Applications NoSQL On-premises Systems Data Sources Data Consumers Observability, lineage, and end-to-end pipeline visibility Data quality management Data ingestion
  • 20. Data Science & Engineering Workspace
  • 21. Databricks Workspaces: Clusters It is a set of computation resources where a developer can run Data Analytics, Data Science, or Data Engineering workloads. The workloads can be executed in the form of a set of commands written in a notebook
  • 22. Databricks Workspaces: Notebooks It is a Web Interface where a developer can write and execute codes. Notebook contains a sequence of runnable cells that helps a developer to work with files, manipulate tables, create visualizations, and add narrative texts
  • 23. Databricks Workspaces: AutoLoader Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from Google Cloud Storage (GCS, gs://) in addition to Databricks File System (DBFS, dbfs:/) ** Supports: JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats. ** val checkpoint_path = "/tmp/delta/population_data/_checkpoints" val write_path = "/tmp/delta/population_data" // Set up the stream to begin reading incoming files from the // upload_path location. val df = spark.readStream.format("cloudFiles") .option("cloudFiles.format", "csv") .option("header", "true") .schema("city string, year int, population long") .load(upload_path) // Start the stream. // Use the checkpoint_path location to keep a record of all files that // have already been uploaded to the upload_path location. // For those that have been uploaded since the last check, // write the newly-uploaded files' data to the write_path location. df.writeStream.format("delta") .option("checkpointLocation", checkpoint_path) .start(write_path) https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html
  • 24. Databricks Workspaces: Jobs Jobs allow a user to run notebooks on a scheduled basis. It is a method of executing or automating specific tasks like ETL, Model Building, and more. The pipeline of the ML workflow can be organized into jobs so that it sequentially runs the series of steps one after another
  • 25. Databricks Workspaces:Delta Live Tables Delta Live Tables is a framework designed to enable declaratively define, deploy, test & upgrade data pipelines and eliminate operational burdens associated with the management of such pipelines.
  • 26. Databricks Workspaces: Repos To empower the process of ML application development, repo’s provide repository- level integration with Git-based hosting providers such as GitHub, GitLab, bitBucket, and Azure DevOps Developers can write code in a Notebook and Sync it with the hosting provider, allowing developers to clone, manage branches, push changes, pull changes, etc.
  • 27. Databricks Workspaces: Models It refers to a Developer’s ML Workflow Model registered in the MLflow Model Registry, a centralized model store that manages the entire life cycle of MLflow models. MLflow Model Registry provides all the information about modern lineage, model versioning, present condition, workflow, and stage transition (whether promoted to production or archived).
  • 28. Governance requirements for data are quickly evolving
  • 29. Governance is hard to enforce on data lakes 42 Cloud 2 Cloud 3 Structured Semi-structured Unstructured Streaming Cloud 1
  • 30. The problem is getting bigger Enterprises need a way to share and govern a wide variety of data products Files Dashboards Models Tables
  • 31. Unity Catalog for Lakehouse Governance • Centrally catalog, Search, and discover data and AI assets • Simplify governance with a unified Cross- cloud governance model • Easily integrate with your existing Enterprise Data Catalogs • Securely share live data across platforms with delta sharing
  • 32. Delta Sharing on Databricks Delta Lake Table Delta Sharing Server Delta Sharing Protocol Data Provider Data Recipient Any Sharing Client Access permissions
  • 34. ML Architecture: Data Warehouse VS Data Lakehouse Data Warehouse Data Lakehouse
  • 35. Open Multi-Cloud Data Lakehouse and Feature Store Collaborative Multi-Language Notebooks ← Full ML Lifecycle → Model Tracking and Registry Model Training and Tuning Model Serving and Monitoring Automation and Governance Data Science and Machine Learning A data-native and collaborative solution for the full ML lifecycle
  • 36. What Does ML Need from a Lakehouse? 58 Access to Unstructured Data • Images, text, audio, custom formats • Libraries understand files, not tables • Must scale to petabytes Open Source Libraries • OSS dominates ML tooling (Tensorflow, scikit- learn, xgboost, R, etc) • Must be able to apply these in Python, R Specialized Hardware, Distributed Compute • Scalability of algorithms • GPUs, for deep learning • Cloud elasticity to manage that cost! Model Lifecycle Management • Outputs are model artifacts • Artifact lineage • Productionization of model
  • 37. Three Data Users • SQL and BI tools • Prepare and run reports • Summarize data • Visualize data • (Sometimes) Big Data • Data Warehouse data store • R, SAS, some Python • Statistical analysis • Explain data • Visualize data • Often small data sets • Database, data warehouse data store; local files Business Intelligence Data Science • Python • Deep learning and specialized GPU hardware • Create predictive models • Deploy models to prod • Often big data sets • Unstructured data in files Machine Learning
  • 38. How Is ML Different? • Operates on unstructured data like text and images • Can require learning from massive data sets, not just analysis of a sample • Uses open source tooling to manipulate data as “DataFrames” rather than with SQL • Outputs are models rather than data or reports • Sometimes needs special hardware
  • 39. MLOps and the Lakehouse • Applying open tools in-place to data in the lakehouse is a win for training • Applying them for operating models is important too! • "Models are data too" • Need to apply models to data • MLFlow for MLOps on the lakehouse • Track and manage model data, lineage, inputs • Deploy models as lakehouse "services"
  • 40. Feature Stores for Model Inputs • Tables are OK for managing model input • Input often structured • Well understood, easy to access • … but not quite enough • Upstream lineage: how were features computed? • Downstream lineage: where is the feature used? • Model caller has to read, feed inputs • How to do (also) access in real time?
  • 41. SQL Analytics Workspace Query data lake data using familiar ANSI SQL, and find and share new insights faster with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
  • 42. Databricks Workspaces: Queries Provides a simplified control (which is SQL only) to query the data
  • 43. Databricks Workspaces: Dashboards A Databricks SQL dashboard lets you combine visualizations and text boxes that provide context with your data.
  • 44. Databricks Workspaces: Alerts Alerts notify you when a field returned by a scheduled query meets a threshold. Alerts complement scheduled queries, but their criteria are checked after every execution.
  • 45. Databricks Workspaces: Query History The query history shows SQL queries performed using SQL endpoints.