SlideShare a Scribd company logo
1 of 28
Download to read offline
Azure Data Lake Store
INGEST STORE PREP & TRAIN MODEL & SERVE
Modern Data Warehouse
Logs (unstructured)
Azure Data Factory
Azure Databricks
Media (unstructured)
Files (unstructured)
PolyBase
Business/custom apps
(structured)
Azure SQL Data
Warehouse
Azure Analysis
Services
Power BIData Lake
?
?
?
?
Why data lakes?
ETL pipeline
Dedicated ETL tools (e.g. SSIS)
Defined schema
Queries
Results
Relational
LOB
Applications
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
4
Store indefinitely Analyze See results
Gather data
from all sources
Iterate
New big data thinking: All data has value
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
5
Data Lake Store: Technical Requirements
6
Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance.
Must support multiple analytic frameworks—Batch, Real-time, Streaming, ML etc.
No one analytic framework can work for all data and all types of analysis.
Multiple analytic
frameworks
Details Must be able to store data with all details; aggregation may lead to loss of details.
Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark
Reliable Must be highly available and reliable (no permanent loss of data).
Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up
All sources Must be able ingest data from a variety of sources-LOB/ERP, Logs, Devices, Social NWs etc.
HDInsight
Azure Data Lake Analytics
Big Data analytics workloads
A highly scalable, distributed, parallel file system in the cloud
specifically designed to work with a variety of big data analytics workloads
Azure Data Lake Store
LOB Applications
Social
Devices
Clickstream
Sensors
Video
Web
Relational
Batch
MapReduce
Script
Pig
SQL
Hive
NoSQL
HBase
In-Memory
Spark
Predictive
R Server
Batch
U-SQL
Azure Data Lake Store
Overview
WebHDFS
YARN
U-SQL
(extensible by C#, R and Python)
Analytics HDInsight
Store
Azure Data Lake (Gen1)
Blob Storage +
(WASB)
Azure Data Lake
Store (ADLS)
Blob Storage +
(WASB)
Rich
Security
Speed to
Insight
Cost
Effectiveness
Scale and
Availability
Cost
Effectiveness
Scale and
Availability
Azure Data Lake Storage Gen2
Rich
Security
Speed to
Insight
Cost
Effectiveness
Scale and
Availability
Scalable, secure storage that
speeds time to insight
WASB ADLSWASB
Azure Data Lake Storage Gen2: Single Data Lake Store that combines the performance and
innovation of ADLS with the scale and rich feature set of Blob Storage
Azure Data Lake Storage Gen2 Key Features
Azure Data Lake Store Gen2 (Preview)
Demo:
Provisioning a Data Lake
Scale, Performance, &
Reliability
Azure Data Lake Store: no scale limits
Azure Data Lake Store integrates with
Azure Active Directory (AAD) for:
Amount of data stored
How long data can be stored
Number of files
Size of the individual files
Ingestion throughput
15
Seamlessly scales
from a few KBs
to several PBs
ADL Store Unlimited Scale – How it works
Each file in ADL Store is sliced into blocks
Blocks are distributed across multiple data
nodes in the backend storage system
With sufficient number of backend storage
data nodes, files of any size can be stored
Backend storage runs in the Azure cloud
which has virtually unlimited resources
Metadata is stored about each file
No limit to metadata either.
16
Azure Data Lake Store file
…Block 1 Block 2 Block 2
Backend Storage
Data node Data node Data node Data node Data nodeData node
Block Block Block Block Block Block
ADL Store offers massive throughput
Through read parallelism ADL Store provides
massive throughput
Each read operation on a ADL Store file results
in multiple read operations executed in
parallel against the backend storage data
nodes
Read operation
17
Azure Data Lake Store file
…Block 1 Block 2 Block 2
Backend storage
Data node Data node Data node Data node Data nodeData node
Block Block Block Block Block Block
ADL Store: high availability and reliability
Azure maintains 3 replicas of each data object
per region across three fault and upgrade
domains
Each create or append operation on a replica
is replicated to other two
Writes are committed to application only after
all replicas are successfully updated
Read operations can go against
any replica
Data is never lost or unavailable
even under failures
Replica 1
Replica 2 Replica 3
Fault/upgrade
domains
Write Commit
18
The building
blocks
Ingestion, processing,
egress, visualization,
and orchestration tools
Ingestion tools – Getting started
Data on your desktop Data located in other stores
Building pipelines - Management and orchestration
Out-of-the-box tools Custom tools
Demo:
Management
Security
Security features
Identity Management &
Authentication
Access Control & Authorization
Auditing
Data Protection & Encryption
ADL Store Security: AAD integration
Multi-factor authentication based on OAuth2.0
Integration with on-premises AD for federated authentication
Role-based access control
Privileged account management
Application usage monitoring and rich auditing
Security monitoring and alerting
Fine-grained ACLs for AD identities
25
WHAT does PolyBase do?
Load Data
Use external tables as
an ETL tool to cleanse
data before loading
Interactively Query
Analyze relational and
non-relational data
together
Age-out Data
Age data to HDFS or
Azure Storage as ‘cold’
but query-able.
Reads/Writes Hadoop
file formats
Text, CSV, RCFILE, ORC,
Parquet, Avro, etc.
Parallelizes Data
Transfers
Direct loads to service
compute nodes
Imposes structure on
semi-structured data
Define external tables
over data
HOW does it do it?
Demo:
Polybase
DROP EXTERNAL TABLE dbo.DimCurrency_external;
DROP EXTERNAL FILE FORMAT TSVFileFormat;
DROP EXTERNAL FILE FORMAT CSVFileFormat;
DROP EXTERNAL DATA SOURCE AzureDataLakeStore;
DROP DATABASE SCOPED CREDENTIAL ADLCredential;
-- Create a Database Master Key.
--CREATE MASTER KEY;
-- Create a database scoped credential
CREATE DATABASE SCOPED CREDENTIAL ADLCredential
WITH
IDENTITY = 'e785bd5xxxxxx-465xxxxxxxxxx36b@https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token',
--IDENTITY = 'app_id@oauth_token'
SECRET = 'eAMyYHQhgTn/aslkdfj28347skdjoe7512=sadf2=‘; --(This is not the real key!)
-- Create an external data source
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
TYPE = HADOOP,
LOCATION = 'adl://audreydatalake.azuredatalakestore.net',
CREDENTIAL = ADLCredential
);
-- Create an external file format
CREATE EXTERNAL FILE FORMAT CSVFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = ','
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);
CREATE EXTERNAL FILE FORMAT TSVFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = ''
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);
-- Create an External Table
CREATE EXTERNAL TABLE [dbo].[DimCurrency_external] (
[0] [int] NOT NULL,
[1] [varchar](5) NULL,
[2] [varchar](500) NULL
)
WITH
(
LOCATION='/Files/DimCurrency.csv'
, DATA_SOURCE = AzureDataLakeStore
, FILE_FORMAT = CSVFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
);
--Query the file from ADLS
SELECT [0] as CurrencyKey, [1] as CurrencyCode, [2] as CurrencyDescription
FROM [dbo].[DimCurrency_external];
--Load file directly into DW with CTAS
IF EXISTS (SELECT * FROM sys.tables WHERE name = 'stgCurrency')
BEGIN
DROP TABLE stgCurrency
END;
CREATE TABLE dbo.stgCurrency
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT [0] as CurrencyKey, [1] as CurrencyCode, [2] as CurrencyDescription
FROM dbo.DimCurrency_external;
SELECT * FROM stgCurrency;

More Related Content

What's hot

Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data FlowMark Kromer
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMark Kromer
 
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Edureka!
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiSlim Baltagi
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Introduction to Azure Data Factory
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data FactorySlava Kokaev
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationMatthew W. Bowers
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Michael Rys
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Mark Kromer
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Carole Gunst
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekMark Kromer
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data FactoryHARIHARAN R
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflakeSivakumar Ramar
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudDenodo
 
Migrating Your Databases to AWS - Tools and Services.pdf
Migrating Your Databases to AWS -  Tools and Services.pdfMigrating Your Databases to AWS -  Tools and Services.pdf
Migrating Your Databases to AWS - Tools and Services.pdfAmazon Web Services
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Cathrine Wilhelmsen
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
 

What's hot (20)

Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Introduction to Azure Data Factory
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data Factory
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar Presentation
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)Azure Data Factory Data Flows Training (Sept 2020 Update)
Azure Data Factory Data Flows Training (Sept 2020 Update)
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data Factory
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflake
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
 
Migrating Your Databases to AWS - Tools and Services.pdf
Migrating Your Databases to AWS -  Tools and Services.pdfMigrating Your Databases to AWS -  Tools and Services.pdf
Migrating Your Databases to AWS - Tools and Services.pdf
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
 
Azure Synapse Analytics
Azure Synapse AnalyticsAzure Synapse Analytics
Azure Synapse Analytics
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 

Similar to Data Analytics Meetup: Introduction to Azure Data Lake Storage

Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionMicrosoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionTravis Wright
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Samedi SQL Québec - La plateforme data de Azure
Samedi SQL Québec - La plateforme data de AzureSamedi SQL Québec - La plateforme data de Azure
Samedi SQL Québec - La plateforme data de AzureMSDEVMTL
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesCCG
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?gvernik
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it bettergvernik
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeRick van den Bosch
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoDimko Zhluktenko
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 

Similar to Data Analytics Meetup: Introduction to Azure Data Lake Storage (20)

Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive sessionMicrosoft ignite 2018 SQL server 2019 big data clusters - deep dive session
Microsoft ignite 2018 SQL server 2019 big data clusters - deep dive session
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Samedi SQL Québec - La plateforme data de Azure
Samedi SQL Québec - La plateforme data de AzureSamedi SQL Québec - La plateforme data de Azure
Samedi SQL Québec - La plateforme data de Azure
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data Services
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 

More from CCG

Introduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & DatabricksIntroduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & DatabricksCCG
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopCCG
 
Data Governance Workshop
Data Governance WorkshopData Governance Workshop
Data Governance WorkshopCCG
 
How to Monetize Your Data Assets and Gain a Competitive Advantage
How to Monetize Your Data Assets and Gain a Competitive AdvantageHow to Monetize Your Data Assets and Gain a Competitive Advantage
How to Monetize Your Data Assets and Gain a Competitive AdvantageCCG
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopCCG
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopCCG
 
How to Create a Data Analytics Roadmap
How to Create a Data Analytics RoadmapHow to Create a Data Analytics Roadmap
How to Create a Data Analytics RoadmapCCG
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopCCG
 
Power BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual WorkshopPower BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual WorkshopCCG
 
Machine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual WorkshopMachine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual WorkshopCCG
 
Artificial Intelligence Executive Brief
Artificial Intelligence Executive BriefArtificial Intelligence Executive Brief
Artificial Intelligence Executive BriefCCG
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Virtual Governance in a Time of Crisis Workshop
Virtual Governance in a Time of Crisis WorkshopVirtual Governance in a Time of Crisis Workshop
Virtual Governance in a Time of Crisis WorkshopCCG
 
Advance Data Visualization and Storytelling Virtual Workshop
Advance Data Visualization and Storytelling Virtual WorkshopAdvance Data Visualization and Storytelling Virtual Workshop
Advance Data Visualization and Storytelling Virtual WorkshopCCG
 
Azure Fundamentals Part 3
Azure Fundamentals Part 3Azure Fundamentals Part 3
Azure Fundamentals Part 3CCG
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Power BI Advance Modeling
Power BI Advance ModelingPower BI Advance Modeling
Power BI Advance ModelingCCG
 
Azure Fundamentals Part 2
Azure Fundamentals Part 2Azure Fundamentals Part 2
Azure Fundamentals Part 2CCG
 
Shape Your Data into a Data Model with M
Shape Your Data into a Data Model with MShape Your Data into a Data Model with M
Shape Your Data into a Data Model with MCCG
 
Azure Fundamentals Part 1
Azure Fundamentals Part 1Azure Fundamentals Part 1
Azure Fundamentals Part 1CCG
 

More from CCG (20)

Introduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & DatabricksIntroduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & Databricks
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual Workshop
 
Data Governance Workshop
Data Governance WorkshopData Governance Workshop
Data Governance Workshop
 
How to Monetize Your Data Assets and Gain a Competitive Advantage
How to Monetize Your Data Assets and Gain a Competitive AdvantageHow to Monetize Your Data Assets and Gain a Competitive Advantage
How to Monetize Your Data Assets and Gain a Competitive Advantage
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual Workshop
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual Workshop
 
How to Create a Data Analytics Roadmap
How to Create a Data Analytics RoadmapHow to Create a Data Analytics Roadmap
How to Create a Data Analytics Roadmap
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual Workshop
 
Power BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual WorkshopPower BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual Workshop
 
Machine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual WorkshopMachine Learning with Azure and Databricks Virtual Workshop
Machine Learning with Azure and Databricks Virtual Workshop
 
Artificial Intelligence Executive Brief
Artificial Intelligence Executive BriefArtificial Intelligence Executive Brief
Artificial Intelligence Executive Brief
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Virtual Governance in a Time of Crisis Workshop
Virtual Governance in a Time of Crisis WorkshopVirtual Governance in a Time of Crisis Workshop
Virtual Governance in a Time of Crisis Workshop
 
Advance Data Visualization and Storytelling Virtual Workshop
Advance Data Visualization and Storytelling Virtual WorkshopAdvance Data Visualization and Storytelling Virtual Workshop
Advance Data Visualization and Storytelling Virtual Workshop
 
Azure Fundamentals Part 3
Azure Fundamentals Part 3Azure Fundamentals Part 3
Azure Fundamentals Part 3
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Power BI Advance Modeling
Power BI Advance ModelingPower BI Advance Modeling
Power BI Advance Modeling
 
Azure Fundamentals Part 2
Azure Fundamentals Part 2Azure Fundamentals Part 2
Azure Fundamentals Part 2
 
Shape Your Data into a Data Model with M
Shape Your Data into a Data Model with MShape Your Data into a Data Model with M
Shape Your Data into a Data Model with M
 
Azure Fundamentals Part 1
Azure Fundamentals Part 1Azure Fundamentals Part 1
Azure Fundamentals Part 1
 

Recently uploaded

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 

Recently uploaded (20)

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 

Data Analytics Meetup: Introduction to Azure Data Lake Storage

  • 2. INGEST STORE PREP & TRAIN MODEL & SERVE Modern Data Warehouse Logs (unstructured) Azure Data Factory Azure Databricks Media (unstructured) Files (unstructured) PolyBase Business/custom apps (structured) Azure SQL Data Warehouse Azure Analysis Services Power BIData Lake
  • 4. ETL pipeline Dedicated ETL tools (e.g. SSIS) Defined schema Queries Results Relational LOB Applications Traditional business analytics process 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create a Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (‘schema-on-write’) 5. Create reports. Analyze data All data not immediately required is discarded or archived 4
  • 5. Store indefinitely Analyze See results Gather data from all sources Iterate New big data thinking: All data has value All data has potential value Data hoarding No defined schema—stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit 5
  • 6. Data Lake Store: Technical Requirements 6 Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place). Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance. Must support multiple analytic frameworks—Batch, Real-time, Streaming, ML etc. No one analytic framework can work for all data and all types of analysis. Multiple analytic frameworks Details Must be able to store data with all details; aggregation may lead to loss of details. Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark Reliable Must be highly available and reliable (no permanent loss of data). Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up All sources Must be able ingest data from a variety of sources-LOB/ERP, Logs, Devices, Social NWs etc.
  • 7. HDInsight Azure Data Lake Analytics Big Data analytics workloads A highly scalable, distributed, parallel file system in the cloud specifically designed to work with a variety of big data analytics workloads Azure Data Lake Store LOB Applications Social Devices Clickstream Sensors Video Web Relational Batch MapReduce Script Pig SQL Hive NoSQL HBase In-Memory Spark Predictive R Server Batch U-SQL
  • 8. Azure Data Lake Store Overview
  • 9. WebHDFS YARN U-SQL (extensible by C#, R and Python) Analytics HDInsight Store Azure Data Lake (Gen1)
  • 10. Blob Storage + (WASB) Azure Data Lake Store (ADLS) Blob Storage + (WASB) Rich Security Speed to Insight Cost Effectiveness Scale and Availability Cost Effectiveness Scale and Availability Azure Data Lake Storage Gen2 Rich Security Speed to Insight Cost Effectiveness Scale and Availability Scalable, secure storage that speeds time to insight WASB ADLSWASB Azure Data Lake Storage Gen2: Single Data Lake Store that combines the performance and innovation of ADLS with the scale and rich feature set of Blob Storage
  • 11. Azure Data Lake Storage Gen2 Key Features
  • 12. Azure Data Lake Store Gen2 (Preview)
  • 15. Azure Data Lake Store: no scale limits Azure Data Lake Store integrates with Azure Active Directory (AAD) for: Amount of data stored How long data can be stored Number of files Size of the individual files Ingestion throughput 15 Seamlessly scales from a few KBs to several PBs
  • 16. ADL Store Unlimited Scale – How it works Each file in ADL Store is sliced into blocks Blocks are distributed across multiple data nodes in the backend storage system With sufficient number of backend storage data nodes, files of any size can be stored Backend storage runs in the Azure cloud which has virtually unlimited resources Metadata is stored about each file No limit to metadata either. 16 Azure Data Lake Store file …Block 1 Block 2 Block 2 Backend Storage Data node Data node Data node Data node Data nodeData node Block Block Block Block Block Block
  • 17. ADL Store offers massive throughput Through read parallelism ADL Store provides massive throughput Each read operation on a ADL Store file results in multiple read operations executed in parallel against the backend storage data nodes Read operation 17 Azure Data Lake Store file …Block 1 Block 2 Block 2 Backend storage Data node Data node Data node Data node Data nodeData node Block Block Block Block Block Block
  • 18. ADL Store: high availability and reliability Azure maintains 3 replicas of each data object per region across three fault and upgrade domains Each create or append operation on a replica is replicated to other two Writes are committed to application only after all replicas are successfully updated Read operations can go against any replica Data is never lost or unavailable even under failures Replica 1 Replica 2 Replica 3 Fault/upgrade domains Write Commit 18
  • 19. The building blocks Ingestion, processing, egress, visualization, and orchestration tools
  • 20. Ingestion tools – Getting started Data on your desktop Data located in other stores
  • 21. Building pipelines - Management and orchestration Out-of-the-box tools Custom tools
  • 24. Security features Identity Management & Authentication Access Control & Authorization Auditing Data Protection & Encryption
  • 25. ADL Store Security: AAD integration Multi-factor authentication based on OAuth2.0 Integration with on-premises AD for federated authentication Role-based access control Privileged account management Application usage monitoring and rich auditing Security monitoring and alerting Fine-grained ACLs for AD identities 25
  • 26. WHAT does PolyBase do? Load Data Use external tables as an ETL tool to cleanse data before loading Interactively Query Analyze relational and non-relational data together Age-out Data Age data to HDFS or Azure Storage as ‘cold’ but query-able. Reads/Writes Hadoop file formats Text, CSV, RCFILE, ORC, Parquet, Avro, etc. Parallelizes Data Transfers Direct loads to service compute nodes Imposes structure on semi-structured data Define external tables over data HOW does it do it?
  • 28. DROP EXTERNAL TABLE dbo.DimCurrency_external; DROP EXTERNAL FILE FORMAT TSVFileFormat; DROP EXTERNAL FILE FORMAT CSVFileFormat; DROP EXTERNAL DATA SOURCE AzureDataLakeStore; DROP DATABASE SCOPED CREDENTIAL ADLCredential; -- Create a Database Master Key. --CREATE MASTER KEY; -- Create a database scoped credential CREATE DATABASE SCOPED CREDENTIAL ADLCredential WITH IDENTITY = 'e785bd5xxxxxx-465xxxxxxxxxx36b@https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token', --IDENTITY = 'app_id@oauth_token' SECRET = 'eAMyYHQhgTn/aslkdfj28347skdjoe7512=sadf2=‘; --(This is not the real key!) -- Create an external data source CREATE EXTERNAL DATA SOURCE AzureDataLakeStore WITH ( TYPE = HADOOP, LOCATION = 'adl://audreydatalake.azuredatalakestore.net', CREDENTIAL = ADLCredential ); -- Create an external file format CREATE EXTERNAL FILE FORMAT CSVFileFormat WITH ( FORMAT_TYPE = DELIMITEDTEXT , FORMAT_OPTIONS ( FIELD_TERMINATOR = ',' , STRING_DELIMITER = '' , DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff' , USE_TYPE_DEFAULT = FALSE ) ); CREATE EXTERNAL FILE FORMAT TSVFileFormat WITH ( FORMAT_TYPE = DELIMITEDTEXT , FORMAT_OPTIONS ( FIELD_TERMINATOR = '' , STRING_DELIMITER = '' , DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff' , USE_TYPE_DEFAULT = FALSE ) ); -- Create an External Table CREATE EXTERNAL TABLE [dbo].[DimCurrency_external] ( [0] [int] NOT NULL, [1] [varchar](5) NULL, [2] [varchar](500) NULL ) WITH ( LOCATION='/Files/DimCurrency.csv' , DATA_SOURCE = AzureDataLakeStore , FILE_FORMAT = CSVFileFormat , REJECT_TYPE = VALUE , REJECT_VALUE = 0 ); --Query the file from ADLS SELECT [0] as CurrencyKey, [1] as CurrencyCode, [2] as CurrencyDescription FROM [dbo].[DimCurrency_external]; --Load file directly into DW with CTAS IF EXISTS (SELECT * FROM sys.tables WHERE name = 'stgCurrency') BEGIN DROP TABLE stgCurrency END; CREATE TABLE dbo.stgCurrency WITH (DISTRIBUTION = ROUND_ROBIN) AS SELECT [0] as CurrencyKey, [1] as CurrencyCode, [2] as CurrencyDescription FROM dbo.DimCurrency_external; SELECT * FROM stgCurrency;

Editor's Notes

  1. All data has immediate or potential value This leads to data hoarding—all data is stored indefinitely With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
  2. Azure Data Factory First-class support for ADL Store Support variety of endpoints WASB, OnPrem, Relational DB Integrated with Analytic tools Programmatic customization Azure Stream Analytics Stream data from Eventuhubs into ADL store OSS Tools Use Oozie & Falcon on HDInsight to manage tools like DistCP and Sqoop Use Storm to stream data from Eventhubs/Kafka into ADL Store PowerShell Use built-in commandlets Use PowerShell Workflow Runbooks to manage Use PowerShell Script Runbooks to manage ADL Store SDK Available in various languages (.NET, Java, Node.js, ..) REST APIs For unsupported languages and platforms
  3. Security imagery.