SlideShare uma empresa Scribd logo
1 de 62
Differentiate Big Data vs
Data Warehouse use
cases for a cloud solution
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
Blog: JamesSerra.com
About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
Data Lake Driven Analytics
Compute technologies
Patterns
Observation
Pattern
Theory
Hypothesis
What will
happen?
How can we
make it happen?
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did
it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Data sources
ETL
BI and analytic
Data warehouse
Gather
Requirements
Business
Requirements
Technical
Requirements
Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Needs data governance so your data lake does not turn
into a data swamp!
Considering Data Types
Audio, video, images. Meaningless
without adding some structure
Unstructured
JSON, XML, sensor data, social media,
device data, web logs. Flexible data
model structure
Semi-Structured
Structured CSV, Columnar Storage (Parquet,
ORC). Strict data model structure
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
What happened?
What is happening?
Why did it happen?
What are key
relationships?
What will happen?
What if?
How risky is it?
What should happen?
What is the best option?
How can I optimize?
Data sources
Data Lake Data Warehouse
Schema-on-read Schema-on-write
Physical collection of uncurated data Data of common meaning
System of Insight: Unknown data to do
experimentation / data discovery
System of Record: Well-understood data to do
operational reporting
Any type of data Limited set of data types (ie. relational)
Skills are limited Skills mostly available
All workloads – batch, interactive, streaming,
machine learning
Optimized for interactive querying
Complementary to DW Can be sourced from Data Lake
Data Warehouse
Serving, Security & Compliance
Low latency
Interactive ad-hoc query
High number of users
Additional security
Large support for tools
Easily create reports (Self-service BI)
A data lake is just a glorified file folder with
data files in it – how many end-users can
accurately create reports from it?
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
Azure
HDInsight
Hadoop and Spark
as a Service on Azure
Fully-managed Hadoop and Spark
for the cloud
100% Open Source Hortonworks
data platform
Clusters up and running in minutes
Managed, monitored and supported
by Microsoft with the industry’s best SLA
Familiar BI tools for analysis, or open source
notebooks for interactive data science
63% lower TCO than deploy your own
Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure
Data Lake Analytics
A new distributed
analytics service
Distributed analytics service built on
Apache YARN
Elastic scale per query lets users focus on
business goals—not configuring hardware
Includes U-SQL—a language that unifies the
benefits of SQL with the expressive
power of C#
Integrates with Visual Studio to develop,
debug, and tune code faster
Federated query across Azure data sources
Enterprise-grade role based access control
CONTROL EASE OF USE
Azure Data Lake
Analytics
Azure Data Lake Store
Azure Storage
Any Hadoop technology,
any distribution
Workload optimized,
managed clusters
Data Engineering in a
Job-as-a-service model
Azure Marketplace
HDP | CDH | MapR
Azure Data Lake
Analytics
IaaS Clusters Managed Clusters Big Data as-a-service
Azure HDInsight
Frictionless & Optimized
Spark clusters
Azure Databricks
BIGDATA
STORAGE
BIGDATA
ANALYTICS
ReducedAdministration
K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S
Azure SQL Data Warehouse
A relational data warehouse-as-a-service, fully managed by Microsoft.
Industries first elastic cloud data warehouse with enterprise-grade capabilities.
Support your smallest to your largest data storage needs while handling queries up to 100x faster.
Compute
Storage
$$$
Compute
Control
Remote
Storage
Intelligent
Cache
Azure Analysis Services
Why extend the data warehouse?
Semantic layer
Handle many concurrent users
Aggregating data for performance
Multidimensional analysis
No joins or relationships
Hierarchies, KPI’s
Row-level security
Advanced time-calculations
Slowly Changing Dimensions (SCD)
Azure SQL DW HDInsight Hive LLAP HDInsight Spark ADLS/ADLA SQL Server (IaaS)
Volume Petabytes Petabytes Petabytes Petabytes Terabytes
Security Encryption, TD,
Audit
ADLS / Apache
Ranger
ADLS AAD Security
Groups (data)
Encryption, TD Audit
Languages T-SQL HiveQL SparkSQL,
HiveQL, Scala,
Java, Python, R
U-SQL T-SQL
Extensibility No Yes, .NET/SerDe Yes, Packages Yes, .NET Yes, .NET CLR
External File
Types
ORC, TXT,
Parquet, RCFile
ORC, CSV, Parquet +
others
Parquet, JSON,
Hive + others
Many ORC, TXT, Parquet,
RCFile
Admin Low-Medium Medium-High Medium-High Low High
Cost Model DWU Nodes & VM Nodes & VM Units/Jobs VM
Schema
Definition
Schema on Write
/ Polybase
Schema on Read Schema on Read Schema on Read Schema on Write /
Polybase
Max DB Size Unlimited CCI
240TB Comp (5X =
1PB) index/heaps
Unlimited 256TB (64 4TB
drives)
Microsoft Products vs Hadoop/OSS Products
Note: Many of the Hadoop/OSS products are available in Azure
Microsoft Product Hadoop/Open Source Software Product
Office365/Excel OpenOffice/Calc
Cosmos DB MongoDB, MarkLogic, HBase, Cassandra
SQL Database SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite
Azure Data Lake Analytics/YARN None
Azure VM/IaaS OpenStack
Blob Storage HDFS, Ceph
Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion
Event Hub Apache Kafka
Azure Stream Analytics Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron
Power BI Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana
HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay)
Azure ML (Machine Learning) Apache Mahout, Apache Spark MLib, Apache PredictionIO
Microsoft R Open R
SQL Data Warehouse/Interactive queries Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala
IoT Hub Apache NiFi
Azure Data Factory Apache Falcon, Airbnb Airflow, Apache Oozie, Apache Azkaban
Azure Data Lake Storage/WebHDFS HDFS Ozone
Azure Analysis Services/SSAS Apache Kylin, Apache Druid, AtScale (pay)
SQL Server Reporting Services None
Hadoop Indexes Jethro Data (pay)
Azure Data Catalog Apache Atlas
PolyBase Apache Drill
Azure Search Apache Solr, Apache ElasticSearch (Azure Search build on ES)
SQL Server Integration Services (SSIS) Talend Open Studio, Pentaho Data Integration
Others
Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for
Hadoop clusters), Apache Flume (collecting log data)
Questions to ask client
• Can you use the cloud?
• Is this a new solution or a migration?
• Do the developers have Hadoop skills?
• Will you use non-relational data (variety)?
• How much data do you need to store (volume)?
• Is this an OLTP or OLAP/DW solution?
• Will you have streaming data (velocity)?
• Will you use dashboards?
• How fast do the operational reports need to run?
• Will you do predictive analytics?
• Do you want to use Microsoft tools or open source?
• What are your high availability and/or disaster recovery requirements?
• Do you need to master the data (MDM)?
• Are there any security limitations with storing data in the cloud?
• Does this solution require 24/7 client access?
• How many concurrent users will be accessing the solution at peak-time and on average?
• What is the skill level of the end users?
• What is your budget and timeline?
• Is the source data cloud-born and/or on-prem born?
• How much daily data needs to be imported into the solution?
• What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)?
• Are you ok with using products that are in preview?
Advanced Analytics
Social
LOB
Graph
IoT
Image
CRM
INGEST STORE PREP & TRAIN MODEL & SERVE
Data orchestration
and monitoring
Big data store Hadoop/Spark and
machine learning
Data warehouse
Your data hub for analytics
Cloud Bursting
BI + Reporting
Azure Data Factory Azure Blob Storage Azure Databricks
Azure Data Lake
Azure HDInsight
Azure Machine Learning
Machine Learning Server
Azure SQL Data Warehouse
Azure Analysis Services
INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Blob Storage
Analytical
dashboards
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Azure Data Factory
Azure Analysis
Services
Azure Data Factory
PolyBase
INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Data Lake Store
Analytical
dashboards
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Tableau
Server
PolyBase
Operational
Reports
Ad-Hoc
Query
Azure SQL
Database
Azure Data Lake Store Azure Blob Storage
Purpose Optimized storage for big data analytics
workloads
General purpose object storage for a wide variety of
storage scenarios including big data analytics
Use Cases Batch, interactive, streaming analytics and
machine learning data such as log files, IoT data,
click streams, large datasets
Any type of text or binary data, such as application back
end, backup data, media storage for streaming and general
purpose data as well as big data analytics
Units of Storage Accounts / Folders / Files Accounts / Containers / Blobs
Structure Hierarchical File System Object Store with flat namespace
REST API WebHDFS-compatible Azure Blob Storage, compatible HDFS via WASB driver
Security Azure Active Directory (AAD) Shared Access Signature (SAS) keys
Authorization POSIX Access Control Lists (ACLs) Account-level: Account Access Keys;
Account, container, or blob authorization: SAS keys
Account/File Size Limits No limits on account size or file size 5PB account/4.75TB file
Single Object/Account
Throughput Limit
Extremely high 2GB/s, or 50k tps (now stripe across multiple hard
drives)/50GBs bandwidth
Geo-Replications LRS LRS, ZRS, GRS, RA-GRS
Cost/Month [1PB, East US 2] No tiering: $39k + Transactions Tiering: Hot $18k, Cool $10k, Archive $2k + Trans
Product integration/Tooling Check Check
Region Availability Two US regions (East, Central) & North Europe All Azure Regions
ADL Store and Blob Store
INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Blob Storage
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Azure Data Factory
Azure Data Factory
Azure Databricks
Azure HDInsight
Data Lake Analytics
Analytical
dashboards
PolyBase
Business/custom apps
(Structured) Azure Analysis
Services
INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Blob Storage
Analytical
dashboards
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Azure Data Factory
Azure Data Factory
Azure Machine Learning
Machine Learning Server
PolyBase
Azure Databricks
Azure HDInsight
Azure Data Lake Analytics
Azure Analysis
Services
INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Blob Storage
PolyBase
Analytical
dashboards
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Azure Data Factory
Azure Machine Learning
Machine Learning Server
Sensors and IoT
(unstructured)
Azure HDInsight (Kafka)
Azure IoT Hub
Azure Databricks
Azure HDInsight
Azure Data Lake Analytics
Azure Analysis
Services
https://aka.ms/ADAG
Azure 10MTC Demo scenario (Connected Retail): aka.ms/a10, under Modern Data Warehouse
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
Who manages what?
Infrastructure
as a Service
Storage
Servers
Networking
O/S
Middleware
Virtualization
Data
Applications
Runtime
ManagedbyMicrosoft
Youscale,make
resilient&manage
Platform
as a Service
Scale,Resilienceand
managementbyMicrosoft
Youmanage
Storage
Servers
Networking
O/S
Middleware
Virtualization
Applications
Runtime
Data
On Premises
Physical / Virtual
Youscale,makeresilientandmanage
Storage
Servers
Networking
O/S
Middleware
Virtualization
Data
Applications
Runtime
Software
as a Service
Storage
Servers
Networking
O/S
Middleware
Virtualization
Applications
Runtime
Data
Scale,Resilienceand
managementbyMicrosoft
Windows Azure
Virtual Machines
Windows Azure
Cloud Services
Data management for analytics
at any stage
Query historical,
relational data from a
variety of sources
STAGE 2:
Operational
STAGE 1:
Traditional Gain real-time insights
without impacting
performance
Ask questions of big
data—all types,
volumes and
locations
STAGE 4:
Free-form
STAGE 3:
Logical Establish enterprise-
wide data lake and run
advanced analytics
and deep learning on
unstructured data that
arrives in real-time
Microsoft data platform solutions (partial list)
Product Category Description More Info
SQL Server 2017 RDBMS Earned top spot in Gartner’s Operational Database magic
quadrant. JSON support. Linux support
https://www.microsoft.com/en-us/server-
cloud/products/sql-server-2017/
SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly. Has
built-in high availability and disaster recovery. JSON support.
https://azure.microsoft.com/en-
us/services/sql-database/
SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data. Provision
and scale quickly. Can pause service to reduce cost
https://azure.microsoft.com/en-
us/services/sql-data-warehouse/
Azure Data Lake Store/Blob
Storage
Hadoop storage Removes the complexities of ingesting and storing all of your
data while making it faster to get up and running with batch,
streaming, and interactive analytics
https://azure.microsoft.com/en-
us/services/data-lake-store/
https://azure.microsoft.com/en-
us/services/storage/blobs
HDInsight PaaS Hadoop
compute/Hadoop
clusters-as-a-service
A managed Apache Hadoop, Spark, R Server, HBase, Kafka,
Interactive Query (Hive LLAP) and Storm cloud service made
easy
https://azure.microsoft.com/en-
us/services/hdinsight/
Azure Databricks PaaS Spark clusters A fast, easy, and collaborative Apache Spark based analytics
platform optimized for Azure
https://databricks.com/azure
Azure Data Lake Analytics On-demand analytics job
service/Big Data-as-a-
service
Cloud-based service that dynamically provisions resources so
you can run queries on exabytes of data. Includes U-SQL, a
new big data query language
https://azure.microsoft.com/en-
us/services/data-lake-analytics/
Azure Analysis Services Online analytical
processing/PaaS
OLAP engine used in decision support and business analytics,
providing the analytical data for business reports and client
apps
https://azure.microsoft.com/en-
us/services/analysis-services/
Azure Cosmos DB PaaS NoSQL: Key-value,
Column-family,
Document, Graph
Globally distributed, massively scalable, multi-model, multi-API,
low latency data service – which can be used as an operational
database or a hot data lake
https://azure.microsoft.com/en-
us/services/cosmos-db/
Azure Database for PostgreSQL,
MySQL, and MariaDB
RDBMS/DBaaS A fully managed database service for app developers https://azure.microsoft.com/en-
us/services/postgresql
ExpressRoute
We guarantee 99.95% ExpressRoute dedicated circuit availability
(3.6TB/hour)
(18TB/hour)
(36TB/hour)
(7.2TB/hour)
Data Volume
Low
High
Latency
Cost/GB
High
Low
Request Rate
Structure
High
Low
SQL
(SQL DB, SQL DW)
NoSQL
(DocDB)
HDFS
(HDP, Cloudera)
ADLS
Hot Data Cold Data
Cache
Data Lake Store: Technical Requirements
50
Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance.
Low latency Must have low latency for high-frequency operations.
Must support multiple analytic frameworks—Batch, Real-time, Streaming, ML etc.
No one analytic framework can work for all data and all types of analysis.
Multiple analytic
frameworks
Details Must be able to store data with all details; aggregation may lead to loss of details.
Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark
Reliable Must be highly available and reliable (no permanent loss of data).
Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up
All sources Must be able ingest data from a variety of sources-LOB/ERP, Logs, Devices, Social NWs etc.
50 TB
100 TB
500 TB
10 TB
5 PB
1.000
100
10.000
3-5 Way
Joins
 Joins +
 OLAP operations +
 Aggregation +
 Complex “Where”
constraints +
 Views
 Parallelism
5-10 Way
Joins
Normalized
Multiple, Integrated
Stars and Normalized
Simple
Star
Multiple,
Integrated
Stars
TB’s
MB’s
GB’s
Batch Reporting,
Repetitive Queries
Ad Hoc Queries
Data Analysis/Mining
Near Real Time
Data Feeds
Daily
Load
Weekly
Load
Strategic, Tactical
Strategic
Strategic, Tactical
Loads
Strategic, Tactical
Loads, SLA
“Query Freedom“
“Query complexity“
“Data
Freshness”
“Query Data Volume“
“Query Concurrency“
“Mixed
Workload”
“Schema Sophistication“
“Data Volume”
DW SCALABILITY SPIDER CHART
MPP – Multidimensional
Scalability
SMP – Tunable in one dimension
on cost of other dimensions
The spiderweb depicts
important attributes to
consider when evaluating
Data Warehousing options.
Big Data support is newest
dimension.
HiveQL T-SQL
Updates INSERT UPDATE, INSERT, DELETE
Transactions Supported Supported
Index Types Compact, Bitmap Clustered, Non-Clustered, Columnar
Latency Minutes Seconds
Data Types + map, struct, array Int, float, boolean, char, binary..
Functions Dozens of built-in functions Hundreds of built-in functions
Multi-Table Inserts Supported Not Supported
Partitioning Supported Supported
Multi-Level Partitioning Supported Not Supported
Views Read-Only Read-Only and Updateable
Extensibility UDF’s, MapReduce Scripts UDF’s, Stored Procedures
Want
Hadoop?
Need exact
same on-
prem
Need
interactive /
streaming?
Mandatory
No strong opinion
Azure Marketplace (IaaS)
• Need all workloads exactly like on-
premises
• Need 100% Hortonworks/Cloudera/MapR
Azure HDInsight
• Most Hadoop workloads
• Fully managed by Microsoft
• Sell HDI + ADLS
• Stickier to Microsoft than VMs
• Can do interactive (Spark) and streaming
(Storm/Spark)
Azure Data Lake Analytics
• Easiest experience for admin: no sense of
clusters, instant scale per job
• Easiest experience for developers: Visual
Studio/U-SQL (C#+SQL)
• Sell ADLA + ADLS
• Batch workloads only
Need everything exactly
like on-prem
Need core
projects Yes Batch is OK
Always present
ADLA if .NET or
Visual Studio Shop
If .NET or
VS shop?
HDInsight vs HDP on Azure VM
HDInsight HDP on Azure VM
PaaS (setup, scale, manage, patch, etc) IaaS
Managed by Microsoft Managed by customer
Storage separate (Blob or ADLS) Storage in VM (local disk), but can also
have storage in Azure blob or ADLS
Delete VM keeps data Delete VM deletes data (unless external)
Up to 30-days behind latest HDP version Latest HDP Version
Limited Hadoop projects Unlimited Hadoop projects
Microsoft supports VM and Hadoop Microsoft: VM, HDP: Hadoop
No on-prem version On-prem version
PolyBase use cases
Ingestion Batch Presentation
Company C
Event hubs
Machine
Learning
Flatten &
Metadata Join
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
Machine
Learning Azure SQL
Data Warehouse
Power BI
INGEST PREPARE ANALYZE PUBLISH
ASA Job Rule #2
CONSUMEDATA SOURCES
Cortana
Web/LOB
Dashboards
On Premise
Hot Path
Cold Path
Archived
Data
Data Lake
Store
Simulated Sensors
and devices
Blobs –
Reference Data
Event hubs ASA Job Rule #1
Event hubs
Real-time Scoring
Aggregated Data
Data Lake
Store
CSV Data
Data Lake
Store
Data Lake
Analytics
Batch Scoring
Offline Training
Hourly, Daily,
Monthly Roll Ups
Ingestion
Batch
PresentationSpeed
SQL Server
R Services
Linux
Hadoop Teradata
Windows
CommercialCommunity
R ServerR Open
R vs Azure ML
R Open is Microsoft’s open source version of R that (I believe) adds a few additional capabilities but
largely mirrors the existing open source R.
Microsoft R Server (MRS) adds additional capability that is not available in open source R include R
Scale for large scale deployment of R jobs on clusters (i.e. HDInsight) and Microsoft ML (MML) which
is a library of Microsoft’s best-in-breed ML algorithms available from within MRS only.
SQL Server R is a SQL product that adds the ability to apply R functions/algorithms to data operations
performed from within SQL server.
Azure ML is a GUI-based product for building ML experiments and web services which includes
access to most of the same underlying algorithms available programmatically in MML inside MRS.
R Server Technology
ConnectR
• High-speed & direct
connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed
format text data files
• Hadoop HDFS (text & XDF)
• Hive
• Parquet
• Teradata Database (TPT)
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-Use high-performance
big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms
across nodes
• Wide data sets supported – thousands of variables
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• R 3.3.2
• Freely-available huge range of R
algorithms
• Algorithms callable by MSR
• Embeddable in R scripts
• 100% Compatible with existing R scripts,
functions and packages
RevoR
• Performance enhanced R
interpreter
• Based on open source R
• Adds high-performance
math library to speed up
linear algebra functions
Big Data In-memory bound In-memory bound
Operates on large volumes when connected to
R Server
Hybrid memory & disk scalability
Operates on bigger volumes & factors
Speed of Analysis Multiple-threaded
when MKL installed
Same as R Open for non-ScaleR functions;
Up to 2 threads for ScaleR functions when
compute locally;
Parallel threading and Parallel Processing
Enterprise
Readiness
Community
support
Commercial support Commercial support
Analytic Breadth &
Depth
7000+ innovative
analytic packages
Leverage and optimize open source packages
plus Big Data ready packages (ScaleR APIs)
Leverage and optimize open source packages plus
Multi threaded and Big Data ready packages
Commercial
Viability
Risk of deployment
of open source
Free for everyone Commercial licenses
R Client
Microsoft R product comparison
R Open
Datasize
In-memory
In-memory In-Memory or Disk Based
Speed of Analysis
Single threaded Multi-threaded
Multi-threaded, parallel
processing 1:N servers
Support
Community Community Community + Commercial
Analytic Breadth
& Depth 10k+ innovative analytic
packages
10k+ innovative analytic
packages
10k+ innovative packages +
commercial parallel high-speed
functions
License
Open Source
Open Source
Commercial license.
Supported release with indemnity
MRO MRS
Copyright Microsoft Corporation. All rights reserved.
• Re-Use Existing & New R Skills
Analytical Performance
Data Movement
Interoperability
Tool TCO
Governance
Productivity
R Skills
• Escape Legacy Tool Expense
• Dis-Incent “Shadow IT”
• Build Mixed-Talent Teams
• Analyze Data in Place
• Direct & Hybrid Deployment
• 10x – 50x Faster

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Data Governance Best Practices, Assessments, and Roadmaps
Data Governance Best Practices, Assessments, and RoadmapsData Governance Best Practices, Assessments, and Roadmaps
Data Governance Best Practices, Assessments, and Roadmaps
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 

Semelhante a Differentiate Big Data vs Data Warehouse use cases for a cloud solution

The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive
 

Semelhante a Differentiate Big Data vs Data Warehouse use cases for a cloud solution (20)

How does Microsoft solve Big Data?
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Azure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandyAzure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandy
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Building a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsBuilding a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloads
 

Mais de James Serra

Mais de James Serra (19)

Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
 
How to build your career
How to build your careerHow to build your career
How to build your career
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
HA/DR options with SQL Server in Azure and hybrid
HA/DR options with SQL Server in Azure and hybridHA/DR options with SQL Server in Azure and hybrid
HA/DR options with SQL Server in Azure and hybrid
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Differentiate Big Data vs Data Warehouse use cases for a cloud solution

  • 1. Differentiate Big Data vs Data Warehouse use cases for a cloud solution James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com Blog: JamesSerra.com
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda Data Lake Driven Analytics Compute technologies Patterns
  • 4.
  • 5. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation
  • 6. Implement Data Warehouse Physical Design ETL Development Reporting & Analytics Development Install and Tune Reporting & Analytics Design Dimension Modelling ETL Design Setup Infrastructure Understand Corporate Strategy Data sources ETL BI and analytic Data warehouse Gather Requirements Business Requirements Technical Requirements
  • 7. Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 8.
  • 9. Needs data governance so your data lake does not turn into a data swamp!
  • 10. Considering Data Types Audio, video, images. Meaningless without adding some structure Unstructured JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure Semi-Structured Structured CSV, Columnar Storage (Parquet, ORC). Strict data model structure Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
  • 11. What happened? What is happening? Why did it happen? What are key relationships? What will happen? What if? How risky is it? What should happen? What is the best option? How can I optimize? Data sources
  • 12. Data Lake Data Warehouse Schema-on-read Schema-on-write Physical collection of uncurated data Data of common meaning System of Insight: Unknown data to do experimentation / data discovery System of Record: Well-understood data to do operational reporting Any type of data Limited set of data types (ie. relational) Skills are limited Skills mostly available All workloads – batch, interactive, streaming, machine learning Optimized for interactive querying Complementary to DW Can be sourced from Data Lake
  • 13. Data Warehouse Serving, Security & Compliance Low latency Interactive ad-hoc query High number of users Additional security Large support for tools Easily create reports (Self-service BI) A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
  • 14.
  • 15. What is Azure Databricks? A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
  • 16. Azure HDInsight Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open Source Hortonworks data platform Clusters up and running in minutes Managed, monitored and supported by Microsoft with the industry’s best SLA Familiar BI tools for analysis, or open source notebooks for interactive data science 63% lower TCO than deploy your own Hadoop on-premises* *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
  • 17. Azure Data Lake Analytics A new distributed analytics service Distributed analytics service built on Apache YARN Elastic scale per query lets users focus on business goals—not configuring hardware Includes U-SQL—a language that unifies the benefits of SQL with the expressive power of C# Integrates with Visual Studio to develop, debug, and tune code faster Federated query across Azure data sources Enterprise-grade role based access control
  • 18. CONTROL EASE OF USE Azure Data Lake Analytics Azure Data Lake Store Azure Storage Any Hadoop technology, any distribution Workload optimized, managed clusters Data Engineering in a Job-as-a-service model Azure Marketplace HDP | CDH | MapR Azure Data Lake Analytics IaaS Clusters Managed Clusters Big Data as-a-service Azure HDInsight Frictionless & Optimized Spark clusters Azure Databricks BIGDATA STORAGE BIGDATA ANALYTICS ReducedAdministration K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S
  • 19. Azure SQL Data Warehouse A relational data warehouse-as-a-service, fully managed by Microsoft. Industries first elastic cloud data warehouse with enterprise-grade capabilities. Support your smallest to your largest data storage needs while handling queries up to 100x faster.
  • 23. Why extend the data warehouse? Semantic layer Handle many concurrent users Aggregating data for performance Multidimensional analysis No joins or relationships Hierarchies, KPI’s Row-level security Advanced time-calculations Slowly Changing Dimensions (SCD)
  • 24. Azure SQL DW HDInsight Hive LLAP HDInsight Spark ADLS/ADLA SQL Server (IaaS) Volume Petabytes Petabytes Petabytes Petabytes Terabytes Security Encryption, TD, Audit ADLS / Apache Ranger ADLS AAD Security Groups (data) Encryption, TD Audit Languages T-SQL HiveQL SparkSQL, HiveQL, Scala, Java, Python, R U-SQL T-SQL Extensibility No Yes, .NET/SerDe Yes, Packages Yes, .NET Yes, .NET CLR External File Types ORC, TXT, Parquet, RCFile ORC, CSV, Parquet + others Parquet, JSON, Hive + others Many ORC, TXT, Parquet, RCFile Admin Low-Medium Medium-High Medium-High Low High Cost Model DWU Nodes & VM Nodes & VM Units/Jobs VM Schema Definition Schema on Write / Polybase Schema on Read Schema on Read Schema on Read Schema on Write / Polybase Max DB Size Unlimited CCI 240TB Comp (5X = 1PB) index/heaps Unlimited 256TB (64 4TB drives)
  • 25. Microsoft Products vs Hadoop/OSS Products Note: Many of the Hadoop/OSS products are available in Azure Microsoft Product Hadoop/Open Source Software Product Office365/Excel OpenOffice/Calc Cosmos DB MongoDB, MarkLogic, HBase, Cassandra SQL Database SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite Azure Data Lake Analytics/YARN None Azure VM/IaaS OpenStack Blob Storage HDFS, Ceph Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion Event Hub Apache Kafka Azure Stream Analytics Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron Power BI Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay) Azure ML (Machine Learning) Apache Mahout, Apache Spark MLib, Apache PredictionIO Microsoft R Open R SQL Data Warehouse/Interactive queries Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala IoT Hub Apache NiFi Azure Data Factory Apache Falcon, Airbnb Airflow, Apache Oozie, Apache Azkaban Azure Data Lake Storage/WebHDFS HDFS Ozone Azure Analysis Services/SSAS Apache Kylin, Apache Druid, AtScale (pay) SQL Server Reporting Services None Hadoop Indexes Jethro Data (pay) Azure Data Catalog Apache Atlas PolyBase Apache Drill Azure Search Apache Solr, Apache ElasticSearch (Azure Search build on ES) SQL Server Integration Services (SSIS) Talend Open Studio, Pentaho Data Integration Others Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for Hadoop clusters), Apache Flume (collecting log data)
  • 26.
  • 27. Questions to ask client • Can you use the cloud? • Is this a new solution or a migration? • Do the developers have Hadoop skills? • Will you use non-relational data (variety)? • How much data do you need to store (volume)? • Is this an OLTP or OLAP/DW solution? • Will you have streaming data (velocity)? • Will you use dashboards? • How fast do the operational reports need to run? • Will you do predictive analytics? • Do you want to use Microsoft tools or open source? • What are your high availability and/or disaster recovery requirements? • Do you need to master the data (MDM)? • Are there any security limitations with storing data in the cloud? • Does this solution require 24/7 client access? • How many concurrent users will be accessing the solution at peak-time and on average? • What is the skill level of the end users? • What is your budget and timeline? • Is the source data cloud-born and/or on-prem born? • How much daily data needs to be imported into the solution? • What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)? • Are you ok with using products that are in preview?
  • 28. Advanced Analytics Social LOB Graph IoT Image CRM INGEST STORE PREP & TRAIN MODEL & SERVE Data orchestration and monitoring Big data store Hadoop/Spark and machine learning Data warehouse Your data hub for analytics Cloud Bursting BI + Reporting Azure Data Factory Azure Blob Storage Azure Databricks Azure Data Lake Azure HDInsight Azure Machine Learning Machine Learning Server Azure SQL Data Warehouse Azure Analysis Services
  • 29. INGEST STORE PREP & TRAIN MODEL & SERVE Azure Blob Storage Analytical dashboards Business/custom apps (Structured) Logs, files and media (unstructured) Azure SQL Data Warehouse Azure Data Factory Azure Analysis Services Azure Data Factory PolyBase
  • 30. INGEST STORE PREP & TRAIN MODEL & SERVE Azure Data Lake Store Analytical dashboards Business/custom apps (Structured) Logs, files and media (unstructured) Azure SQL Data Warehouse Tableau Server PolyBase Operational Reports Ad-Hoc Query Azure SQL Database
  • 31. Azure Data Lake Store Azure Blob Storage Purpose Optimized storage for big data analytics workloads General purpose object storage for a wide variety of storage scenarios including big data analytics Use Cases Batch, interactive, streaming analytics and machine learning data such as log files, IoT data, click streams, large datasets Any type of text or binary data, such as application back end, backup data, media storage for streaming and general purpose data as well as big data analytics Units of Storage Accounts / Folders / Files Accounts / Containers / Blobs Structure Hierarchical File System Object Store with flat namespace REST API WebHDFS-compatible Azure Blob Storage, compatible HDFS via WASB driver Security Azure Active Directory (AAD) Shared Access Signature (SAS) keys Authorization POSIX Access Control Lists (ACLs) Account-level: Account Access Keys; Account, container, or blob authorization: SAS keys Account/File Size Limits No limits on account size or file size 5PB account/4.75TB file Single Object/Account Throughput Limit Extremely high 2GB/s, or 50k tps (now stripe across multiple hard drives)/50GBs bandwidth Geo-Replications LRS LRS, ZRS, GRS, RA-GRS Cost/Month [1PB, East US 2] No tiering: $39k + Transactions Tiering: Hot $18k, Cool $10k, Archive $2k + Trans Product integration/Tooling Check Check Region Availability Two US regions (East, Central) & North Europe All Azure Regions ADL Store and Blob Store
  • 32. INGEST STORE PREP & TRAIN MODEL & SERVE Azure Blob Storage Logs, files and media (unstructured) Azure SQL Data Warehouse Azure Data Factory Azure Data Factory Azure Databricks Azure HDInsight Data Lake Analytics Analytical dashboards PolyBase Business/custom apps (Structured) Azure Analysis Services
  • 33. INGEST STORE PREP & TRAIN MODEL & SERVE Azure Blob Storage Analytical dashboards Business/custom apps (Structured) Logs, files and media (unstructured) Azure SQL Data Warehouse Azure Data Factory Azure Data Factory Azure Machine Learning Machine Learning Server PolyBase Azure Databricks Azure HDInsight Azure Data Lake Analytics Azure Analysis Services
  • 34.
  • 35. INGEST STORE PREP & TRAIN MODEL & SERVE Azure Blob Storage PolyBase Analytical dashboards Business/custom apps (Structured) Logs, files and media (unstructured) Azure SQL Data Warehouse Azure Data Factory Azure Machine Learning Machine Learning Server Sensors and IoT (unstructured) Azure HDInsight (Kafka) Azure IoT Hub Azure Databricks Azure HDInsight Azure Data Lake Analytics Azure Analysis Services
  • 36.
  • 37.
  • 39. Azure 10MTC Demo scenario (Connected Retail): aka.ms/a10, under Modern Data Warehouse
  • 40. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
  • 41.
  • 42. Who manages what? Infrastructure as a Service Storage Servers Networking O/S Middleware Virtualization Data Applications Runtime ManagedbyMicrosoft Youscale,make resilient&manage Platform as a Service Scale,Resilienceand managementbyMicrosoft Youmanage Storage Servers Networking O/S Middleware Virtualization Applications Runtime Data On Premises Physical / Virtual Youscale,makeresilientandmanage Storage Servers Networking O/S Middleware Virtualization Data Applications Runtime Software as a Service Storage Servers Networking O/S Middleware Virtualization Applications Runtime Data Scale,Resilienceand managementbyMicrosoft Windows Azure Virtual Machines Windows Azure Cloud Services
  • 43. Data management for analytics at any stage Query historical, relational data from a variety of sources STAGE 2: Operational STAGE 1: Traditional Gain real-time insights without impacting performance Ask questions of big data—all types, volumes and locations STAGE 4: Free-form STAGE 3: Logical Establish enterprise- wide data lake and run advanced analytics and deep learning on unstructured data that arrives in real-time
  • 44. Microsoft data platform solutions (partial list) Product Category Description More Info SQL Server 2017 RDBMS Earned top spot in Gartner’s Operational Database magic quadrant. JSON support. Linux support https://www.microsoft.com/en-us/server- cloud/products/sql-server-2017/ SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support. https://azure.microsoft.com/en- us/services/sql-database/ SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost https://azure.microsoft.com/en- us/services/sql-data-warehouse/ Azure Data Lake Store/Blob Storage Hadoop storage Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics https://azure.microsoft.com/en- us/services/data-lake-store/ https://azure.microsoft.com/en- us/services/storage/blobs HDInsight PaaS Hadoop compute/Hadoop clusters-as-a-service A managed Apache Hadoop, Spark, R Server, HBase, Kafka, Interactive Query (Hive LLAP) and Storm cloud service made easy https://azure.microsoft.com/en- us/services/hdinsight/ Azure Databricks PaaS Spark clusters A fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure https://databricks.com/azure Azure Data Lake Analytics On-demand analytics job service/Big Data-as-a- service Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language https://azure.microsoft.com/en- us/services/data-lake-analytics/ Azure Analysis Services Online analytical processing/PaaS OLAP engine used in decision support and business analytics, providing the analytical data for business reports and client apps https://azure.microsoft.com/en- us/services/analysis-services/ Azure Cosmos DB PaaS NoSQL: Key-value, Column-family, Document, Graph Globally distributed, massively scalable, multi-model, multi-API, low latency data service – which can be used as an operational database or a hot data lake https://azure.microsoft.com/en- us/services/cosmos-db/ Azure Database for PostgreSQL, MySQL, and MariaDB RDBMS/DBaaS A fully managed database service for app developers https://azure.microsoft.com/en- us/services/postgresql
  • 45. ExpressRoute We guarantee 99.95% ExpressRoute dedicated circuit availability (3.6TB/hour) (18TB/hour) (36TB/hour) (7.2TB/hour)
  • 46. Data Volume Low High Latency Cost/GB High Low Request Rate Structure High Low SQL (SQL DB, SQL DW) NoSQL (DocDB) HDFS (HDP, Cloudera) ADLS Hot Data Cold Data Cache
  • 47. Data Lake Store: Technical Requirements 50 Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place). Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance. Low latency Must have low latency for high-frequency operations. Must support multiple analytic frameworks—Batch, Real-time, Streaming, ML etc. No one analytic framework can work for all data and all types of analysis. Multiple analytic frameworks Details Must be able to store data with all details; aggregation may lead to loss of details. Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark Reliable Must be highly available and reliable (no permanent loss of data). Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up All sources Must be able ingest data from a variety of sources-LOB/ERP, Logs, Devices, Social NWs etc.
  • 48. 50 TB 100 TB 500 TB 10 TB 5 PB 1.000 100 10.000 3-5 Way Joins  Joins +  OLAP operations +  Aggregation +  Complex “Where” constraints +  Views  Parallelism 5-10 Way Joins Normalized Multiple, Integrated Stars and Normalized Simple Star Multiple, Integrated Stars TB’s MB’s GB’s Batch Reporting, Repetitive Queries Ad Hoc Queries Data Analysis/Mining Near Real Time Data Feeds Daily Load Weekly Load Strategic, Tactical Strategic Strategic, Tactical Loads Strategic, Tactical Loads, SLA “Query Freedom“ “Query complexity“ “Data Freshness” “Query Data Volume“ “Query Concurrency“ “Mixed Workload” “Schema Sophistication“ “Data Volume” DW SCALABILITY SPIDER CHART MPP – Multidimensional Scalability SMP – Tunable in one dimension on cost of other dimensions The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Big Data support is newest dimension.
  • 49. HiveQL T-SQL Updates INSERT UPDATE, INSERT, DELETE Transactions Supported Supported Index Types Compact, Bitmap Clustered, Non-Clustered, Columnar Latency Minutes Seconds Data Types + map, struct, array Int, float, boolean, char, binary.. Functions Dozens of built-in functions Hundreds of built-in functions Multi-Table Inserts Supported Not Supported Partitioning Supported Supported Multi-Level Partitioning Supported Not Supported Views Read-Only Read-Only and Updateable Extensibility UDF’s, MapReduce Scripts UDF’s, Stored Procedures
  • 50. Want Hadoop? Need exact same on- prem Need interactive / streaming? Mandatory No strong opinion Azure Marketplace (IaaS) • Need all workloads exactly like on- premises • Need 100% Hortonworks/Cloudera/MapR Azure HDInsight • Most Hadoop workloads • Fully managed by Microsoft • Sell HDI + ADLS • Stickier to Microsoft than VMs • Can do interactive (Spark) and streaming (Storm/Spark) Azure Data Lake Analytics • Easiest experience for admin: no sense of clusters, instant scale per job • Easiest experience for developers: Visual Studio/U-SQL (C#+SQL) • Sell ADLA + ADLS • Batch workloads only Need everything exactly like on-prem Need core projects Yes Batch is OK Always present ADLA if .NET or Visual Studio Shop If .NET or VS shop?
  • 51. HDInsight vs HDP on Azure VM HDInsight HDP on Azure VM PaaS (setup, scale, manage, patch, etc) IaaS Managed by Microsoft Managed by customer Storage separate (Blob or ADLS) Storage in VM (local disk), but can also have storage in Azure blob or ADLS Delete VM keeps data Delete VM deletes data (unless external) Up to 30-days behind latest HDP version Latest HDP Version Limited Hadoop projects Unlimited Hadoop projects Microsoft supports VM and Hadoop Microsoft: VM, HDP: Hadoop No on-prem version On-prem version
  • 53.
  • 55. Company C Event hubs Machine Learning Flatten & Metadata Join Data Factory: Move Data, Orchestrate, Schedule, and Monitor Machine Learning Azure SQL Data Warehouse Power BI INGEST PREPARE ANALYZE PUBLISH ASA Job Rule #2 CONSUMEDATA SOURCES Cortana Web/LOB Dashboards On Premise Hot Path Cold Path Archived Data Data Lake Store Simulated Sensors and devices Blobs – Reference Data Event hubs ASA Job Rule #1 Event hubs Real-time Scoring Aggregated Data Data Lake Store CSV Data Data Lake Store Data Lake Analytics Batch Scoring Offline Training Hourly, Daily, Monthly Roll Ups Ingestion Batch PresentationSpeed
  • 56.
  • 57. SQL Server R Services Linux Hadoop Teradata Windows CommercialCommunity R ServerR Open
  • 58. R vs Azure ML R Open is Microsoft’s open source version of R that (I believe) adds a few additional capabilities but largely mirrors the existing open source R. Microsoft R Server (MRS) adds additional capability that is not available in open source R include R Scale for large scale deployment of R jobs on clusters (i.e. HDInsight) and Microsoft ML (MML) which is a library of Microsoft’s best-in-breed ML algorithms available from within MRS only. SQL Server R is a SQL product that adds the ability to apply R functions/algorithms to data operations performed from within SQL server. Azure ML is a GUI-based product for building ML experiments and web services which includes access to most of the same underlying algorithms available programmatically in MML inside MRS.
  • 59. R Server Technology ConnectR • High-speed & direct connectors Available for: • High-performance XDF • SAS, SPSS, delimited & fixed format text data files • Hadoop HDFS (text & XDF) • Hive • Parquet • Teradata Database (TPT) • EDWs and ADWs • ODBC ScaleR • Ready-to-Use high-performance big data big analytics • Fully-parallelized analytics • Data prep & data distillation • Descriptive statistics & statistical tests • Range of predictive functions • User tools for distributing customized R algorithms across nodes • Wide data sets supported – thousands of variables DistributedR • Distributed computing framework • Delivers cross-platform portability R+CRAN • Open source R interpreter • R 3.3.2 • Freely-available huge range of R algorithms • Algorithms callable by MSR • Embeddable in R scripts • 100% Compatible with existing R scripts, functions and packages RevoR • Performance enhanced R interpreter • Based on open source R • Adds high-performance math library to speed up linear algebra functions
  • 60. Big Data In-memory bound In-memory bound Operates on large volumes when connected to R Server Hybrid memory & disk scalability Operates on bigger volumes & factors Speed of Analysis Multiple-threaded when MKL installed Same as R Open for non-ScaleR functions; Up to 2 threads for ScaleR functions when compute locally; Parallel threading and Parallel Processing Enterprise Readiness Community support Commercial support Commercial support Analytic Breadth & Depth 7000+ innovative analytic packages Leverage and optimize open source packages plus Big Data ready packages (ScaleR APIs) Leverage and optimize open source packages plus Multi threaded and Big Data ready packages Commercial Viability Risk of deployment of open source Free for everyone Commercial licenses R Client Microsoft R product comparison R Open
  • 61. Datasize In-memory In-memory In-Memory or Disk Based Speed of Analysis Single threaded Multi-threaded Multi-threaded, parallel processing 1:N servers Support Community Community Community + Commercial Analytic Breadth & Depth 10k+ innovative analytic packages 10k+ innovative analytic packages 10k+ innovative packages + commercial parallel high-speed functions License Open Source Open Source Commercial license. Supported release with indemnity MRO MRS Copyright Microsoft Corporation. All rights reserved.
  • 62. • Re-Use Existing & New R Skills Analytical Performance Data Movement Interoperability Tool TCO Governance Productivity R Skills • Escape Legacy Tool Expense • Dis-Incent “Shadow IT” • Build Mixed-Talent Teams • Analyze Data in Place • Direct & Hybrid Deployment • 10x – 50x Faster