Choosing technologies for a big data solution in the cloud
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
• Definitions
• Decision process on technologies
• Technologies to choose from
• Comparing technologies
• Common big data architectures
Material from many presentations
Presentations (mine):
• Relational databases vs Non-relational databases
• Should I move my database to the cloud?
• Big data architectures and the data lake
• Introducing Azure SQL Database
• Introducing Azure SQL Data Warehouse
• Introduction to DocumentDB
• Building an Effective Data Warehouse Architecture
• Building a Big Data Solution
• How does Microsoft solve Big Data?
• Introduction to PolyBase
Considering Data Types
• Unstructured: Audio, video, images. Meaningless without adding some structure
• Semi-Structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure
• Structured: CSV, Columnar Storage (Parquet, ORC). Strict data model structure
Relational databases (RDBMS) work with structured data. Non-relational databases (NoSQL) work with semi-structured data
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
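The difference between structured and semi-structured data can be sketched in a few lines of Python. The sample records and field names below are made up for illustration: every CSV row has to fit one fixed header, while each JSON document describes itself and may have a different shape.

```python
import csv
import io
import json

# Structured: every row must match the same fixed set of columns.
csv_text = "id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each document is self-describing and may differ in shape.
json_lines = [
    '{"id": 1, "name": "Ann", "tags": ["vip"]}',
    '{"id": 2, "device": {"os": "linux"}}',
]
docs = [json.loads(line) for line in json_lines]

# Collect the distinct "shapes" (sets of field names) seen in each source.
csv_shapes = {tuple(row.keys()) for row in rows}
json_shapes = {tuple(sorted(doc.keys())) for doc in docs}

print("CSV shapes:", len(csv_shapes))    # one shape shared by all rows
print("JSON shapes:", len(json_shapes))  # one shape per document here
```

A relational table behaves like the CSV case (the schema is fixed up front); a document store behaves like the JSON case.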
Big Data = All Data!
What is Big Data?
• Variety: It can be structured, semi-structured, or unstructured
• Velocity: It can be streaming, near real-time or batch
• Volume: It can be 1GB or 1PB
• Big data is the new currency
The Data Management Platform for Analytics
• Any BI tool: Dashboards | Reporting | Mobile BI | Cubes
• Advanced Analytics: Machine Learning | Stream analytics | Cognitive | AI
• Any language: .NET | Java | R | Python | Ruby | PHP | Scala
• Data warehousing: Relational data
• Big Data processing: Non-relational data
• Data virtualization
• Sources: OLTP | ERP | CRM | LOB | Social media | Devices | Web | Media
• Deployment: On-premises | Cloud
Benefits of the cloud
Agility
• Unlimited elastic scale
• Pay for what you need
Innovation
• Quick “Time to market”
• Fail fast
Risk
• Availability
• Reliability
• Security
Total cost of ownership calculator: https://www.tco.microsoft.com/
Who manages what?
On Premises (Physical / Virtual)
• You scale, make resilient and manage: Storage, Servers, Networking, O/S, Middleware, Virtualization, Data, Applications, Runtime
Infrastructure as a Service (Windows Azure Virtual Machines)
• Managed by Microsoft: Storage, Servers, Networking, Virtualization
• You scale, make resilient & manage: O/S, Middleware, Runtime, Data, Applications
Platform as a Service (Windows Azure Cloud Services)
• Scale, resilience and management by Microsoft: Storage, Servers, Networking, O/S, Middleware, Virtualization, Runtime
• You manage: Applications, Data
Software as a Service
• Scale, resilience and management by Microsoft: Storage, Servers, Networking, O/S, Middleware, Virtualization, Applications, Runtime, Data
Questions to ask client
• Can you use the cloud?
• Is this a new solution or a migration?
• Do the developers have Hadoop skills?
• Will you use non-relational data (variety)?
• How much data do you need to store (volume)?
• Is this an OLTP or OLAP/DW solution?
• Will you have streaming data (velocity)?
• Will you use dashboards?
• How fast do the operational reports need to run?
• Will you do predictive analytics?
• Do you want to use Microsoft tools or open source?
• What are your high availability and/or disaster recovery requirements?
• Do you need to master the data (MDM)?
• Are there any security limitations with storing data in the cloud?
• Does this solution require 24/7 client access?
• How many concurrent users will be accessing the solution at peak-time and on average?
• What is the skill level of the end users?
• What is your budget and timeline?
• Is the source data cloud-born and/or on-prem born?
• How much daily data needs to be imported into the solution?
• What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)?
• Are you ok with using products that are in preview?
DBMS vs NoSQL Decision Tree
Big Data Solutions Decision Tree
Thanks to Ivan Kosyakov: https://biz-excellence.com/2016/08/30/big-data-dt/
Machine Learning Solutions Decision Tree
Thanks to Ivan Kosyakov: https://biz-excellence.com/2016/09/13/machine-learning-dt/
Enterprise Information Management Decision Tree
Thanks to Ivan Kosyakov: https://biz-excellence.com/2017/04/17/eim/
Business Intelligence Solutions Decision Tree
Thanks to Ivan Kosyakov: https://biz-excellence.com/2017/05/16/bi-decision-tree/
SMP vs MPP
SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes
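The shared-nothing idea behind MPP can be sketched in Python. This is an illustration only, not how any Microsoft product is implemented: the function names, the CRC32 hash, and the node count are assumptions, and the "nodes" run one after another here rather than in parallel.

```python
import zlib
from collections import Counter

def distribute(rows, key, node_count):
    """Hash-distribute rows so each node owns a disjoint slice (shared nothing)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % node_count
        nodes[bucket].append(row)
    return nodes

def local_sum(node_rows, group_key, value_key):
    """Work done by one node using only its own rows."""
    totals = Counter()
    for row in node_rows:
        totals[row[group_key]] += row[value_key]
    return totals

def mpp_sum(rows, group_key, value_key, node_count=4):
    """Scatter rows to nodes, aggregate locally, then combine the partial results."""
    partials = [
        local_sum(node_rows, group_key, value_key)
        for node_rows in distribute(rows, group_key, node_count)
    ]
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds the partial sums together
    return dict(combined)

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "north", "amount": 3},
]
result = mpp_sum(sales, "region", "amount")
print(result)
```

Because rows are distributed on the grouping key, each group lives on exactly one node, so the final combine step only merges disjoint results. An SMP system would instead run the whole aggregation against one shared memory and disk.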
DW SCALABILITY SPIDER CHART
[Figure: spider chart comparing MPP and SMP across eight axes]
• “Data Volume”: 10 TB, 50 TB, 100 TB, 500 TB, 5 PB
• “Query Concurrency”: 100, 1,000, 10,000
• “Query complexity”: 3-5 Way Joins; 5-10 Way Joins; Joins + OLAP operations + Aggregation + Complex “Where” constraints + Views + Parallelism
• “Schema Sophistication”: Simple Star; Multiple, Integrated Stars; Normalized; Multiple, Integrated Stars and Normalized
• “Query Data Volume”: MB’s, GB’s, TB’s
• “Query Freedom”: Batch Reporting, Repetitive Queries; Ad Hoc Queries; Data Analysis/Mining
• “Data Freshness”: Weekly Load; Daily Load; Near Real Time Data Feeds
• “Mixed Workload”: Strategic; Strategic, Tactical; Strategic, Tactical Loads; Strategic, Tactical Loads, SLA
MPP – Multidimensional Scalability
SMP – Tunable in one dimension at the cost of other dimensions
The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Big Data support is the newest dimension.
Relational Databases vs Non-Relational Databases (NoSQL) vs Hadoop
• RDBMS for enterprise OLTP and ACID compliance, or db’s under 5TB
• NoSQL for scaled OLTP and JSON documents
• Hadoop for big data analytics (OLAP)
(from my presentation “Relational Databases vs Non-Relational Databases”)
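The three rules of thumb above can be written as a small Python helper. This is only a sketch: the function name, the parameters, and the order in which the rules are checked are assumptions made for illustration, and a real decision would weigh many more factors (see the questions to ask the client).

```python
def suggest_store(workload, size_tb, needs_acid=False, json_documents=False):
    """Rule-of-thumb picker mirroring the three bullets on the slide.

    workload: "oltp" or "olap" (hypothetical labels).
    size_tb: expected database size in terabytes.
    """
    # RDBMS for enterprise OLTP with ACID compliance, or databases under 5 TB.
    if needs_acid or size_tb < 5:
        return "RDBMS"
    # Hadoop for big data analytics (OLAP).
    if workload == "olap":
        return "Hadoop"
    # NoSQL for scaled OLTP and JSON documents.
    if workload == "oltp" or json_documents:
        return "NoSQL"
    return "RDBMS"

print(suggest_store("oltp", size_tb=2))
print(suggest_store("olap", size_tb=500))
print(suggest_store("oltp", size_tb=50, json_documents=True))
```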
Velocity
Volume Per Day | Real-world Transactions Per Day | Real-world Transactions Per Second | Relational DB | Document Store | Key Value or Wide Column
8 GB | 8.64B | 100,000 | As Is | As Is | As Is
86 GB | 86.4B | 1M | Tuned* | As Is | As Is
432 GB | 432B | 5M | Appliance | Tuned* | As Is
864 GB | 864B | 10M | Clustered Appliance | Clustered Servers | Tuned*
8,640 GB | 8.64T | 100M | - | Many Clustered Servers | Clustered Servers
43,200 GB | 43.2T | 500M | - | - | Many Clustered Servers
* Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
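The "Transactions Per Day" figures follow directly from the per-second rates, since a day has 86,400 seconds. A quick Python check of that arithmetic (the function name is just for illustration; the volume column is taken as given and not derived here):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def transactions_per_day(per_second):
    """Sustained transactions per second converted to transactions per day."""
    return per_second * SECONDS_PER_DAY

for tps in (100_000, 1_000_000, 5_000_000, 10_000_000, 100_000_000, 500_000_000):
    print(f"{tps:>11,} tx/s -> {transactions_per_day(tps):,} tx/day")
```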
Microsoft data platform solutions
Product | Category | Description | More Info
SQL Server 2016 | RDBMS | Earned top spot in Gartner’s Operational Database magic quadrant. JSON support. Linux TBD | https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/
SQL Database | RDBMS/DBaaS | Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support | https://azure.microsoft.com/en-us/services/sql-database/
SQL Data Warehouse | MPP RDBMS/DBaaS | Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost | https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Analytics Platform System (APS) | MPP RDBMS | Big data analytics appliance for high performance and seamless integration of all your data | https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
Azure Data Lake Store | Hadoop storage | Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics | https://azure.microsoft.com/en-us/services/data-lake-store/
Azure Data Lake Analytics | On-demand analytics job service/Big Data-as-a-service | Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language | https://azure.microsoft.com/en-us/services/data-lake-analytics/
HDInsight | PaaS Hadoop compute/Hadoop clusters-as-a-service | A managed Apache Hadoop, Spark, R, HBase, Kafka, and Storm cloud service made easy | https://azure.microsoft.com/en-us/services/hdinsight/
Azure Cosmos DB | PaaS NoSQL: Document Store | Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax | https://azure.microsoft.com/en-us/services/documentdb/
Azure Table Storage | PaaS NoSQL: Key-value Store | Store large amount of semi-structured data in the cloud | https://azure.microsoft.com/en-us/services/storage/tables/
Microsoft Big Data Portfolio
SQL Server Stretch
Business intelligence
Machine learning analytics
Insights
Azure SQL Database
SQL Server 2016
SQL Server 2016 Fast Track
Azure SQL DW
ADLS & ADLA
Cosmos DB
HDInsight
Hadoop
Analytics Platform System
Sequential Scale Out + Across Scale Up
Key
Relational Non-relational
On-premises Cloud
Microsoft has solutions covering
and connecting all four
quadrants – that’s why SQL
Server is one of the most utilized
databases in the world
Azure SQL Data Warehouse
A relational data warehouse-as-a-service, fully managed by Microsoft.
Industry’s first elastic cloud data warehouse with enterprise-grade capabilities.
Support your smallest to your largest data storage needs while handling queries up to 100x faster.
Azure
Data Lake Store
A hyper-scale
repository for Big Data
analytics workloads
Hadoop File System (HDFS) for the cloud
No limits to scale
Store any data in its native format
Enterprise-grade access control,
encryption at rest
Optimized for analytic workload performance
Data lake is the center of a big data solution
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• With Hadoop, high availability built in
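“Schema on read” can be sketched in Python as follows. The events, field names, and the `read_with_schema` helper are hypothetical; the point is that raw lines are stored untouched and a schema is applied only when a particular question is asked.

```python
import json

# "Lake": raw events kept exactly as they arrived, with no upfront modeling.
raw_events = [
    '{"user": "ann", "amount": "12.50", "ts": "2017-03-01"}',
    '{"user": "bob", "amount": "3", "coupon": "X1"}',
    'not json at all',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time; skip records that do not fit it."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            # The raw line stays in the lake; this reader simply ignores it.
            continue

# One question, one schema. A different question can use a different schema
# over the same raw data.
spend_schema = {"user": str, "amount": float}
spend = list(read_with_schema(raw_events, spend_schema))
print(spend)
```

With schema on write, the third line would have been rejected (or the load would have failed) at ingest time; here nothing is lost and the decision is deferred to the reader.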
Data Lake Transformation (ELT not ETL)
New Approaches
All data sources are considered
Leverages the power of on-prem
technologies and the cloud for
storage and capture
Native formats, streaming data, big
data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes
possible
Refineries transform data on read
Produce curated data sets to
integrate with traditional warehouses
Users discover published data
sets/services using familiar tools
CRM ERP OLTP LOB
DATA SOURCES
FUTURE DATA SOURCES
NON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE DATA REFINERY PROCESS
(TRANSFORM ON READ)
Transform
relevant data
into data sets
BI AND ANALYTICS
Discover and
consume
predictive
analytics, data
sets and other
reports
DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This solves the two biggest reasons why many EDW projects fail:
• Too much time spent modeling when you don’t know all of the questions your data needs to answer
• Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
Data Lake layers
• Raw data layer – Raw events are stored for historical reference. Also called staging layer or landing area
• Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types and content (e.g. strings). Also called conformed layer
• Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. DW application, advanced analysis process, etc). Also called workspace layer, trusted layer, or presentation layer
Still need data governance so your data lake does not turn into a data swamp!
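The three layers can be illustrated with a small in-memory Python sketch. The records, field names, and cleansing rules are invented for the example; in a real lake each layer would typically be a separate folder of files.

```python
# Raw layer: events stored exactly as received.
raw_layer = [
    {"Customer": " Ann ", "amount": "10.0", "country": "no"},
    {"Customer": "bob", "amount": "5", "country": "NO"},
    {"Customer": "Ann", "amount": "2.5", "country": "No"},
]

def cleanse(record):
    """Cleansed layer: uniform field names, data types, and string content."""
    return {
        "customer": record["Customer"].strip().title(),
        "amount": float(record["amount"]),
        "country": record["country"].upper(),
    }

def total_by_customer(records):
    """Application layer: business logic applied to the cleansed data."""
    totals = {}
    for record in records:
        name = record["customer"]
        totals[name] = totals.get(name, 0.0) + record["amount"]
    return totals

cleansed_layer = [cleanse(record) for record in raw_layer]
application_layer = total_by_customer(cleansed_layer)
print(application_layer)
```

Keeping the raw layer unchanged means the cleansing rules can be revised and replayed later without going back to the source systems.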
Azure
HDInsight
Hadoop and Spark
as a Service on Azure
Fully-managed Hadoop and Spark
for the cloud
100% Open Source Hortonworks
data platform
Clusters up and running in minutes
Managed, monitored and supported
by Microsoft with the industry’s best SLA
Familiar BI tools for analysis, or open source
notebooks for interactive data science
63% lower TCO than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure
Data Lake Analytics
A new distributed
analytics service
Distributed analytics service built on
Apache YARN
Elastic scale per query lets users focus on
business goals—not configuring hardware
Includes U-SQL—a language that unifies the
benefits of SQL with the expressive
power of C#
Integrates with Visual Studio to develop,
debug, and tune code faster
Federated query across Azure data sources
Enterprise-grade role based access control
Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
• Avoid moving large amounts of data across the network
between stores (federated query/logical data warehouse)
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining
multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Filters, Joins
• SELECT * FROM EXTERNAL MyDataSource EXECUTE @"SELECT CustName FROM Customers WHERE ID=1"; (not pushdown)
• SELECT CustName FROM EXTERNAL MyDataSource LOCATION "dbo.Customers" WHERE ID == 1; (pushdown)
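Why it matters where a filter is evaluated can be shown with a generic Python sketch. This is not U-SQL and says nothing about how ADLA plans the two statements above: an in-memory SQLite database stands in for a remote SQL source, and the helper names are made up. The second value returned by each helper is the number of rows that had to cross the "network".

```python
import sqlite3

# Stand-in for a remote SQL source (assumption: a tiny Customers table).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE Customers (ID INTEGER, CustName TEXT)")
remote.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cy")],
)

def filter_locally(conn, wanted_id):
    """Pull every row from the source, then filter on the client side."""
    rows = conn.execute("SELECT ID, CustName FROM Customers").fetchall()
    names = [name for (row_id, name) in rows if row_id == wanted_id]
    return names, len(rows)

def filter_at_source(conn, wanted_id):
    """Send the filter to the source so only matching rows come back."""
    rows = conn.execute(
        "SELECT CustName FROM Customers WHERE ID = ?", (wanted_id,)
    ).fetchall()
    names = [name for (name,) in rows]
    return names, len(rows)

print(filter_locally(remote, 1))
print(filter_at_source(remote, 1))
```

Both calls return the same customer name; they differ only in how many rows were transferred, which is the cost that federated queries try to avoid on large tables.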
U-SQL
Query
Query
Azure
Storage Blobs
Azure SQL
in VMs
Azure
SQL DB
Azure Data
Lake Analytics
Azure
SQL Data Warehouse
Azure
Data Lake Storage
PolyBase
Query relational and non-relational data with T-SQL
In a preview early this year, PolyBase will add support for Teradata, Oracle, SQL Server, MongoDB, and generic ODBC (Spark, Hive, Impala, DB2)
Vs U-SQL: PolyBase is interactive while U-SQL is batch. U-SQL takes more code to query data but supports more formats (JSON), libraries/UDOs, and writes to blob/ADLS
PolyBase use cases
PolyBase Reality
PolyBase in: | Parallelize Data Load (Blob and ADLS) | Federated Query (push down): HDInsight | Federated Query (push down): HDP/Cloudera (local or blob) | Federated Query (push down): New 5 | Age Out Data
SQL DW | Yes | N/A | N/A | No on-prem support | Maybe
SQL Server 2016 | Yes via scale-out groups. Blob, not ADLS | N | Y (MapReduce job) | Y | Maybe
Supports: UTF-8 and UTF-16 encoded delimited text, RC File, ORC,
Parquet, gzip, zlib, Snappy. Not supported: extended ASCII, fixed-file
format, WinZip, JSON, and XML. SQL DB not supported
SQL DW now supports ADLS but not compute pushdown
• ADLS in only two regions (East US 2, Central US)
• SQL DW: Think of PolyBase as mechanism for data loading
• SQL Server 2016: Think of PolyBase for federated querying
• PolyBase supports row sizes up to 1MB
• Writes only to blob/ADLS (using CETAS)
• Requires External Table command
PolyBase parallelized reads:
• Supported: in SQL using CTAS or INSERT INTO
• Not supported: BCP, Bulk Insert, SQLBulkCopy
• Not supported: SSIS (unless used to call stored procedure containing CTAS)
• Supported: ADF
o If source compatible with PolyBase, will directly copy
o If source not compatible, will stage to Blob
o If source is ADLS, will still stage to Blob (to be fixed end February)
SSAS/Azure Analysis Services Cubes
Reasons to report off cubes instead of the data warehouse:
• Semantic layer
• Handle many concurrent users
• Aggregating data for performance
• Multidimensional analysis
• No joins or relationships
• Hierarchies, KPI’s
• Security
• Advanced time-calculations
• Slowly Changing Dimensions (SCD)
• Required for some reporting tools
ADL Store vs Blob Store
Feature | Azure Data Lake Store | Azure Blob Storage
Purpose | Optimized for big data analytics | General purpose bulk storage
Use Cases | Batch, Interactive, Streaming | App backend, backup data, media storage for streaming
Units of Storage | Accounts / Folders / Files | Accounts / Containers / Blobs
Structure | Hierarchical File System | Flat namespace
WebHDFS | Implements WebHDFS | No (WASB)
Security | AD | SAS keys
Storage | Auto Shared/Files chunked | Manually manage expansion/Files intact
Size Limits | No limits on account size, file size, # files | 500TB account, 4.75TB file
Service State | Generally Available | Generally Available
Billing | Pay for data stored and for I/O | Pay for data stored and for I/O
Region Availability | Two US regions (Other regions coming soon) | All Azure Regions
Want
Hadoop?
Need exact
same on-
prem
Need
interactive /
streaming?
Mandatory
No strong opinion
Azure Marketplace (IaaS)
• Need all workloads exactly like on-
premises
• Need 100% Hortonworks/Cloudera/MapR
Azure HDInsight
• Most Hadoop workloads
• Fully managed by Microsoft
• Sell HDI + ADLS
• Stickier to Microsoft than VMs
• Can do interactive (Spark) and streaming
(Storm/Spark)
Azure Data Lake Analytics
• Easiest experience for admin: no sense of
clusters, instant scale per job
• Easiest experience for developers: Visual
Studio/U-SQL (C#+SQL)
• Sell ADLA + ADLS
• Batch workloads only
Need everything exactly
like on-prem
Need core
projects Yes Batch is OK
Always present
ADLA if .NET or
Visual Studio Shop
If .NET or
VS shop?
The data warehousing portfolio from Microsoft
Comprehensive solutions
[Figure: product matrix with rows SMP, MPP, Supports non-relational, Cloud, Pre-engineered; one of the products shown is APS with HP CS300]
Feature | Azure SQL DW | HDInsight Hive | HDInsight Spark | Azure Data Lake | SQL Server (IaaS)
Volume | Petabytes | Petabytes | Petabytes | Petabytes | Terabytes
Security | Encryption, TD, Audit | ADLS / Apache Ranger | ADLS | AAD Security Groups (data) | Encryption, TD, Audit
Languages | T-SQL (subset) | HiveQL | SparkSQL, HiveQL, Scala, Java, Python, R | U-SQL | T-SQL
Extensibility | No | Yes, .NET/SerDe | Yes, Packages | Yes, .NET | Yes, .NET CLR
External File Types | ORC, TXT, Parquet, RCFile | ORC, CSV, Parquet + others | Parquet, JSON, Hive + others | Many | ORC, TXT, Parquet, RCFile
Admin | Low-Medium | Medium-High | Medium-High | Low | High
Cost Model | DWU | Nodes & VM | Nodes & VM | Units/Jobs | VM
Schema Definition | Schema on Write / Polybase | Schema on Read | Schema on Read | Schema on Read | Schema on Write / Polybase
Max DB Size | 240TB Comp (5X = 1PB) | Unlimited | Unlimited | Unlimited | 64TB (64 1TB drives)
Data Warehouse Future
SQL DW
• Replicated tables in private preview (it’s a cache)
• 10PB max db size this summer
SQL DB
• 4TB in public preview (1TB now)
• Project Cloud Lift, instance level, 35TB max db, true SQL Server compatibility (cross-database
queries), private preview March CY17, public preview H2CY17; Socrates: 100TB max db
VM
• GS5: 32 cores, 448GB memory, 80k disk IOPS
• Superdome X: 384 cores, 24TB memory, 92TB disk
• Larger disks Q2CY17 (up to 4TB SSDs), so 256TB max database; 8TB end CY17, 32TB CY18
• New VM sizes with much more cores and memory on the way
• SQL14/SQL16 have a feature called “Data Files in Azure Storage Blobs” that allows it to store its
data/log files on as many Blobs as desired. This allows going above the VM storage limit. Writes
are the same. Reads are slower (1ms to 5ms) given that there is no read cache
Data Lake | Data Warehouse
Complementary to DW | Can be sourced from Data Lake
Schema-on-read | Schema-on-write
Physical collection of uncurated data | Data of common meaning
System of Insight: Unknown data to do experimentation / data discovery | System of Record: Well-understood data to do operational reporting
Any type of data | Limited set of data types (i.e. relational)
Skills are limited | Skills mostly available
All workloads – batch, interactive, streaming, machine learning | Optimized for interactive querying
Roles when using both Data Lake and DW
Data Lake/Hadoop (staging and processing environment)
• Batch reporting
• Data refinement/cleaning
• ETL workloads
• Store historical data
• Sandbox for data exploration
• One-time reports
• Data scientist workloads
• Quick results
Data Warehouse/RDBMS (serving and compliance environment)
• Low latency
• High number of users
• Additional security
• Large support for tools
• Easily create reports (Self-service BI)
• A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption,
BI/visualization)
Consume
(Alerts, Operational
Stats, Insights)
Lambda Architecture : Interactive Analytics Pipeline
Data Consumption
(Ingestion)
Stream Layer (data in motion)
Batch Layer (data at rest)
Presentation/Serving
Layer
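The layers of a Lambda architecture can be sketched as a toy Python class. The class, method names, and the "device" field are assumptions for illustration: the batch layer periodically recomputes a complete view from all stored events, the stream (speed) layer keeps an incremental view of events that arrived since the last batch run, and the serving layer answers queries by combining the two.

```python
from collections import Counter

class LambdaPipeline:
    def __init__(self):
        self.master = []             # data at rest: append-only event history
        self.batch_view = Counter()  # recomputed from master on a schedule
        self.speed_view = Counter()  # data in motion: counts since last batch run

    def ingest(self, event):
        """Ingestion: every event goes to both the batch and the stream path."""
        self.master.append(event)
        self.speed_view[event["device"]] += 1

    def run_batch(self):
        """Batch layer: rebuild the full view, then reset the speed view."""
        self.batch_view = Counter(e["device"] for e in self.master)
        self.speed_view.clear()

    def query(self, device):
        """Serving layer: batch result plus whatever arrived since."""
        return self.batch_view[device] + self.speed_view[device]

pipeline = LambdaPipeline()
pipeline.ingest({"device": "a"})
pipeline.ingest({"device": "a"})
pipeline.run_batch()
pipeline.ingest({"device": "a"})
pipeline.ingest({"device": "b"})
print(pipeline.query("a"), pipeline.query("b"))
```

In the Azure pipelines shown later in this deck, Stream Analytics plays the stream-layer role and Data Lake Analytics or HDInsight the batch-layer role.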
Microsoft Products vs Hadoop/OSS Products
Microsoft Product Hadoop/Open Source Software Product
Office365/Excel OpenOffice/Calc
DocumentDB MongoDB, HBase, Cassandra
SQL Database SQLite, MySQL, PostgreSQL, MariaDB
Azure Data Lake Analytics/YARN None
Azure VM/IaaS OpenStack
Blob Storage HDFS, Ceph (Note: These are distributed file systems and Blob storage is not distributed)
Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion
Event Hub Apache Kafka
Azure Stream Analytics Apache Storm, Apache Spark, Twitter Heron
Power BI Apache Zeppelin, Jupyter, Airbnb Caravel, Kibana
HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay)
Azure ML Apache Mahout, Apache Spark MLlib
Microsoft R Open R
SQL Data Warehouse Apache Hive, Apache Drill, Presto
IoT Hub Apache NiFi
Azure Data Factory Apache Falcon, Apache Oozie, Airbnb Airflow
Azure Data Lake Storage/WebHDFS HDFS Ozone
Azure Analysis Services/SSAS Apache Kylin, Apache Lens, AtScale (pay)
SQL Server Reporting Services None
Hadoop Indexes Jethro Data (pay)
Azure Data Catalog Apache Atlas
PolyBase Apache Drill
Azure Search Apache Solr, Elasticsearch (Azure Search is built on ES)
Others Apache Flink, Apache Ambari, Apache Ranger, Apache Knox
Note: Many of the Hadoop/OSS products are available in Azure
Business
apps
Custom
apps
Sensors
and devices
Events Events
Spark Streaming
Stream Processing
Azure
Stream Analytics
Event Processing
Azure Event
Hubs
Kafka
Events
Events
Choosing an Ingestion Technology
Kafka Azure Event Hubs
Managed No Yes
Ordering Yes Yes
Delivery At-least-once At-least-once
Lifetime Configurable 1-30 Days
Replication Configurable within Region Yes
Throughput *nodes 20 throughput units
Parallel Clients Yes No
MapReduce Yes No
Record Size Configurable 256K
Cost Low + Admin Low
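Both options list at-least-once delivery, which means a consumer can see the same event more than once and has to tolerate that. A minimal Python sketch of an idempotent consumer follows; the message shape (an "id" field) and the in-memory set are assumptions, and a production consumer would persist the seen ids or rely on offsets/checkpoints instead.

```python
def process_at_least_once(messages, handler):
    """Apply handler once per message id even if the broker redelivers."""
    seen = set()
    for message in messages:
        if message["id"] in seen:
            continue  # duplicate delivery: already handled, skip it
        handler(message)
        seen.add(message["id"])
    return seen

# Message 1 is delivered twice, as can happen after a retry.
delivered = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 5},
    {"id": 1, "value": 10},
]
handled_values = []
process_at_least_once(delivered, lambda m: handled_values.append(m["value"]))
print(sum(handled_values))
```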
Choosing a Stream Processing Technology
Feature | Azure Stream Analytics | Storm | Spark Streaming
Managed | Yes | Yes | Yes
Temporal Operators | Windowed aggregates and temporal joins are supported out of the box. | Temporal operators must be implemented | Temporal operators must be implemented
Development Experience | Interactive authoring and debugging experience through Azure Portal on sample data. | Visual Studio, etc | Visual Studio, etc
Data Encoding formats | Stream Analytics requires UTF-8 data format to be utilized. | Any data encoding format may be implemented via custom code. | Any data encoding format may be implemented via custom code.
Scalability | Number of Streaming Units for each job. Each Streaming Unit processes up to 1MB/s. Max of 50 units by default. Call to increase limit. | Number of nodes in the HDI Storm cluster. No limit on number of nodes (Top limit defined by your Azure quota). Call to increase limit. | Number of nodes in the HDI Spark cluster. No limit on number of nodes (Top limit defined by your Azure quota). Call to increase limit.
Data processing limits | Users can scale up or down number of Streaming Units to increase data processing or optimize costs. Scale up to 1 GB/s | User can scale up or down cluster size to meet needs. | User can scale up or down cluster size to meet needs.
Late arrival and out of order event handling | Built-in configurable policies to reorder, drop events or adjust event time. | User must implement logic to handle this scenario. | User must implement logic to handle this scenario.
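To give a feel for what "must be implemented" means for temporal operators and late-arrival handling, here is a minimal Python sketch of a tumbling-window count with a drop policy for late events. It is not Storm or Spark Streaming code; the function name, the integer event times, and the policy are assumptions chosen to keep the example small.

```python
from collections import defaultdict

def tumbling_counts(event_times, window_seconds, allowed_lateness):
    """Count events per fixed, non-overlapping window; drop events that arrive too late.

    event_times: event timestamps (seconds) in the order they arrive.
    allowed_lateness: how far behind the newest event time an event may be.
    """
    counts = defaultdict(int)
    watermark = 0  # highest event time seen so far
    dropped = 0
    for event_time in event_times:
        watermark = max(watermark, event_time)
        if watermark - event_time > allowed_lateness:
            dropped += 1  # too old compared with the stream's progress
            continue
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts), dropped

# Event 9 arrives slightly out of order and is accepted;
# event 3 arrives far too late and is dropped.
arrivals = [1, 4, 12, 9, 25, 3]
windows, dropped = tumbling_counts(arrivals, window_seconds=10, allowed_lateness=5)
print(windows, dropped)
```

Azure Stream Analytics exposes the equivalent behaviour as built-in windowing functions and configurable late-arrival policies, which is the trade-off the table describes.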
https://microsoft.sharepoint.com/teams/AzureSolutionCafe
Cloud Big Data Solution
Excel
Third party
BI tools
Cloud data sources
SQL Database
SQL
Data Warehouse
Direct Query
Cached Model
Power BI
Power BI
Embedded
SQL Server
Other
data sources
Power BI
Desktop
Visual Studio
Authoring and
development tools
On-premises
data sources
Teradata
Oracle
Direct Query
Cached Model
Gateway
Cloud
visualization tools
On-premises
visualization tools
Azure
Analysis Services
Analytics
Platform System
Other data
sources
Interactive Analytics and Predictive Pipeline using Azure Data Factory
Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption,
BI/visualization)
Consume
(Alerts, Operational
Stats, Insights)
Machine Learning
(Failure and RCA
Predictions)
Azure SQL
(Predictions)
HDI Custom ETL
Aggregate /Partition
Azure Storage Blob
dashboard of
predictions /
alerts
PowerBI
dashboard
(Shared with field
Ops, customers,
MIS, and Engineers)
Baseline Architecture : Interactive Analytics Pipeline
Near Realtime Data Analytics Pipeline using Azure Stream Analytics
Big Data Analytics Pipeline using Azure Data Lake
Interactive Analytics and Predictive Pipeline using Azure Data Factory
Base Architecture : Big Data Advanced Analytics Pipeline
Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption,
BI/visualization)
Consume
(Alerts, Operational
Stats, Insights)
Machine Learning
(Failure and RCA
Predictions)
Telemetry
Azure SQL
(Predictions)
HDI Custom ETL
Aggregate /Partition
Azure Storage Blob
dashboard of
predictions /
alerts
Live / real-time data
stats, Anomalies and
aggregates
Customer MIS
Event
Hub
PowerBI
dashboard
Stream Analytics
(real-time analytics)
Azure Data Lake Analytics
(Big Data Processing)
Azure Data Lake
Storage
Azure SQL
(COL + TACOPS)
Data in Motion
Data at Rest
dashboard of
operational
stats FDS +
SDS
(Shared with field
Ops, customers,
MIS, and Engineers)
Scheduled hourly transfer using Azure Data Factory
Machine
Learning
(Anomaly Detection)
Schneider Electric Architecture
Event hubs
Machine
Learning
Flatten &
Metadata Join
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
Machine
Learning Azure SQL
Data Warehouse
Power BI
INGEST PREPARE ANALYZE PUBLISH
ASA Job Rule #2
CONSUME
DATA SOURCES
Cortana
Web/LOB
Dashboards
On Premise
Hot Path
Cold Path
Archived
Data
Data Lake
Store
Simulated Sensors
and devices
Blobs –
Reference Data
Event hubs ASA Job Rule #1
Event hubs
Real-time Scoring
Aggregated Data
Data Lake
Store
CSV Data
Data Lake
Store
Data Lake
Analytics
Batch Scoring
Offline Training
Hourly, Daily,
Monthly Roll Ups
Ingestion
Batch
Presentation
Speed
Resources
• Relational database vs Non-relational databases: http://bit.ly/1HXn2Rt
• Types of NoSQL databases: http://bit.ly/1HXn8Zl
• What is Polyglot Persistence? http://bit.ly/1HXnhMm
• Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
• Hadoop and Microsoft: http://bit.ly/20Cg2hA
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)

Mais conteúdo relacionado

Mais procurados

Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 

Mais procurados (20)

Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
 
Enterprise Data Architecture Deliverables
Enterprise Data Architecture DeliverablesEnterprise Data Architecture Deliverables
Enterprise Data Architecture Deliverables
 
Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)Building a Logical Data Fabric using Data Virtualization (ASEAN)
Building a Logical Data Fabric using Data Virtualization (ASEAN)
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Modern Data Platform on AWS
Modern Data Platform on AWSModern Data Platform on AWS
Modern Data Platform on AWS
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 

Choosing technologies for a big data solution in the cloud

  • 1. Choosing technologies for a big data solution in the cloud James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com
  • 2. About Me • Microsoft, Big Data Evangelist • In IT for 30 years, worked on many BI and DW projects • Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer • Been perm employee, contractor, consultant, business owner • Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference • Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions • Blog at JamesSerra.com • Former SQL Server MVP • Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda • Definitions • Decision process on technologies • Technologies to choose from • Comparing technologies • Common big data architectures
  • 4. Material from many presentations. Presentations (mine): • Relational databases vs Non-relational databases • Should I move my database to the cloud? • Big data architectures and the data lake • Introducing Azure SQL Database • Introducing Azure SQL Data Warehouse • Introduction to DocumentDB • Building an Effective Data Warehouse Architecture • Building a Big Data Solution • How does Microsoft solve Big Data? • Introduction to PolyBase
  • 5.
  • 6. Considering Data Types • Unstructured: Audio, video, images. Meaningless without adding some structure • Semi-Structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure • Structured: CSV, Columnar Storage (Parquet, ORC). Strict data model structure • Relational databases (RDBMS) work with structured data. Non-relational databases (NoSQL) work with semi-structured data • Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
  • 7. Big Data = All Data! What is Big Data? • Variety: It can be structured, semi-structured, or unstructured • Velocity: It can be streaming, near real-time or batch • Volume: It can be 1GB or 1PB • Big data is the new currency
  • 8. The Data Management Platform for Analytics (diagram) • Sources: OLTP, ERP, CRM, LOB, social media, devices, web, media • Deployment: on-premises, cloud • Platform: relational data, non-relational data, data virtualization, data warehousing, big data processing • Any BI tool: Dashboards | Reporting | Mobile BI | Cubes • Advanced Analytics: Machine Learning | Stream analytics | Cognitive | AI • Any language: .NET | Java | R | Python | Ruby | PHP | Scala
  • 9. Benefits of the cloud Agility • Unlimited elastic scale • Pay for what you need Innovation • Quick “Time to market” • Fail fast Risk • Availability • Reliability • Security Total cost of ownership calculator: https://www.tco.microsoft.com/
  • 10. Who manages what? (diagram) Each model shows the same stack: Applications, Data, Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking • On Premises (physical / virtual): you scale, make resilient and manage • Infrastructure as a Service (Windows Azure Virtual Machines): part of the stack is managed by Microsoft; you scale, make resilient & manage the rest • Platform as a Service (Windows Azure Cloud Services): scale, resilience and management by Microsoft for most of the stack; you manage the remainder • Software as a Service: scale, resilience and management by Microsoft
  • 11.
  • 12. Questions to ask client • Can you use the cloud? • Is this a new solution or a migration? • Do the developers have Hadoop skills? • Will you use non-relational data (variety)? • How much data do you need to store (volume)? • Is this an OLTP or OLAP/DW solution? • Will you have streaming data (velocity)? • Will you use dashboards? • How fast do the operational reports need to run? • Will you do predictive analytics? • Do you want to use Microsoft tools or open source? • What are your high availability and/or disaster recovery requirements? • Do you need to master the data (MDM)? • Are there any security limitations with storing data in the cloud? • Does this solution require 24/7 client access? • How many concurrent users will be accessing the solution at peak-time and on average? • What is the skill level of the end users? • What is your budget and timeline? • Is the source data cloud-born and/or on-prem born? • How much daily data needs to be imported into the solution? • What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)? • Are you ok with using products that are in preview?
  • 13. DBMS vs NoSQL Decision Tree
  • 14. Big Data Solutions Decision Tree Thanks to Ivan Kosyakov: https://biz-excellence.com/2016/08/30/big-data-dt/
  • 15. Machine Learning Solutions Decision Tree Thanks to Ivan Kosyakov: https://biz-excellence.com/2016/09/13/machine-learning-dt/
  • 16. Enterprise Information Management Decision Tree Thanks to Ivan Kosyakov: https://biz-excellence.com/2017/04/17/eim/
  • 17. Business Intelligence Solutions Decision Tree Thanks to Ivan Kosyakov: https://biz-excellence.com/2017/05/16/bi-decision-tree/
  • 18. SMP vs MPP. SMP - Symmetric Multiprocessing: • Multiple CPUs used to complete individual processes simultaneously • All CPUs share the same memory, disks, and network controllers (scale-up) • All SQL Server implementations up until now have been SMP • Mostly, the solution is housed on a shared SAN. MPP - Massively Parallel Processing: • Uses many separate CPUs running in parallel to execute a single program • Shared Nothing: Each CPU has its own memory and disk (scale-out) • Segments communicate using high-speed network between nodes
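The shared-nothing idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the "nodes" are plain Python lists, rows are hash-distributed on the grouping key, and the function names (`partition_rows`, `local_sum`, `mpp_sum`) are hypothetical, not part of any Microsoft product.

```python
from collections import defaultdict


def partition_rows(rows, node_count, key):
    """Hash-distribute rows so each node owns a disjoint slice (shared nothing)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[hash(row[key]) % node_count].append(row)
    return nodes


def local_sum(node_rows, group_key, value_key):
    """Work done by one node, using only the rows it owns."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def mpp_sum(rows, node_count, group_key, value_key):
    """Run the aggregation on every node, then combine the partial results."""
    partials = [
        local_sum(node_rows, group_key, value_key)
        for node_rows in partition_rows(rows, node_count, group_key)
    ]
    combined = defaultdict(int)
    for partial in partials:
        for group, total in partial.items():
            combined[group] += total
    return dict(combined)


sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 20},
]
print(mpp_sum(sales, node_count=4, group_key="region", value_key="amount"))
```

In a real MPP system the per-node step runs in parallel on separate machines and the combine step travels over the network; here both run sequentially in one process.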
  • 19. DW SCALABILITY SPIDER CHART (figure). MPP: Multidimensional Scalability. SMP: Tunable in one dimension at the cost of other dimensions. Axes: “Data Volume”, “Query Data Volume”, “Query Concurrency”, “Query complexity”, “Query Freedom”, “Data Freshness”, “Mixed Workload”, “Schema Sophistication”. The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Big Data support is the newest dimension.
  • 20.
  • 21. Relational Databases vs Non-Relational Databases (NoSQL) vs Hadoop • RDBMS for enterprise OLTP and ACID compliance, or db’s under 5TB • NoSQL for scaled OLTP and JSON documents • Hadoop for big data analytics (OLAP) (from my presentation “Relational Databases vs Non-Relational Databases”)
  • 22. Velocity (table). Columns: Volume Per Day | Real-world Transactions Per Day | Real-world Transactions Per Second | Relational DB | Document Store | Key Value or Wide Column
      8 GB | 8.64B | 100,000 | As Is | |
      86 GB | 86.4B | 1M | Tuned* | As Is |
      432 GB | 432B | 5M | Appliance | Tuned* | As Is
      864 GB | 864B | 10M | Clustered Appliance | Clustered Servers | Tuned*
      8,640 GB | 8.64T | 100M | | Many Clustered Servers | Clustered Servers
      43,200 GB | 43.2T | 500M | | | Many Clustered Servers
      * Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
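The per-day figures in the velocity table follow from the per-second figures by multiplying by 86,400 seconds. The sketch below reproduces that arithmetic. The one-byte-per-transaction default is an inference from the table's volume column (8.64B transactions listed against roughly 8 GB), not something the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def transactions_per_day(per_second):
    """Convert a sustained per-second rate into a per-day count."""
    return per_second * SECONDS_PER_DAY


def volume_gb_per_day(per_second, bytes_per_transaction=1):
    """Daily volume in GB; bytes_per_transaction=1 is an assumed value."""
    return transactions_per_day(per_second) * bytes_per_transaction / 1e9


for rate in (100_000, 1_000_000, 5_000_000, 10_000_000):
    print(rate, transactions_per_day(rate), volume_gb_per_day(rate))
```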
  • 23. Microsoft data platform solutions (table). Columns: Product | Category | Description | More Info
      SQL Server 2016 | RDBMS | Earned top spot in Gartner’s Operational Database magic quadrant. JSON support. Linux TBD | https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/
      SQL Database | RDBMS/DBaaS | Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support | https://azure.microsoft.com/en-us/services/sql-database/
      SQL Data Warehouse | MPP RDBMS/DBaaS | Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost | https://azure.microsoft.com/en-us/services/sql-data-warehouse/
      Analytics Platform System (APS) | MPP RDBMS | Big data analytics appliance for high performance and seamless integration of all your data | https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
      Azure Data Lake Store | Hadoop storage | Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics | https://azure.microsoft.com/en-us/services/data-lake-store/
      Azure Data Lake Analytics | On-demand analytics job service/Big Data-as-a-service | Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language | https://azure.microsoft.com/en-us/services/data-lake-analytics/
      HDInsight | PaaS Hadoop compute/Hadoop clusters-as-a-service | A managed Apache Hadoop, Spark, R, HBase, Kafka, and Storm cloud service made easy | https://azure.microsoft.com/en-us/services/hdinsight/
      Azure Cosmos DB | PaaS NoSQL: Document Store | Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax | https://azure.microsoft.com/en-us/services/documentdb/
      Azure Table Storage | PaaS NoSQL: Key-value Store | Store large amount of semi-structured data in the cloud | https://azure.microsoft.com/en-us/services/storage/tables/
  • 24. Microsoft Big Data Portfolio (diagram). Quadrants: Relational vs Non-relational, On-premises vs Cloud. Key: Sequential, Scale Up, Scale Out + Across. Products shown: SQL Server 2016, SQL Server 2016 Fast Track, SQL Server Stretch, Azure SQL Database, Azure SQL DW, Analytics Platform System, ADLS & ADLA, Cosmos DB, HDInsight, Hadoop. Capabilities shown: Business intelligence, Machine learning analytics, Insights. Microsoft has solutions covering and connecting all four quadrants – that’s why SQL Server is one of the most utilized databases in the world
  • 25. Azure SQL Data Warehouse. A relational data warehouse-as-a-service, fully managed by Microsoft. The industry’s first elastic cloud data warehouse with enterprise-grade capabilities. Supports your smallest to your largest data storage needs while handling queries up to 100x faster.
  • 26. Azure Data Lake Store: A hyper-scale repository for Big Data analytics workloads • Hadoop File System (HDFS) for the cloud • No limits to scale • Store any data in its native format • Enterprise-grade access control, encryption at rest • Optimized for analytic workload performance
  • 27. Data lake is the center of a big data solution A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements EDW • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • With Hadoop, high availability built in
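The “schema on read” bullet can be illustrated with a small sketch: data is stored exactly as it arrives, and each reader applies its own schema when it queries. Everything here is hypothetical (the in-memory `store` list standing in for the lake, the field names, and the `ingest` / `read_with_schema` helpers); it is not how any particular Hadoop tool implements the idea.

```python
import json


def ingest(store, line):
    """Store the record untouched: no modeling or validation at write time."""
    store.append(line)


def read_with_schema(store, schema):
    """Apply a schema (field name -> cast function) only when reading."""
    rows = []
    for line in store:
        record = json.loads(line)
        rows.append(
            {field: cast(record[field]) for field, cast in schema.items() if field in record}
        )
    return rows


store = []
ingest(store, '{"user": "u1", "amount": "12.5", "note": "x"}')

# Two consumers read the same raw data with different schemas.
print(read_with_schema(store, {"user": str, "amount": float}))
print(read_with_schema(store, {"note": str}))
```

The trade-off is that malformed data is only discovered at read time, which is one reason the deck later stresses data governance.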
  • 28. Data Lake Transformation (ELT not ETL). New Approaches: • All data sources are considered • Leverages the power of on-prem technologies and the cloud for storage and capture • Native formats, streaming data, big data • Extract and load, no/minimal transform • Storage of data in near-native format • Orchestration becomes possible • Streaming data accommodation becomes possible • Refineries transform data on read • Produce curated data sets to integrate with traditional warehouses • Users discover published data sets/services using familiar tools. Diagram flow: DATA SOURCES (OLTP, ERP, CRM, LOB), NON-RELATIONAL DATA and FUTURE DATA SOURCES -> EXTRACT AND LOAD -> DATA LAKE -> DATA REFINERY PROCESS (TRANSFORM ON READ): transform relevant data into data sets -> DATA WAREHOUSE: star schemas, views, other read-optimized structures -> BI AND ANALYTICS: discover and consume predictive analytics, data sets and other reports
  • 29. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure This solves the two biggest reasons why many EDW projects fail: • Too much time spent modeling when you don’t know all of the questions your data needs to answer • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
  • 30. Data Lake layers • Raw data layer – Raw events are stored for historical reference. Also called staging layer or landing area • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types and content (e.g. strings). Also called conformed layer • Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. DW application, advanced analysis process, etc). Also called workspace layer or trusted layer or presentation layer. Still need data governance so your data lake does not turn into a data swamp!
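The three layers on this slide can be sketched as three small functions, one per layer. This is a minimal illustration, assuming JSON sensor events with hypothetical `device` and `temp_c` fields; the cleaning rules and the per-device average are invented business logic, not taken from the deck.

```python
import json


def to_raw(event_lines):
    """Raw layer: keep every event exactly as received, including bad ones."""
    return list(event_lines)


def to_cleansed(raw_lines):
    """Cleansed layer: parse, drop malformed events, standardize types and strings."""
    cleansed = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable input stays only in the raw layer
        if not isinstance(record, dict) or "device" not in record or "temp_c" not in record:
            continue
        try:
            temp = float(record["temp_c"])
        except (TypeError, ValueError):
            continue
        cleansed.append({"device": str(record["device"]).strip().lower(), "temp_c": temp})
    return cleansed


def to_application(cleansed):
    """Application layer: apply business logic (here, average temperature per device)."""
    sums = {}
    for row in cleansed:
        total, count = sums.get(row["device"], (0.0, 0))
        sums[row["device"]] = (total + row["temp_c"], count + 1)
    return {device: total / count for device, (total, count) in sums.items()}


raw = to_raw(['{"device": " A1 ", "temp_c": "20"}', "not json", '{"device": "a1", "temp_c": 22}'])
print(to_application(to_cleansed(raw)))
```

Keeping the raw layer untouched means the cleansing rules can be changed and replayed later without re-ingesting from the source systems.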
  • 31. Azure HDInsight: Hadoop and Spark as a Service on Azure • Fully-managed Hadoop and Spark for the cloud • 100% Open Source Hortonworks data platform • Clusters up and running in minutes • Managed, monitored and supported by Microsoft with the industry’s best SLA • Familiar BI tools for analysis, or open source notebooks for interactive data science • 63% lower TCO than deploying your own Hadoop on-premises* *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
  • 32. Azure Data Lake Analytics: A new distributed analytics service • Distributed analytics service built on Apache YARN • Elastic scale per query lets users focus on business goals, not configuring hardware • Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C# • Integrates with Visual Studio to develop, debug, and tune code faster • Federated query across Azure data sources • Enterprise-grade role based access control
  • 33. Query data where it lives: Easily query data in multiple Azure data stores without moving it to a single store. Benefits: • Avoid moving large amounts of data across the network between stores (federated query/logical data warehouse) • Single view of data irrespective of physical location • Minimize data proliferation issues caused by maintaining multiple copies • Single query language for all data • Each data store maintains its own sovereignty • Design choices based on the need • Push SQL expressions to remote SQL sources (filters, joins) • SELECT * FROM EXTERNAL MyDataSource EXECUTE @"Select CustName from Customers WHERE ID=1"; (not pushdown) • SELECT CustName FROM EXTERNAL MyDataSource WHERE ID=1 LOCATION "dbo.Customers" (pushdown). Diagram: a U-SQL query in Azure Data Lake Analytics queries Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage
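Why pushing filters to the remote source matters can be shown with a generic sketch. It does not model U-SQL itself: an in-memory SQLite database stands in for the remote SQL source, the table and customer names are made up, and the two helper functions are hypothetical. The point is only the difference in how many rows cross the "network".

```python
import sqlite3


def make_remote_source():
    """Stand-in for a remote SQL store (assumption: SQLite in memory)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Customers (ID INTEGER PRIMARY KEY, CustName TEXT)")
    conn.executemany(
        "INSERT INTO Customers VALUES (?, ?)",
        [(1, "Contoso"), (2, "Fabrikam"), (3, "Northwind")],
    )
    return conn


def query_without_pushdown(conn, customer_id):
    """Fetch every row from the source, then filter on the caller's side."""
    rows = conn.execute("SELECT ID, CustName FROM Customers").fetchall()
    names = [name for (cid, name) in rows if cid == customer_id]
    return names, len(rows)  # second value: rows transferred


def query_with_pushdown(conn, customer_id):
    """Send the filter to the source so only matching rows come back."""
    rows = conn.execute(
        "SELECT CustName FROM Customers WHERE ID = ?", (customer_id,)
    ).fetchall()
    return [name for (name,) in rows], len(rows)


conn = make_remote_source()
print(query_without_pushdown(conn, 1))
print(query_with_pushdown(conn, 1))
```

Both calls return the same customer name; the pushdown variant transfers one row instead of the whole table, which is the saving the slide's first benefit describes.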
  • 34. PolyBase: Query relational and non-relational data with T-SQL. A preview early this year will add PolyBase support for Teradata, Oracle, SQL Server, MongoDB, and generic ODBC (Spark, Hive, Impala, DB2). Vs U-SQL: PolyBase is interactive while U-SQL is batch. U-SQL takes more code to query data but handles more formats (JSON), offers libraries/UDOs, and supports writes to blob/ADLS
  • 36. PolyBase Reality (table). Columns: PolyBase in | Parallelize Data Load (Blob and ADLS) | Federated Query (push down) HDInsight | Federated Query (push down) HDP/Cloudera (local or blob) | Federated Query (push down) New 5 | Age Out Data
      SQL DW | Yes | N/A | N/A | No on-prem support | Maybe
      SQL Server 2016 | Yes via scale-out groups. Blob, not ADLS | N | Y (MapReduce job) | Y | Maybe
      Supports: UTF-8 and UTF-16 encoded delimited text, RC File, ORC, Parquet, gzip, zlib, Snappy. Not supported: extended ASCII, fixed-file format, WinZip, JSON, and XML. SQL DB not supported. SQL DW now supports ADLS but not compute pushdown
      • ADLS in only two regions (East US 2, Central US) • SQL DW: Think of PolyBase as mechanism for data loading • SQL Server 2016: Think of PolyBase for federated querying • PolyBase supports row sizes up to 1MB • Writes only to blob/ADLS (using CETAS) • Requires External Table command
      PolyBase parallelized reads: • Supported: in SQL using CTAS or INSERT INTO • Not supported: BCP, Bulk Insert, SQLBulkCopy • Not supported: SSIS (unless used to call stored procedure containing CTAS) • Supported: ADF (if source compatible with PolyBase, will directly copy; if source not compatible, will stage to Blob; if source is ADLS, will still stage to Blob, to be fixed end February)
  • 37. SSAS/Azure Analysis Services Cubes. Reasons to report off cubes instead of the data warehouse: • Semantic layer • Handle many concurrent users • Aggregating data for performance • Multidimensional analysis • No joins or relationships • Hierarchies, KPI’s • Security • Advanced time-calculations • Slowly Changing Dimensions (SCD) • Required for some reporting tools
  • 38.
  • 39. ADL Store vs Blob Store (table). Columns: Attribute | Azure Data Lake Store | Azure Blob Storage
      Purpose | Optimized for big data analytics | General purpose bulk storage
      Use Cases | Batch, Interactive, Streaming | App backend, backup data, media storage for streaming
      Units of Storage | Accounts / Folders / Files | Accounts / Containers / Blobs
      Structure | Hierarchical File System | Flat namespace
      WebHDFS | Implements WebHDFS | No (WASB)
      Security | AD | SAS keys
      Storage | Auto Shared/Files chunked | Manually manage expansion/Files intact
      Size Limits | No limits on account size, file size, # files | 500TB account, 4.75TB file
      Service State | Generally Available | Generally Available
      Billing | Pay for data stored and for I/O | Pay for data stored and for I/O
      Region Availability | Two US regions (Other regions coming soon) | All Azure Regions
  • 40. Want Hadoop? (decision flow diagram). Questions: Want Hadoop? (Mandatory / No strong opinion) • Need exact same as on-prem? (Need everything exactly like on-prem / Need core projects) • Need interactive / streaming? (Yes / Batch is OK) • If .NET or VS shop? Always present ADLA if .NET or Visual Studio shop. Outcomes: Azure Marketplace (IaaS): • Need all workloads exactly like on-premises • Need 100% Hortonworks/Cloudera/MapR. Azure HDInsight: • Most Hadoop workloads • Fully managed by Microsoft • Sell HDI + ADLS • Stickier to Microsoft than VMs • Can do interactive (Spark) and streaming (Storm/Spark). Azure Data Lake Analytics: • Easiest experience for admin: no sense of clusters, instant scale per job • Easiest experience for developers: Visual Studio/U-SQL (C#+SQL) • Sell ADLA + ADLS • Batch workloads only
  • 41. The data warehousing portfolio from Microsoft: Comprehensive solutions (matrix figure). Attributes compared: SMP, MPP, SUPPORTS NON-RELATIONAL, CLOUD, PRE-ENGINEERED. Product shown: APS with HP CS300
  • 42. Comparison (table). Columns: Attribute | Azure SQL DW | HDInsight Hive | HDInsight Spark | Azure Data Lake | SQL Server (IaaS)
      Volume | Petabytes | Petabytes | Petabytes | Petabytes | Terabytes
      Security | Encryption, TD, Audit | ADLS / Apache Ranger | ADLS | AAD Security Groups (data) | Encryption, TD Audit
      Languages | T-SQL (subset) | HiveQL | SparkSQL, HiveQL, Scala, Java, Python, R | U-SQL | T-SQL
      Extensibility | No | Yes, .NET/SerDe | Yes, Packages | Yes, .NET | Yes, .NET CLR
      External File Types | ORC, TXT, Parquet, RCFile | ORC, CSV, Parquet + others | Parquet, JSON, Hive + others | Many | ORC, TXT, Parquet, RCFile
      Admin | Low-Medium | Medium-High | Medium-High | Low | High
      Cost Model | DWU | Nodes & VM | Nodes & VM | Units/Jobs | VM
      Schema Definition | Schema on Write / Polybase | Schema on Read | Schema on Read | Schema on Read | Schema on Write / Polybase
      Max DB Size | 240TB Comp (5X = 1PB) | Unlimited | Unlimited | Unlimited | 64TB (64 1TB drives)
  • 43. Data Warehouse Future
      SQL DW
        • Replicated tables in private preview (it’s a cache)
        • 10PB max db size this summer
      SQL DB
        • 4TB in public preview (1TB now)
        • Project Cloud Lift: instance level, 35TB max db, true SQL Server compatibility (cross-database queries), private preview March CY17, public preview H2CY17
        • Socrates: 100TB max db
      VM
        • GS5: 32 cores, 448GB memory, 80k disk IOPS
        • Superdome X: 384 cores, 24TB memory, 92TB disk
        • Larger disks Q2CY17 (up to 4TB SSDs), so 256TB max database; 8TB end CY17, 32TB CY18
        • New VM sizes with many more cores and much more memory on the way
        • SQL14/SQL16 have a feature called “Data Files in Azure Storage Blobs” that lets SQL Server store its data/log files on as many blobs as desired, which allows going above the VM storage limit. Writes are the same. Reads are slower (1ms to 5ms) because there is no read cache
  • 44. Data Lake vs Data Warehouse
      Data Lake | Data Warehouse
      Complementary to DW | Can be sourced from Data Lake
      Schema-on-read | Schema-on-write
      Physical collection of uncurated data | Data of common meaning
      System of Insight: Unknown data to do experimentation / data discovery | System of Record: Well-understood data to do operational reporting
      Any type of data | Limited set of data types (i.e. relational)
      Skills are limited | Skills mostly available
      All workloads: batch, interactive, streaming, machine learning | Optimized for interactive querying
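The schema-on-read versus schema-on-write distinction on this slide can be illustrated with a small sketch. This is not code from the deck: the `SCHEMA` dictionary, the function names, and the sample records are all hypothetical, and plain Python lists stand in for a warehouse table and a data lake folder.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SCHEMA = {"device_id": str, "temperature": float}


def write_with_schema(table, record):
    """Schema-on-write: validate before storing; reject records that do not fit."""
    for column, expected_type in SCHEMA.items():
        if not isinstance(record.get(column), expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")
    table.append({column: record[column] for column in SCHEMA})


def read_with_schema(raw_lines):
    """Schema-on-read: store anything; apply the schema only when querying."""
    rows = []
    for line in raw_lines:
        try:
            doc = json.loads(line)
            rows.append({"device_id": str(doc["device_id"]),
                         "temperature": float(doc["temperature"])})
        except (ValueError, KeyError, TypeError):
            continue  # unreadable under this schema; the raw line stays in the lake
    return rows


# Warehouse-style path: only well-formed rows are ever stored.
warehouse = []
write_with_schema(warehouse, {"device_id": "a1", "temperature": 21.5})

# Lake-style path: raw lines are kept as-is and interpreted at read time.
lake = ['{"device_id": "a1", "temperature": "21.5", "extra": true}',
        'not json at all',
        '{"device_id": 7, "temperature": 19}']
lake_rows = read_with_schema(lake)
```

The trade-off the slide describes shows up directly: the write path refuses bad data up front, while the read path accepts everything and pushes the cleanup cost to every query.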
  • 45. Roles when using both Data Lake and DW
      Data Lake/Hadoop (staging and processing environment)
        • Batch reporting
        • Data refinement/cleaning
        • ETL workloads
        • Store historical data
        • Sandbox for data exploration
        • One-time reports
        • Data scientist workloads
        • Quick results
      Data Warehouse/RDBMS (serving and compliance environment)
        • Low latency
        • High number of users
        • Additional security
        • Large support for tools
        • Easily create reports (Self-service BI)
        • A data lake is just a glorified file folder with data files in it. How many end-users can accurately create reports from it?
  • 47. Lambda Architecture: Interactive Analytics Pipeline
      Pipeline stages: Data Sources → Ingest → Prepare (normalize, clean, etc.) → Analyze (stat analysis, ML, etc.) → Publish (for programmatic consumption, BI/visualization) → Consume (Alerts, Operational Stats, Insights)
      Layers: Data Consumption (Ingestion); Stream Layer (data in motion); Batch Layer (data at rest); Presentation/Serving Layer
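As a rough illustration of how the layers named on this slide fit together, the sketch below counts events per device. It is a toy model under stated assumptions, not the deck's implementation: the event shape, `batch_view`, `SpeedLayer`, and `serve` are hypothetical names, and in-memory counters stand in for real batch and stream engines.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute totals from all data at rest."""
    return Counter(event["device"] for event in master_dataset)


class SpeedLayer:
    """Stream layer: incremental counts for events not yet in a batch view."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["device"]] += 1


def serve(batch, realtime, device):
    """Serving layer: answer a query by merging batch and real-time views."""
    return batch[device] + realtime[device]


# Data at rest, processed by the periodic batch job.
master = [{"device": "pump-1"}, {"device": "pump-1"}, {"device": "pump-2"}]
batch = batch_view(master)

# Data in motion: an event that arrived after the last batch run.
speed = SpeedLayer()
speed.on_event({"device": "pump-1"})

total_pump_1 = serve(batch, speed.realtime_view, "pump-1")
```

In a real deployment the real-time view would be reset or trimmed once the next batch run has absorbed those events; that bookkeeping is left out here.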
  • 48. Microsoft Products vs Hadoop/OSS Products
      Microsoft Product | Hadoop/Open Source Software Product
      Office365/Excel | OpenOffice/Calc
      DocumentDB | MongoDB, HBase, Cassandra
      SQL Database | SQLite, MySQL, PostgreSQL, MariaDB
      Azure Data Lake Analytics/YARN | None
      Azure VM/IaaS | OpenStack
      Blob Storage | HDFS, Ceph (Note: These are distributed file systems and Blob storage is not distributed)
      Azure HBase | Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion
      Event Hub | Apache Kafka
      Azure Stream Analytics | Apache Storm, Apache Spark, Twitter Heron
      Power BI | Apache Zeppelin, Jupyter, Airbnb Caravel, Kibana
      HDInsight | Hortonworks (pay), Cloudera (pay), MapR (pay)
      Azure ML | Apache Mahout, Apache Spark MLlib
      Microsoft R Open | R
      SQL Data Warehouse | Apache Hive, Apache Drill, Presto
      IoT Hub | Apache NiFi
      Azure Data Factory | Apache Falcon, Apache Oozie, Airbnb Airflow
      Azure Data Lake Storage/WebHDFS | HDFS Ozone
      Azure Analysis Services/SSAS | Apache Kylin, Apache Lens, AtScale (pay)
      SQL Server Reporting Services | None
      Hadoop Indexes | Jethro Data (pay)
      Azure Data Catalog | Apache Atlas
      PolyBase | Apache Drill
      Azure Search | Apache Solr, Elasticsearch (Azure Search built on ES)
      Others | Apache Flink, Apache Ambari, Apache Ranger, Apache Knox
      Note: Many of the Hadoop/OSS products are available in Azure
  • 49. [Diagram] Event sources (business apps, custom apps, sensors and devices) send events to Event Processing (Azure Event Hubs, Kafka), which passes events to Stream Processing (Azure Stream Analytics, Spark Streaming)
  • 50. Choosing an Ingestion Technology
      Attribute | Kafka | Azure Event Hubs
      Managed | No | Yes
      Ordering | Yes | Yes
      Delivery | At-least-once | At-least-once
      Lifetime | Configurable | 1-30 Days
      Replication | Configurable | Yes, within Region
      Throughput | *nodes | 20 throughput units
      Parallel Clients | Yes | No
      MapReduce | Yes | No
      Record Size | Configurable | 256K
      Cost | Low + Admin | Low
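Both options in this table deliver at-least-once, which means a consumer may see the same message more than once and has to tolerate duplicates. The sketch below shows one common way to do that, an idempotent consumer that remembers message ids. It uses no Kafka or Event Hubs client API; the `IdempotentConsumer` class and the `id`/`amount` message fields are hypothetical, chosen only for illustration.

```python
class IdempotentConsumer:
    """Applies each message once, even if the broker redelivers it."""

    def __init__(self):
        self.seen_ids = set()  # ids of messages already applied
        self.total = 0

    def handle(self, message):
        """Return True if the message was applied, False if it was a duplicate."""
        if message["id"] in self.seen_ids:
            return False
        self.seen_ids.add(message["id"])
        self.total += message["amount"]
        return True


# Message 1 is delivered twice, as at-least-once delivery permits.
delivered = [{"id": 1, "amount": 10},
             {"id": 2, "amount": 5},
             {"id": 1, "amount": 10}]

consumer = IdempotentConsumer()
results = [consumer.handle(message) for message in delivered]
```

A production consumer would persist the seen ids (or use a naturally idempotent write such as an upsert) so that deduplication survives a restart.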
  • 51. Choosing a Stream Processing Technology
      Managed
        Azure Stream Analytics: Yes
        Storm: Yes
        Spark Streaming: Yes
      Temporal Operators
        Azure Stream Analytics: Windowed aggregates and temporal joins are supported out of the box.
        Storm: Temporal operators must be implemented.
        Spark Streaming: Temporal operators must be implemented.
      Development Experience
        Azure Stream Analytics: Interactive authoring and debugging experience through Azure Portal on sample data.
        Storm: Visual Studio, etc.
        Spark Streaming: Visual Studio, etc.
      Data Encoding Formats
        Azure Stream Analytics: Requires the UTF-8 data format.
        Storm: Any data encoding format may be implemented via custom code.
        Spark Streaming: Any data encoding format may be implemented via custom code.
      Scalability
        Azure Stream Analytics: Number of Streaming Units for each job. Each Streaming Unit processes up to 1MB/s. Max of 50 units by default. Call to increase limit.
        Storm: Number of nodes in the HDI Storm cluster. No limit on number of nodes (top limit defined by your Azure quota). Call to increase limit.
        Spark Streaming: Number of nodes in the HDI Spark cluster. No limit on number of nodes (top limit defined by your Azure quota). Call to increase limit.
      Data Processing Limits
        Azure Stream Analytics: Users can scale the number of Streaming Units up or down to increase data processing or optimize costs. Scale up to 1 GB/s.
        Storm: Users can scale cluster size up or down to meet needs.
        Spark Streaming: Users can scale cluster size up or down to meet needs.
      Late Arrival and Out-of-Order Event Handling
        Azure Stream Analytics: Built-in configurable policies to reorder, drop events or adjust event time.
        Storm: User must implement logic to handle this scenario.
        Spark Streaming: User must implement logic to handle this scenario.
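To make the "temporal operators" and "late arrival" rows more concrete, here is a minimal sketch of the kind of logic a user would have to write by hand when it is not built in: a tumbling-window count with a simple drop policy for late events. The function name, its parameters, and the integer event times are assumptions for illustration; this does not reproduce how Azure Stream Analytics, Storm, or Spark Streaming implement windows.

```python
from collections import defaultdict


def tumbling_window_counts(event_times, window_seconds, max_lateness_seconds):
    """Count events per fixed, non-overlapping window keyed by window start time.

    Events arriving more than max_lateness_seconds behind the newest event
    time seen so far are dropped instead of counted.
    """
    counts = defaultdict(int)
    newest_seen = 0  # highest event time observed so far
    dropped = 0
    for event_time in event_times:
        newest_seen = max(newest_seen, event_time)
        if newest_seen - event_time > max_lateness_seconds:
            dropped += 1  # too late: apply the drop policy
            continue
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts), dropped


# Event times in seconds, listed in arrival order (3 and 2 arrive out of order).
arrivals = [1, 4, 12, 3, 25, 2]
counts, dropped = tumbling_window_counts(arrivals, window_seconds=10,
                                         max_lateness_seconds=10)
```

Here the event at time 3 is only 9 seconds behind the newest event and is still counted, while the event at time 2 arrives 23 seconds behind and is dropped. A real engine would also decide when a window is final and emit it; this sketch only accumulates counts.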
  • 53. Cloud Big Data Solution
  • 54. [Diagram] Power BI connectivity. Labels shown:
      Cloud data sources: SQL Database, SQL Data Warehouse, Azure Analysis Services, other data sources
      On-premises data sources (via Gateway): SQL Server, Teradata, Oracle, Analytics Platform System, other data sources
      Connection modes: Direct Query, Cached Model
      Cloud visualization tools: Power BI, Power BI Embedded
      On-premises visualization tools: Excel, third party BI tools
      Authoring and development tools: Power BI Desktop, Visual Studio
  • 55. Baseline Architecture: Interactive Analytics Pipeline
      Interactive Analytics and Predictive Pipeline using Azure Data Factory
      Pipeline stages: Data Sources → Ingest → Prepare (normalize, clean, etc.) → Analyze (stat analysis, ML, etc.) → Publish (for programmatic consumption, BI/visualization) → Consume (Alerts, Operational Stats, Insights)
      Components: HDI Custom ETL (Aggregate / Partition); Azure Storage Blob; Machine Learning (Failure and RCA Predictions); Azure SQL (Predictions); dashboard of predictions / alerts; PowerBI dashboard (Shared with field Ops, customers, MIS, and Engineers)
  • 56. Base Architecture: Big Data Advanced Analytics Pipeline
      Pipelines shown:
        • Near Realtime Data Analytics Pipeline using Azure Stream Analytics
        • Big Data Analytics Pipeline using Azure Data Lake
        • Interactive Analytics and Predictive Pipeline using Azure Data Factory
      Pipeline stages: Data Sources → Ingest → Prepare (normalize, clean, etc.) → Analyze (stat analysis, ML, etc.) → Publish (for programmatic consumption, BI/visualization) → Consume (Alerts, Operational Stats, Insights)
      Data in Motion / Data at Rest
      Components: Telemetry; Customer MIS; FDS + SDS; Event Hub; Stream Analytics (real-time analytics); Machine Learning (Anomaly Detection); Live / real-time data stats, Anomalies and aggregates; HDI Custom ETL (Aggregate / Partition); Azure Storage Blob; Azure Data Lake Storage; Azure Data Lake Analytics (Big Data Processing); Scheduled hourly transfer using Azure Data Factory; Machine Learning (Failure and RCA Predictions); Azure SQL (Predictions); Azure SQL (COL + TACOPS); dashboard of predictions / alerts; dashboard of operational stats; PowerBI dashboard (Shared with field Ops, customers, MIS, and Engineers)
  • 58. Schneider Electric Architecture
      Stages: DATA SOURCES → INGEST → PREPARE → ANALYZE → PUBLISH → CONSUME
      Layers: Ingestion; Batch; Speed; Presentation
      Paths: Hot Path; Cold Path
      Data Factory: Move Data, Orchestrate, Schedule, and Monitor
      Components: Simulated Sensors and devices; On Premise; Blobs (Reference Data); Event hubs; ASA Job Rule #1; ASA Job Rule #2; Flatten & Metadata Join; Machine Learning; Real-time Scoring; Batch Scoring; Offline Training; Data Lake Analytics; Data Lake Store (Archived Data, Aggregated Data, CSV Data); Hourly, Daily, Monthly Roll Ups; Azure SQL Data Warehouse; Power BI; Cortana; Web/LOB Dashboards
  • 59. Resources
        • Relational database vs Non-relational databases: http://bit.ly/1HXn2Rt
        • Types of NoSQL databases: http://bit.ly/1HXn8Zl
        • What is Polyglot Persistence? http://bit.ly/1HXnhMm
        • Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
        • Hadoop and Microsoft: http://bit.ly/20Cg2hA
  • 60. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)