1. mycervello.com
How to select a modern cloud data
warehouse and get the most out of it
18th June 2019
Slim Baltagi
Director, Big Data & ML
NYC Advanced Analytics Meetup
Jim Leavitt
Vice President
co-CEO & Founder of Cervello
2. © 2019 Cervello, an A.T. Kearney company
Agenda
PART I (Slim Baltagi – 35 minutes)
1. Key Terms
2. Traditional Data Warehouses vs. Modern Data Warehouses
3. How to select a cloud data warehouse?
PART II (Jim Leavitt – 20 minutes)
1. Data era
2. Business use cases
3. How to get/make the most of a modern data warehouse?
PART III (Discussion with attendees – 30 minutes)
1. Key Terms: Cloud
• What exactly is the cloud? According to Gartner, ‘A style of computing in which scalable and
elastic IT-enabled capabilities are delivered as a service using Internet technologies’
• What are the models within the cloud and how do they apply to data warehousing?
Model | Hardware: Datacenter | Software: Data warehouse | Management: Optimizing, tuning, maintenance | Analogy: Kitchen, Cook/Order, Serve, Eat, Clean
On-premises | You | You | You | At home
Infrastructure, IaaS (Example: EC2) | Vendor | You | You | At a vacation home
Platform, PaaS (Example: Amazon Redshift, EMR) | Vendor | Vendor | You | At a fast food restaurant
Software, SaaS (Example: Snowflake) | Vendor | Vendor | Vendor | At a full service restaurant
1. Key Terms: Data Warehouse
• What is a data warehouse? ‘A subject-oriented, integrated, time-variant, non-volatile collection of data in
support of management’s decision-making process.’ Source: Building the Data Warehouse, a book by Bill
Inmon, who is considered to be the father of data warehousing
• Since the beginning, data warehousing has been about the business making better and quicker decisions
• Over the last 30 years, data warehouses evolved from traditional DBMS to specialized analytics DBMS to
out-of-the-box data warehouse appliances to Hadoop/Spark-based data warehouses to ‘cloudified’ and, finally,
built-for-the-cloud data warehouses.
• A ‘cloudified’ data warehouse is a cloud-based representation of a traditional data warehouse (such as
Amazon Redshift and Microsoft Azure SQL Data Warehouse) while a data warehouse ‘built for the cloud’ (such as
Google BigQuery and Snowflake) is a data warehouse designed from the ground up to run natively on the
cloud and fully deliver on the elasticity of the cloud.
• A key factor driving the evolution of data warehousing is the cloud. Why?
• In this modern era of cloud, social networks and mobile, the ‘original’ definition of a data warehouse might
need an update! How about additional attributes such as elastic, scalable (all data, all users), agile, shareable, and
conducive to exploratory analytics? What do you think?
1. Key Terms: Modern
Aspect | Traditional Data Warehouse | Modern Data Warehouse
Offering mode | Product | Service
Deployment mode | On-premises, cloudified | Built for the cloud
Data age | Historical | Historical, Live
Data sources | Traditional such as ERP, CRM, … | Additional such as social media, sensor data, …
Storage & compute | Tightly coupled | Decoupled
Data processing | Batch | Batch, Micro-batch, Streaming
Data format | Structured data | Structured, Semi-Structured & Unstructured data
Data coverage | Only your most critical data | Possibly all of your available data
Data transformation | ETL | ETL, ELT
Usage | Primarily BI | BI, Advanced Analytics, Service
Cost model | Pay upfront | Pay-as-you-go
Usability | More complex | Easier
Management | Burdens on the DBA | Fully managed
Time to market | Usually longer | Usually shorter
Provisioning | Usual delays for infrastructure | Almost instant
Time to insight | Usually longer | Usually shorter
Agility | Schema-on-write | Schema-on-read, Schema-on-write
Flexibility | Static resources | On-demand resources
2. Traditional Data Warehouses: Key Challenges
1. Scalability: growth in data volume, number of users and applications
2. Concurrency: as the number of users increases, they cannot all operate simultaneously
3. Performance: slow running queries, …
4. Resilience: data backup/retention and node failure protection
5. Complexity: from initial implementation to ongoing maintenance
6. Lack of native support for semi-structured data: JSON and XML formats are popular for
data exchange. In traditional data warehouses, data needs to be transformed first and
schema needs to be defined before loading
7. High maintenance overhead in the form of constant indexing, tuning, sorting
8. Handling workload fluctuation: sizing servers for workload peaks and valleys
9. Waste of resources: expensive computing resources are wasted during off peak usage
10. Lack of support for processing streaming data
11. Upfront costs: hardware, software licenses, staffing, …
12. Project delays due to provisioning infrastructure
Surely you are familiar with real-world scenarios such as End of Month Reporting, Intensive Load
Process, Demanding Executive Dashboard Users, …
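Challenge 6 (semi-structured data) can be illustrated with a minimal Python sketch. It is not tied to any particular data warehouse: the events, the column list and the function names are all hypothetical. It only contrasts defining the schema before loading (schema-on-write) with storing the raw JSON and resolving fields at query time (schema-on-read).

```python
import json

# Hypothetical JSON events, as they might arrive from an application or device.
raw_events = [
    '{"user": "ann", "amount": 12.5}',
    '{"user": "bob", "amount": 7.0, "coupon": "SPRING"}',  # new, unplanned field
]

# Schema-on-write: the columns are fixed before loading, so each record is
# transformed to fit them and any unplanned field is dropped.
FIXED_COLUMNS = ("user", "amount")


def load_schema_on_write(lines):
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(tuple(record.get(col) for col in FIXED_COLUMNS))
    return rows


# Schema-on-read: the raw documents are kept as-is and a field is resolved
# only when a query asks for it.
def load_schema_on_read(lines):
    return [json.loads(line) for line in lines]


def query_field(documents, field):
    return [doc.get(field) for doc in documents]


table = load_schema_on_write(raw_events)
documents = load_schema_on_read(raw_events)

print(table)                             # the "coupon" field is gone
print(query_field(documents, "coupon"))  # the field is still available
```

With schema-on-write, the new "coupon" field would require a schema change and a reload before it could be queried; with schema-on-read it is available as soon as the data lands.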
2. Modern Data Warehouses: How do they overcome the above challenges?
Major modern data warehouses are cloud based. Most of them do overcome the key challenges
of traditional data warehouses. Unfortunately, they overcome these challenges to
different degrees, and it is not always a simple yes-or-no check!
1. Scalability: Scale up, down, or off quickly without delay. Yes for Snowflake, Low for Redshift,
Yes for Azure SQL Data Warehouse, Yes for BigQuery (but with limits)
2. Concurrency: Lots of users can operate simultaneously
3. Performance: processing bottlenecks and delays, slow running queries, …
4. Resilience: data backup/retention and node failure protection
5. Complexity: Cloud data warehouses are easier to use than on-premises data warehouses
6. Lack of native support for semi-structured data: Cloud data warehouses do support
semi-structured data such as JSON, but this support varies from one data warehouse to
another.
o Concurrent throughput for JSON: Yes for Snowflake, No for Redshift, No for Azure SQL Data
Warehouse, Moderate for BigQuery.
o High performance for JSON scans: Yes for Snowflake, No for Redshift unless you use the
additional Amazon Redshift Spectrum tool, No for Azure SQL Data Warehouse, Moderate for BigQuery.
2. Modern Data Warehouses
7. High maintenance overhead is not a major issue with managed infrastructure for cloud data
warehouses, but the level of maintenance varies from one data warehouse to another. It
ranges from creating indexes and distribution keys to near-zero management
8. Handling workload fluctuation: With elasticity, cloud data warehouses adapt to workload peaks
and valleys
9. Waste of resources: No more wasted resources during off peak usage! Auto suspend, Auto
resume?
10. Lack of support for streaming data: For example, loading and processing is possible with
Snowpipe, a serverless compute service from Snowflake data warehouse
11. Upfront costs: No upfront costs as the model of cloud data warehouse is pay as you go
12. Project Delays due to provisioning infrastructure: Not Applicable for cloud data warehouse.
With instant provisioning, projects don’t need to wait for infrastructure
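Points 8, 9 and 11 can be made concrete with a small back-of-the-envelope sketch. The hourly demand figures and the flat unit price below are made up for illustration; real pricing differs between vendors and between the two models. The sketch only compares capacity sized for the peak and paid for around the clock with capacity that follows demand.

```python
# Hypothetical hourly demand for one day, in abstract "compute units".
hourly_demand = [2] * 8 + [10] * 4 + [4] * 8 + [2] * 4  # 24 hours

PRICE_PER_UNIT_HOUR = 1.0  # assumed flat price, for illustration only


def fixed_capacity_cost(demand, price):
    """Capacity sized for the peak and paid for every hour of the day."""
    return max(demand) * len(demand) * price


def elastic_cost(demand, price):
    """Capacity follows demand hour by hour (pay-as-you-go)."""
    return sum(demand) * price


fixed = fixed_capacity_cost(hourly_demand, PRICE_PER_UNIT_HOUR)
elastic = elastic_cost(hourly_demand, PRICE_PER_UNIT_HOUR)
idle_share = 1 - elastic / fixed  # share of the fixed capacity that sits idle

print(f"fixed: {fixed}, elastic: {elastic}, idle: {idle_share:.0%}")
```

In this made-up day, more than half of the peak-sized capacity is paid for but never used, which is the waste that elasticity and auto suspend/resume are meant to remove.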
3. How to select a Cloud Data Warehouse?
• You’ll come across some marketing fluff when you are researching and selecting cloud data
warehouses. Example: One cloud data warehouse claims to be 100X faster than another.
Another ‘modern’ one claims 100,000% cost savings compared to a ‘legacy’ one!
• Often cloud data warehouses are compared against only a handful of criteria, and sometimes
even a single criterion: performance, using a workload derived from a so-called industry
standard benchmark.
• A fit-for-purpose comparison across key categories and evaluation criteria, with a focus on your
own use cases, would be more practical. Example of a category: Data & Service Availability.
Related criteria: Failure recovery, Disaster recovery, Data protection, Service monitoring &
alerting.
• At Cervello, we came up with a 4-Step Tool Evaluation Framework that integrates input from
analysts, vendors, our own perspective and the client’s current requirements and future state.
• Be aware that the most popular cloud data warehouse might not be the best fit for your
organization!
• Selecting a cloud data warehouse for your organization is all about figuring out what is the best
fit for your business, your budget and your employees.
Cervello Tool Evaluation Framework
Weighting factor for driving consensus on a decision (the diagram runs from START to FINISH):
• CLIENT (70%): Prioritized and weighted selection criteria based on the desired future state. The client prioritizes and weights key criteria and scores the vendors based on those criteria.
• CERVELLO (15%): Implementation and in-depth expertise. Cervello provides our perspective based on experience and intelligence of the vendors in the marketplace.
• ECOSYSTEM (10%): Vendor ecosystem perspective. Further refine the list based on vendor responses to key criteria as well as information from the vendor ecosystem.
• MARKET (5%): Analyst reports and research. Leverage Gartner, Forrester, and other market analysts to establish a baseline list of vendors to evaluate.
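The four weights on this slide (70%, 15%, 10%, 5%) can be combined into a single score per vendor. The slide gives the weights but not the formula, so the weighted sum below is an assumption, and the vendors and their 0-10 scores are invented for illustration.

```python
# Weights taken from the framework slide; they sum to 1.0.
WEIGHTS = {"client": 0.70, "cervello": 0.15, "ecosystem": 0.10, "market": 0.05}

# Hypothetical 0-10 scores per perspective for two made-up vendors.
scores = {
    "Vendor A": {"client": 8, "cervello": 6, "ecosystem": 7, "market": 9},
    "Vendor B": {"client": 6, "cervello": 9, "ecosystem": 9, "market": 10},
}


def weighted_score(vendor_scores, weights=WEIGHTS):
    """Weighted sum of one vendor's scores (assumed aggregation rule)."""
    return sum(weights[key] * vendor_scores[key] for key in weights)


# Rank vendors from best to worst overall score.
ranking = sorted(scores, key=lambda v: weighted_score(scores[v]), reverse=True)

for vendor in ranking:
    print(f"{vendor}: {weighted_score(scores[vendor]):.2f}")
```

Because the client perspective carries 70% of the weight, Vendor A ranks first here even though Vendor B scores higher on the other three perspectives.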
Some Evaluation Criteria of a Cloud Data Warehouse
Service | Architecture | Database
Mono-Cloud or Multi-Cloud | Separation of Compute and Storage | Support of ANSI SQL
PaaS or SaaS? | Elasticity | Support of semi-structured data
‘Cloudified’ or ‘built for the cloud’ | Concurrency | Support of multiple file formats
Integration with cloud provider services & tools | Performance | Support of diverse data types
Cost Model & Transparency | Scalability | Handling of variable schema
Usability | Support for parallel data upload | Support of window functions
Management Overhead | Support for streaming data | Support of stored procedures
High Availability | Security & Regulations | Query Optimization
Service Maintenance | Built-In Optimization | Metadata & Statistics Maintenance
Configuration | Pause/Resume mechanism | Tuning Where Clauses
Service Limits | Data Ingestion | Tuning Joins
Initialization Time | Support of mixed workloads | Support of advanced analytics functions
Service updates without disruption | Integration with tools such as dataflow ones, BI, ML, … | Support for materialized views
Service Monitoring & Alerting | Continuous Data Protection | Mechanisms for extensibility
Example of a Modern Data Warehouse: Snowflake Data Warehouse
• Snowflake is based on three key innovations:
1. Unique architecture: a unique architecture designed for the cloud and able to provide
complete elasticity for all your concurrent users and applications
2. Database engine that natively handles all your data, both semi-structured and structured,
without sacrificing performance or flexibility
3. Technology that eliminates the need for manual data warehouse management and tuning.
No indexing, tuning, partitioning or vacuuming after loading data => effortless management
• Snowflake architecture consists of three layers. Each one is physically decoupled from the
other layers and scales independently:
1. Data Storage layer uses cloud storage to store all data loaded into Snowflake in a scalable
and inexpensive way
2. Compute layer comprises virtual warehouses: compute resources that execute the data
processing tasks required for queries. The virtual warehouses have access to all of the
data in the storage layer
3. Cloud Services layer coordinates the entire system managing security, optimization and
metadata
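The decoupling of storage and compute described above can be pictured with a toy Python model. This is not Snowflake code; the class names, table and node counts are hypothetical. It only shows two independent compute clusters reading the same stored data, with one of them resized without touching the data or the other cluster.

```python
from dataclasses import dataclass, field


@dataclass
class StorageLayer:
    """Holds all loaded data; knows nothing about compute."""
    tables: dict = field(default_factory=dict)


@dataclass
class VirtualWarehouse:
    """A compute cluster that reads from the shared storage layer."""
    name: str
    nodes: int
    storage: StorageLayer

    def count_rows(self, table: str) -> int:
        return len(self.storage.tables[table])

    def resize(self, nodes: int) -> None:
        # Scaling compute does not move or copy any stored data.
        self.nodes = nodes


storage = StorageLayer()
storage.tables["orders"] = [{"id": 1}, {"id": 2}, {"id": 3}]

# Two workloads get their own compute but share one copy of the data.
etl = VirtualWarehouse("etl", nodes=4, storage=storage)
bi = VirtualWarehouse("bi", nodes=1, storage=storage)

etl.resize(8)  # scale the load workload only

print(etl.nodes, bi.nodes, etl.count_rows("orders"), bi.count_rows("orders"))
```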
Snowflake’s multi-cluster, shared data architecture
Snowflake reference architecture
• Data Sources/Data Providers: On-premises or Cloud, Structured or Semi-Structured Data
• Dataflow Tools: Streaming Data Platforms, ETL and ELT platforms (E: Extract, L: Load, T: Transform)
• Data Storage: S3/Azure Storage (missing in the architecture diagram!)
• Analytics Services: BI, UI, Data Science, …
Trying Snowflake data warehouse
• A few unique features:
• Multi-cloud: Snowflake is the only cloud data warehouse that is offered on Amazon Web Services (AWS), Microsoft Azure
and soon on Google Cloud Platform.
• Zero-copy cloning: A technique used to quickly replicate a database, without any physical data copy, to
build a fully populated test environment. This can ease the burden on DevOps, as terabytes of data can be
cloned within seconds, with subsequent inserts and updates allowed on the new data set.
• Time Travel: Time Travel allows you to track the change of data over time. This feature is available to all
accounts and enabled by default. It allows you to access the historical data of a table. One can find
out how the table looked at any point in time within the retention period (up to 90 days, depending on the Snowflake edition).
• Fail-safe: Fail-safe ensures historical data is protected in the event of disasters such as disk failures or any
other hardware failures. Snowflake provides 7 days of Fail-safe protection of data, which can be recovered
only by Snowflake in the event of a disaster.
• Data Sharing: Provides access to both compute and data resources to external partners or subsidiaries on
a read-only basis. This avoids the need to build multiple Extract Transform and Load (ETL) pipelines to
external users, and avoids the need for Change Data Capture (CDC) routines when the warehouse data is
updated, as users always view the latest data.
• Try it yourself: most cloud data warehouses offer a free trial. Example:
• Snowflake Free trial! https://bit.ly/2WMVKp3
• You can build a short term POC in either AWS or Azure while using US $400 credit for 30 days for both
compute and storage. Snowflake on Google Cloud Platform (GCP) is being offered soon.
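During a trial, zero-copy cloning and Time Travel are each a single SQL statement. The sketch below only builds the statement text in Python, using Snowflake's CREATE ... CLONE and AT(OFFSET => ...) syntax; the database and table names are hypothetical, and running the statements would require a Snowflake session, which is left out here.

```python
def clone_database_sql(source_db: str, target_db: str) -> str:
    """Zero-copy clone: a new database that initially shares the source's storage."""
    # Identifiers are interpolated directly; fine for a sketch, not for untrusted input.
    return f"CREATE DATABASE {target_db} CLONE {source_db};"


def time_travel_select_sql(table: str, seconds_ago: int) -> str:
    """Time Travel: query the table as it was `seconds_ago` seconds in the past."""
    return f"SELECT * FROM {table} AT(OFFSET => -{seconds_ago});"


# Hypothetical names: build a test environment and look one hour back.
print(clone_database_sql("PROD_DB", "TEST_DB"))
print(time_travel_select_sql("ORDERS", 3600))
```

The offset must fall within the table's Time Travel retention period, otherwise the query fails.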
1. Data Era: Challenges + Opportunities
2. Next Generation of Business Use Cases
New Use Cases
• Data as a Service
• Self-Service Analytics
• Advanced Analytics Lab
• Real-Time Insight
• Single View of Customer
Data Consumers
• Partners, Suppliers, and Vendors
• Executives, Managers, and Analysts
• Data Scientists
• Embedded Applications
• Many Business Users
Medical Device Client: Data + Analytics Overview
Medical Device Client: Data + Analytics Benefits
Cost
• $500K IT expense reduction from cloud and Snowflake migration and consolidated BI toolset
• Continuous process simplification
• $1M annual savings from freight consolidation analytics
• 500 annual hours shifted to higher value activities because of end-to-end automated reporting (e.g., fill rates, open orders)
Revenue
• $800K in annual revenue recovery through contract enforcement analytics
• Improved patient + physician experience (in software product) with embedded reporting
Quality
• Data validation checks built into daily processing
Future
• Smart medical devices (IoT)
3. How To Drive Value
Technical + business approach
• Business engagement
• Best practices/advisory
• System integration
Data asset
• Think Big (but organized)
• Internal + external
• Structured/unstructured, Governed/loosely governed
Modern technology stack
• Flexible and modular
• Scalable on demand
• Machine Learning + Artificial Intelligence
Process + organization
• Supply (Data Engineers) + consumption (Scientists/Analysts)
• Automated insights = decision-making opportunity
• Organizational opportunity
Cervello Profile
We are a professional services firm focused on helping organizations win with data. We have a global presence with an ability to service customers across
industries. We are organized around three practices that are underpinned by our expertise in modern data architectures.
Practices: PERFORMANCE MANAGEMENT, SUPPLIER + CUSTOMER RELATIONSHIPS, DATA MONETIZATION + PRODUCTS
WHAT WE DELIVER
• DATA MANAGEMENT & MACHINE LEARNING: We embed data management processes and use machine learning to improve data quality
• MODERN DATA ARCHITECTURE & PLATFORMS: Using leading cloud technology platforms, we develop and design analytics-ready connected data solutions
• BUSINESS DRIVEN INSIGHTS & ACTIONS (ANALYTICS & DATA SCIENCE, MODELING & PLANNING, MOBILE EXPERIENCE): We provide different methods to consume data to facilitate insights and actions inside and outside the organization
HOW WE DELIVER
• STRATEGY
• BUILD SERVICES
• RUN & COE SERVICES
Our teams are located in Boston, New York, Dallas, London and Bangalore
25. Boston | New York | Dallas | London
Get in touch with contributors:
Thank You!
Learn more about Cervello at
mycervello.com
Slim Baltagi
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi/
@SlimBaltagi
Jim Leavitt
jleavitt@mycervello.com
https://www.linkedin.com/in/jimleavitt/