Data is the foundation of any meaningful corporate initiative. Fully master the necessary data, and you’re more than halfway to success. That’s why leverageable (i.e., multiple-use) artifacts of the enterprise data environment are so critical to enterprise success.
Build them once (and keep them updated), and use them again many times for many diverse ends. The data warehouse remains strongly focused on this goal. That may be why, nearly 40 years after the first database was labeled a “data warehouse,” analytic database products still target the data warehouse.
4. Proprietary + Confidential
*Source: https://emtemp.gcom.cloud/ngw/globalassets/en/information-technology/documents/trends/gartner-2019-cio-agenda-key-takeaways.pdf
Rebalance Your Technology Portfolio Toward Digital Transformation
Gartner: Digital-fueled growth is the top investment priority for technology leaders.*

Technology area | % of respondents increasing investment | % decreasing investment
Business Intelligence or data analytics solution | 45% | 1%
Cyber/information security | 40% | 1%
Cloud services or solutions (SaaS, PaaS, etc.) | 33% | 2%
Core system improvements/transformation | 31% | 10%
5. Governed metrics | Best-in-class APIs | In-database | Git version control | Security | Cloud
• Integrated Insights: Sales reps enter discussions equipped with more context and usage data embedded within Salesforce.
• Data-driven Workflows: Reduce customer churn with automated email campaigns if customer health drops.
• Custom Applications: Maintain optimal inventory levels and pricing with merchandising and supply chain management applications.
• Modern BI & Analytics: Self-service analytics for install operations, sales pipeline management, and customer operations.
(Diagram: SQL in, results back.)
6. Technology Layers
• “API-first” extensibility
• Semantic modeling layer
• In-database architecture
• Built on the cloud strategy of your choice
7. Empower People with the Smarter Use of Data
• 1 in 2 customers integrate insights/experiences beyond Looker
• 2000+ customers
• 5000+ developers
10. Comparing the Enterprise
Analytic Solutions
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444
11. William McKnight
President, McKnight Consulting Group
• Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade,
Teva Pharmaceuticals, Verizon, and many other Global
1000 companies
• Frequent keynote speaker and trainer internationally
• Hundreds of articles, blogs and white papers in publication
• Focused on delivering business value and solving business
problems utilizing proven, streamlined approaches to
information management
• Former Database Engineer, Fortune 50 Information
Technology executive, and Ernst & Young Entrepreneur of
the Year Finalist
• Owner/consultant: Data strategy and implementation
consulting firm
17. Best Category and Top Tool Picked
(Chart: probability that platform selection leads to success rises from “Same Ol’ Platform” through “Top 2 Category Picked” to “Best Category Picked”; scale 50%–80%.)
20. Modern Use Cases
(Diagram: A machine learning pipeline spanning the data warehouse and data lake. Historical transaction data is split into categorical data, used to train and score a categorical model (e.g., a decision tree), and quantitative data, used to train and score a quantitative model (e.g., a regression). Models are evaluated, then deployed to score real-time transactions and drive actions.)
21. Analytics Reference Architecture
(Diagram: Logs from apps, web, and devices, user tracking, operational metrics, sensors, and transactional/context data offloaded from OLTP/ODS systems land in raw data topics (JSON, Avro). Stream processing turns raw topics into processed data topics, feeding low-latency applications. ETL, or EL with the T done in Spark, moves data in batch into a governed data lake and a distributed analytical warehouse; the warehouse can also reach through to the lake, or load via ETL/ELT or file import, and serves applications and files. Data governance spans the whole stack.)
25. Beyond Performance Checklist
• Cost Predictability and
Transparency
• Multi-Cluster Costs
• In-Database Machine Learning
• SQL Compatibility
• Provisioning Workloads with
Security Controls
• ML Security same as Database
Security
• Resource Elasticity
• Automated Resource Elasticity
• Granular Resource Elasticity
• Licensing Structure
• Cost Conscious Features
• Data Storage Alternatives
• Unstructured and Semi-Structured
Data Support
• Streaming Data Support
• Connectivity with standard ETL and
Data Visualization software
• Concurrency Scaling
• Seamless Upgrades
• Hot Pluggable Components
• Single Point of Entry for System
Administration
• Easy Administration
• Optimizer Robustness
• Disaster Recovery
• Workload Isolation
26. • Log in to the AWS Console: https://console.aws.amazon.com/
• Create and Launch EC2 Instance
– Choose your Amazon Machine Image (AMI)
– Choose your Instance Type
– Add Storage
– Configure Security Group
– PEM Key Pair
– Connect/SSH to Instance
– Access keys
• Set up S3 Storage
– Create bucket
• Set up Redshift
– Identity & Access Management
– Create Role and Attach Policies
– Configure Cluster
– Launch Cluster
• Load Data
• Query Data
Enterprise Analytic Solutions Setup (i.e.,
EC2, S3, Redshift)
27. • Create Statistics
• Manual & Automatic Snapshots
• Distribution Keys
• Elastic Resize
• Vacuum Tables
• Cluster Parameter Group (for Workload
Management)
• Short Query Acceleration
Other Concepts (Redshift example)
28. • Azure SQL Data Warehouse is scaled by Data Warehouse Units (DWUs) which are
bundled combinations of CPU, memory, and I/O. According to Microsoft, DWUs
are “abstract, normalized measures of compute resources and performance.”
• Amazon Redshift uses EC2-like instances with tightly coupled compute and storage, which
is a “node” in the more conventional sense
• Snowflake “nodes” are loosely defined as a measure of virtual compute
resources. Their architecture is described as “a hybrid of traditional shared-disk
database architectures and shared-nothing database architectures.” Thus, it is
difficult to infer what a “node” actually is.
• Google BigQuery does not use the concept of a node at all, but instead refers to
“slots” as “a unit of computational capacity required to execute SQL queries"
Different Terminology
33. Actian Avalanche
• MPP relational columnar database built to deliver high performance at low TCO both in the cloud and
on-prem for BI and operational analytics use cases.
• Actian Avalanche is based on its underlying technology, known as Vector. The basic architecture of
Actian Avalanche is the Actian patented X100 engine, which utilizes a concept known as "vectorized
query execution" where processing of data is done in chunks of cache-fitting vectors.
• Avalanche performs “single instruction, multiple data” processes by leveraging the same operation on
multiple data simultaneously and exploiting the parallelism capabilities of modern hardware. It
reduces overhead found in conventional "one-row-at-a-time processing" found in other platforms.
Additionally, the compressed column-oriented format uses a scan-optimized buffer manager.
• The measure of Actian Avalanche compute power is known as Avalanche Units (AU). The price is per
AU per hour and includes both compute and cluster storage.
• It’s a pure column store
• Compression is typically 5:1
• Multi-Core Parallelism
• CPU Cache is Used as Execution Memory – Process data in chip cache not RAM
• Storage indexes are created automatically, quickly identifying candidate data blocks for solving
queries
• Fast and cost-effective
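The vectorized-execution idea, one operation applied across a cache-fitting chunk of values rather than one-row-at-a-time processing, can be illustrated with a toy Python sketch (not Actian code; the chunk size is an arbitrary stand-in):

```python
CHUNK = 1024  # stand-in for a cache-fitting vector size; real engines tune this

def chunked_sum(column):
    # Instead of touching one row at a time, apply one operation to a whole
    # vector of values, amortizing per-row overhead across the chunk.
    total = 0
    for i in range(0, len(column), CHUNK):
        total += sum(column[i:i + CHUNK])  # one call over many values
    return total
```

The per-chunk call is where a real engine applies a single instruction to multiple data elements at once.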
34. Amazon Redshift
• Amazon Redshift was the first managed data warehouse service and continues to get a high
level of mindshare in this category.
• One of the interesting features of Redshift is result set caching.
• At the enterprise class, Redshift dense compute nodes (dc2.8xlarge) have 2.56TB per node of
solid state drives (SSD) local storage. Their dense storage nodes (ds2.8xlarge) have 16TB per
node, but it is on spinning hard disks (HDD) with slower I/O performance.
• Redshift has some future-proofing (like Spectrum and short query acceleration) that a modern
data engineering approach might utilize. Short query acceleration uses machine learning to
provide higher performance, faster results, and better predictability of query execution times.
• Amazon Redshift is a fit for organizations needing a data warehouse with a clear, consistent
pricing model. Amazon Web Services supports most of the databases in this report, and then
some; Redshift is not the only analytic database on AWS, although the two are sometimes
conflated.
35. Azure Synapse
• Azure SQL Data Warehouse made its debut for public use in mid-2016. This is a managed
service, dedicated data warehouse offering from the DATAllegro/PDW/APS legacy. Azure SQL
Data Warehouse Gen 2, optimized for compute, is a massive parallel processing and shared
nothing architecture on cluster nodes each running Azure SQL Database—which shares the
same codebase as Microsoft SQL Server.
• Azure SQL Data Warehouse supports 128 concurrent queries, a relatively high number.
• Microsoft also has a deep partnership with Databricks, which is becoming very popular in the
data science community. The partnership uses Azure Active Directory to log into the database.
• Overall, Azure SQL Data Warehouse continues to be an excellent choice for companies needing a
high-performance, scalable analytical database in the cloud, or to augment a current on-premises
offering with a hybrid architecture at a reasonable cost.
36. Cloudera Data Warehouse Service
• Cloudera Data Warehouse (CDW) boasts flexibility through support for both data center
and multiple public cloud deployments, as well as capabilities across analytical,
operational, data lake, data science, security, and governance needs.
• CDW is part of CDP, a secure and governed cloud service platform that offers a broad set
of enterprise data cloud services with the key data functionality for the modern
enterprise. CDP was designed to address multi-faceted needs by offering multi-function
data management and analytics to solve an enterprise’s most pressing data and analytic
challenges in a streamlined fashion.
• The architecture and deployment of CDP begins with the Management Console, where
several important tasks are performed. First, the preferred cloud environment (for
example, AWS or Azure) is set up. Second, data warehouse clusters and machine learning
(ML) workspaces are launched. Third, additional services, such as Data Catalog, Workload
Experience Manager, and Replication Manager are utilized, if required.
• The Cloudera Data Warehouse service provides self-service independent virtual
warehouses running on top of the data kept in a cloud object store, such as S3.
37. Google BigQuery
• Google BigQuery has the most distinctive approach to cloud analytic databases, with an
ecosystem of products for data ingestion and manipulation and a unique pricing apparatus.
• The back end is abstracted; BigQuery acts as a RESTful front end to all the Google Cloud storage
needed, with all data replicated geographically and Google managing where queries execute.
(The customer can choose the jurisdictions of their storage according to their safe-harbor and
cross-border restrictions.)
• Pricing is by data and query, including Data Definition Language (DDL), or by flat rate pricing by
“slot,” a unit of computational capacity required to execute SQL queries. This price model may
make sense for high-data usage customers. Google also lowers the cost of unused storage.
• Google Marketing Platform data (including the former DoubleClick), Salesforce.com,
AccuWeather, Dow Jones, and 70+ other public data sets can be included in the
BigQuery dataset.
• Billing is based on the amount of data you query and store. Customers can pre-purchase flat-rate
computation in monthly increments of 500 slots. However, Google recently introduced Flex Slots,
which allow slot reservations as short as one minute, billed by the hour. There is a separate
charge for active storage of data.
38. Micro Focus Vertica in Eon Mode
• Vertica is owned by Micro Focus, which introduced the Vertica in
Eon Mode deployment as the way to set up a Vertica cluster in the
cloud. Vertica in Eon Mode is a fully ANSI SQL-compliant relational
database management system that separates compute from storage.
• Vertica is built on a massively parallel processing (MPP), columnar
architecture that scales and provides high-speed analytics.
• Vertica offers two deployment modes – Vertica in Enterprise Mode
and Vertica in Eon Mode. Vertica in Eon Mode uses a dedicated
Amazon S3 bucket for storage, with a varying number of compute
nodes spun up as necessary to meet the demands of the workloads.
• Vertica in Eon Mode also allows the database to be turned “off”
without cluster disruption when turned back “on.” Vertica in Eon
Mode also has workload management and its compute nodes can
access ORC and Parquet formats in other S3 clusters.
39. Snowflake Data Warehouse
• Snowflake Computing was founded in 2012 as the first data warehouse purpose-built for the
cloud. Snowflake has seen tremendous adoption, including international accounts and
deployment on Azure cloud.
• Snowflake’s compute scales in full-cluster increments, with node counts in powers of two.
Spinning up or down is instant and requires no manual intervention, resulting in leaner
operations. Snowflake scales linearly with cluster size (i.e., for a four-node cluster, moving to the
next incremental size results in a four-node expansion).
• Regarding billing, you pay per second only for the compute in use.
• On Amazon AWS, Snowflake is architected to use Amazon S3 as its storage layer and has a native
advantage of being able to access an S3 bucket within the COPY command syntax. On Microsoft
Azure, it uses Azure Blob store.
• The UI is well regarded.
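Snowflake’s per-second, per-node billing model can be illustrated with a toy calculator (the size-to-node mapping and the credit rate are illustrative assumptions, not Snowflake’s published tables):

```python
# Illustrative warehouse sizes; node counts double at each step (powers of two).
WAREHOUSE_NODES = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def credits_billed(size, seconds_running, credits_per_node_hour=1.0):
    # Per-second billing: you pay only for the seconds compute is in use.
    nodes = WAREHOUSE_NODES[size]
    return nodes * credits_per_node_hour * seconds_running / 3600.0
```

For example, a medium (4-node) warehouse running for 30 minutes at one credit per node-hour would bill two credits.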
40. Teradata Vantage
• Teradata is available on Amazon Web Services, Teradata
Cloud, VMware, Microsoft Azure, on-premises, and
IntelliFlex – Teradata’s latest MPP architecture with
separate storage and compute.
• With Vantage, Teradata is still the gold standard in
complex mixed workload query situations for enterprise-
level, worry-free concurrency as well as scaling
requirements and predictably excellent performance
featuring top notch non-functional requirements.
• Dynamic resource prioritization and workload
management.
41. Understanding Pricing 1/2
• The price-performance metric is dollars per query-hour ($/query-hour).
– This is defined as the normalized cost of running a workload.
– It is calculated by multiplying the rate offered by the cloud platform vendor by the number of compute nodes used
in the cluster, and dividing by the aggregate execution time of the workload.
• To determine pricing, each platform has different options. Buyers should be
aware of all their pricing options.
• For Azure SQL Data Warehouse, you pay for compute resources as a function
of time.
– The hourly rate for SQL Data Warehouse varies slightly by region.
– Also add the separate storage charge for storing the (compressed) data at a rate of $ per TB
per hour.
• For Amazon Redshift, you also pay for compute resources (nodes) as a
function of time.
– Redshift also has reserved instance pricing, which can be substantially cheaper than on-
demand pricing, available with 1 or 3-year commitments and is cheapest when paid in full
upfront.
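The $/query-hour metric can be sketched as a small calculator (a minimal reading of the definition above; the function name and the treatment of aggregate execution time as workload throughput in queries per hour are assumptions):

```python
def dollars_per_query_hour(rate_per_node_hour, nodes, workload_hours, num_queries):
    # Hourly cost of the whole cluster: vendor rate times node count.
    cluster_rate = rate_per_node_hour * nodes
    # Workload throughput: queries completed per hour of execution.
    queries_per_hour = num_queries / workload_hours
    # Normalized price-performance; lower is better, and a faster
    # platform (higher throughput) lowers the figure.
    return cluster_rate / queries_per_hour

# Example: a 4-node cluster at $2.00/node-hour finishing 16 queries in 2 hours.
print(dollars_per_query_hour(2.0, 4, 2.0, 16))  # 1.0
```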
42. Understanding Pricing 2/2
• For Snowflake, you pay for compute resources as a function of time—just
like SQL Data Warehouse and Redshift.
– However, you choose the hourly rate based on the enterprise features you need
(“Standard”, “Premier”, “Enterprise”/multi-cluster, “Enterprise for Sensitive Data”, and
“Virtual Private Snowflake”)
• With Google BigQuery, one option is to pay for bytes processed at $ per TB
– There’s also BigQuery flat rate
• Azure SQL Data Warehouse pricing was found at https://azure.microsoft.com/en-us/pricing/details/sql-data-
warehouse/gen2/.
• Amazon Redshift pricing was found at https://aws.amazon.com/redshift/pricing/.
• Snowflake pricing was found at https://www.snowflake.com/pricing/.
• Google BigQuery pricing was found at https://cloud.google.com/bigquery/pricing.
43. Design Your Benchmark
• What are you benchmarking?
– Query performance
– Load performance
– Query performance with concurrency
– Ease of use
• Competition
• Queries, Schema, Data
• Scale
• Cost
• Query Cut-Off
• Number of runs/cache
• Number of nodes
• Tuning allowed
• Vendor Involvement
• Any free third party, SaaS, or on-demand software (e.g., Apigee or SQL Server)
• Any not-free third party, SaaS, or on-demand software
• Instance type of nodes
• Measure Price/Performance!
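A minimal harness for the query-performance and runs/cache items above might look like the following (a sketch under assumptions: `execute` stands in for whatever client call issues one query against the system under test, and `runs` must be at least 2):

```python
import time
from statistics import mean

def run_benchmark(queries, execute, runs=3):
    # queries: mapping of name -> SQL text; execute: callable issuing one query.
    # Reporting the first (cold) run separately from the mean of later
    # (warm-cache) runs makes cache effects visible in the results.
    results = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            execute(sql)
            timings.append(time.perf_counter() - start)
        results[name] = {"cold": timings[0], "warm_mean": mean(timings[1:])}
    return results
```

Concurrency testing would wrap `execute` in multiple threads or processes; price/performance then comes from combining these timings with the cluster’s hourly rate.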
44. Summary
• Data professionals are sitting on the future of the organization
• Data architecture is an essential organizational skill
• Artificial intelligence will drive the organization for the future
• All need a high-standard data warehouse
• Cloud analytic databases are for most organizational workloads
• Adopt a columnar orientation to data for analytic workloads
• Data lakes are becoming essential
• Use cloud storage or managed Hadoop for the data lake
• Keep an eye on developments in information management and how
they apply to your organization
45. Comparing the Enterprise
Analytic Solutions
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444