Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here?
In this webinar, we look at this foundational technology for modern Data Management and show how it evolved to meet the workloads of today, as well as when other platforms make sense for enterprise data.
ADV Slides: The Evolution of the Data Platform and What It Means to Enterprise Analytic Strategy
1. The Evolution of the Data Platform and What It Means to Enterprise Analytic Strategy
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444
3. Where AtScale Fits in the New Analytics Stack
LAYER (FUNCTION)
CONSUMPTION (Visualization, Analysis, Reporting)
SEMANTIC LAYER (Query Access, Filtering, Masking, Auditing)
PREPARED DATA (Data Processing, Modeling)
RAW DATA (Data Storage, Encryption)
DATA TRANSFORMATION (ETL, Merging, Aggregation)
COMPONENT
BI Tools
AI/ML Tools
Applications
Multi-dimensional Engine
Data Governance Engine
Virtualization Engine
Data Warehouse
File Access Engine
ETL Engine
File System (Data Lake)
Data Catalog
4. AtScale Proof Points
Mitigate Risks Associated with Data & Analytics
● Tyson Foods migrated from HDP to Redshift in 1 click
● UHG used offshore resources while protecting PII
● Home Depot supports internal & external users
One Consistent/Compliant View of Business Metrics & Definitions
● Cigna runs PMPM calculations across all dimensions
● Wayfair migrated to GCP while keeping OLAP
● Kohl’s can now deliver inventory management to LOB
Control Complexity & Cost of Analytics
● Home Depot serves 10x more analysts for same cost
● BOL.com reduced their BigQuery costs by 91%
● Wayfair increased their data retention by 5x
Accelerate Data Driven Decisions at Scale
● Koch Industries improved query performance by 40x
● Kohl’s eliminated a 9-month new data prep cycle
● BOL.com delivers 93% of queries in under 1 second
5. AtScale: How We Do It
Universal Semantic Layer
✓ Supports SQL and OLAP
✓ One location to define business metrics
✓ Abstracts physical location & format
Intelligent Data Virtualization
✓ Works with “Live” data
✓ Supports on-premises & Cloud
✓ Automates performance management
Autonomous Data Engineering
✓ Automates data engineering tasks
✓ Proactively tunes performance
✓ Reduces & makes cloud costs predictable
Security & Governance
✓ Extends security frameworks to cloud
✓ Inherits security of data platform
✓ Universal row & column level security
6. AtScale’s Adaptive Analytics Fabric
Business Intelligence & Analytics Tools
Interfaces: API (REST), SQL (JDBC / ODBC), MDX (XMLA)
Visual Data Modeler
Logical Security Layer
Metadata Repository: Semantic Model / Data Catalog / Data Lineage Map
Multi-Dimensional Engine
Virtualized Acceleration Structures
Data Abstraction Layer
Distributed Query Engine
AI Engine (New)
Physical Security & Governance Layer
Big Data Platforms & Engines
7. TPC-DS 10TB Benchmark: Improvements with AtScale
Improvement Factor with AtScale
Test | BigQuery | Redshift | Snowflake | Synapse | Databricks
Query Performance (1) | 8x Faster | 12x Faster | 6x Faster | 3x Faster | 12x Faster
User Concurrency (2) | 20x Faster | 61x Faster | 16x Faster | 9x Faster | 86x Faster
Improved ROI (3) | 10x Better | 2.6x Better | 4x Better | 2x Better | 11x Better
Complexity (4) | 76% less complex SQL queries
Notes:
(1) Elapsed time for executing 1 query five times
(2) Elapsed time executing 1 (x5), 5, 25, 50 queries
(3) Compute costs for cluster time (Redshift, Snowflake) or bytes read (BigQuery) for user concurrency test
(4) Complexity score for SQL queries for number of: functions, operations, tables, objects & subqueries (AtScale = 258, TPC-DS = 1,057)
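The "76% less complex" figure can be checked against the two complexity scores quoted above (AtScale = 258, TPC-DS = 1,057). The sketch below is only an arithmetic check; the function name is illustrative and it assumes the percentage is a simple relative reduction of the score.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Complexity scores as quoted on the slide.
TPCDS_SCORE = 1057
ATSCALE_SCORE = 258

reduction = percent_reduction(TPCDS_SCORE, ATSCALE_SCORE)
print(f"{reduction:.1f}% less complex")
```

Rounded to a whole number, the result agrees with the 76% shown in the table.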
8. William McKnight
President, McKnight Consulting Group
• Frequent keynote speaker and trainer internationally
• Consulted to many Global 1000 companies
• Hundreds of articles, blogs, white papers, field tests, etc. in publication
• Focused on delivering business value and solving business problems utilizing proven, streamlined approaches to information management
• Former Database Engineer, Fortune 50 Information Technology executive and Ernst & Young Entrepreneur of the Year Finalist
• Owner/consultant: Data strategy and implementation consulting firm
• 25+ years of information management and data experience
9. McKnight Consulting Group Offerings
Strategy
• Trusted Advisor
• Action Plans
• Roadmaps
• Tool Selections
• Program Management
Training
• Classes
• Workshops
Implementation
• Data/Data Warehousing/Business Intelligence/Analytics
• Master Data Management
• Governance/Quality
• Big Data
11. Data is Under Management when it is…
• In a leverageable platform
• In an appropriate platform for its profile and usage
• With high non-functionals (availability, performance, scalability, stability, durability, security)
• Captured at the most granular level
• At a data quality standard (as defined by Data Governance)
12. AI Data
• Call center recordings and chat logs
• Streaming sensor data, historical maintenance records and search logs
• Customer account data and purchase history
• Email response metrics
• Product catalogs and data sheets
• Public references
• YouTube video content audio tracks
• User website behaviors
• Sentiment analysis, user-generated content, social graph data, and other external data sources
18. Data Scientist Workbench and Data Warehouse Staging
[Diagram labels: Cloud Storage; Data Lake; Data Scientists; OLTP Systems; ERP; CRM; Supply Chain; MDM; …; DI; Stream or Batch Updates; Data Warehouse; Data Mart; Real-Time, Event-Driven Apps]
19. Graph Databases
[Diagram: graph with two vertices labeled "Bridge vertex"]
• Subject: John R Peterson Predicate: Knows Object: Frank T Smith
• Subject: Triple #1 Predicate: Confidence Percent Object: 70
• Subject: Triple #1 Predicate: Provenance Object: Mary L Jones
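The three triples above show a statement ("Knows") plus two statements about that statement (its confidence and provenance). A minimal sketch of that idea follows; the `Triple` class and the `about` helper are illustrative names, not part of any particular graph database's API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Triple:
    """A subject-predicate-object statement; the subject may itself be a Triple."""
    subject: Union[str, "Triple"]
    predicate: str
    obj: Union[str, int, "Triple"]


# Triple #1 from the slide, followed by two triples that describe it.
triple_1 = Triple("John R Peterson", "Knows", "Frank T Smith")
statements = [
    triple_1,
    Triple(triple_1, "Confidence Percent", 70),
    Triple(triple_1, "Provenance", "Mary L Jones"),
]


def about(target, triples):
    """Return {predicate: object} for every triple whose subject is `target`."""
    return {t.predicate: t.obj for t in triples if t.subject == target}


metadata = about(triple_1, statements)
print(metadata)
```

Using a triple as the subject of another triple is what lets the graph carry confidence and provenance without adding columns to a fixed schema.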
21. Increasing Probability that Platform Selection Leads to Success
[Chart: categories "Best Category and Top Tool Picked", "Best Category Picked", "Top 2 Category Picked", "The No Decision Platform"; scale marks 80%, 70%, 60%, 50%]
22. 3 Major Decisions
• Decision #1: The Data Store Type
– The largest factor in choosing between databases and file-based scale-out systems is the data profile. The latter are best for data that fits the loose label of 'unstructured' (or semi-structured) data, while more traditional data -- and smaller volumes of all data -- still belong in a relational database.
• Decision #2: Data Store Placement
– You must also decide where to place your data store -- on-premises or in the cloud (and which cloud). In the past, the only clear choice for most organizations was on-premises. However, the costs of scale are gnawing away at the notion that this remains the best approach for a data platform. For more on why databases are moving to the cloud, please read this article.
• Decision #3: The Workload Architecture
– Finally, you must keep in mind the distinction between operational and analytical workloads. Short transactional requests and more complex (often longer) analytics requests demand different architectures. Analytics databases, though quite diverse, are the preferred platforms for the analytics workload.
(and Price)
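Decisions #1 and #3 can be read as a small rule of thumb. The sketch below encodes only what the slide states; the `DataProfile` fields and the `large_volume` flag are hypothetical simplifications, and Decision #2 (placement) is left out because it depends on cost factors the slide does not quantify.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    structured: bool    # traditional, row-and-column data
    large_volume: bool  # hypothetical flag; the threshold is yours to define
    analytical: bool    # True for analytics requests, False for short transactions


def suggest_store(profile: DataProfile) -> str:
    """Apply Decision #1 (store type) and Decision #3 (workload) from the slide."""
    # Decision #1: unstructured or semi-structured data at volume suits a
    # file-based scale-out system; traditional data, and smaller volumes of
    # all data, still belong in a relational database.
    if not profile.structured and profile.large_volume:
        store = "file-based scale-out system"
    else:
        store = "relational database"
    # Decision #3: operational and analytical workloads need different architectures.
    workload = "analytical" if profile.analytical else "operational"
    return f"{workload} workload on a {store}"


print(suggest_store(DataProfile(structured=False, large_volume=True, analytical=True)))
print(suggest_store(DataProfile(structured=True, large_volume=True, analytical=False)))
```

A real selection would weigh many more factors; this only makes the two stated distinctions explicit.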
23. What is the Data Platform for?
• Operational Database
• Operational Real-Time
• Operational Big Data
• Operational Data Hub
• Master Data Management
• A Data Warehouse
• A Data Mart
– Dependent
– Independent
• A Data Lake
• Analytic Big Data Application
• Archive Storage
• A Staging Area
24. Analytics Reference Architecture
[Diagram labels: Logs (Apps, Web, Devices); User tracking; Operational Metrics; Sensors; Transactional/Context Data; OLTP/ODS; Offload data; Raw Data Topics (JSON, AVRO); Processed Data Topics; Stream Processing; ETL, or EL with T in Spark; Batch; Low Latency; Applications; Files; In-database analytics; Reach through or ETL/ELT; Data Warehouse]
25. Data Warehousing
• Data Warehouses (still) have a lower total cost of ownership than data marts
• A data warehouse is a SHARED platform
– Build once, use many
– Access at Data Warehouse
– Access by creating a mart off the DW
• Still A LOT cheaper than building from scratch
“… a subject-oriented, integrated, non-volatile, time-variant collection of data, organized to support management needs.” -- Bill Inmon
26. Data Warehouses Have Flavors
● The Customer Experience Transformation Data Warehouse focuses on customer attributes and touchpoints to improve the value of customers.
● The Asset Maximization with IoT Data Warehouse deals with the high volume of edge data tracking the physical assets of the organization.
● The Operational Extension Data Warehouse supports company operations directly with real-time analytics.
● The Risk Management Data Warehouse supports the ever-growing compliance and reporting requirements and corporate risk.
● The Finance Modernization Data Warehouse handles the voluminous financial reporting and ensures the bottom line is considered in every aspect of the business.
● The Product Innovation Data Warehouse delivers all product-related information into the decisions of the product life cycle.
28. What is a Node?
• Azure SQL Data Warehouse is scaled by Data Warehouse Units (DWUs), which are bundled combinations of CPU, memory, and I/O. According to Microsoft, DWUs are “abstract, normalized measures of compute resources and performance.”
• Amazon Redshift uses EC2-like instances with tightly coupled compute and storage, each of which is a “node” in a more conventional sense.
• Snowflake “nodes” are loosely defined as a measure of virtual compute resources. Their architecture is described as “a hybrid of traditional shared-disk database architectures and shared-nothing database architectures.” Thus, it is difficult to infer what a “node” actually is.
• Google BigQuery does not use the concept of a node at all, but instead refers to “slots” as “a unit of computational capacity required to execute SQL queries,” which is also a vague and abstract concept to the average user.
29. Costing the Platform
• With many vendors, you pay for compute resources as a function of time
– The hourly rate can vary slightly by region
– You may also choose the hourly rate based on certain enterprise features you need
– You also need to add the separate storage charge for the terabytes of (compressed) data stored
• This is also expressed per hour, although it is substantially less than compute
• Alternatively, some cloud vendors have consumption-based pricing models, where instead of paying by the hour, you pay by the byte processed
– You would multiply the terabytes of data by the on-demand dollars-per-terabyte price
– There is also a cost-per-hour flat rate, where you would need to calculate how long it would take to run your queries to completion
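The three pricing shapes described on this slide can be sketched as simple formulas. Every rate and volume below is a made-up placeholder, not a vendor price, and the 730-hour month is an assumption for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def time_based_cost(compute_rate_per_hour, storage_tb, storage_rate_per_tb_hour,
                    hours=HOURS_PER_MONTH):
    """Pay by time: hourly compute plus a (much smaller) hourly storage charge."""
    compute = compute_rate_per_hour * hours
    storage = storage_tb * storage_rate_per_tb_hour * hours
    return compute + storage


def consumption_based_cost(tb_processed, price_per_tb):
    """Pay by the byte: terabytes processed times the on-demand price per terabyte."""
    return tb_processed * price_per_tb


def flat_rate_cost(rate_per_hour, hours_to_finish_queries):
    """Flat hourly rate: you must estimate how long the queries run to completion."""
    return rate_per_hour * hours_to_finish_queries


# Hypothetical inputs, chosen only to exercise the formulas.
print(time_based_cost(compute_rate_per_hour=16.0, storage_tb=10,
                      storage_rate_per_tb_hour=0.03))
print(consumption_based_cost(tb_processed=50, price_per_tb=5.0))
print(flat_rate_cost(rate_per_hour=40.0, hours_to_finish_queries=6))
```

Putting the models side by side makes clear which inputs you have to estimate for each: uptime hours, bytes scanned, or query run time.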
30. Total Cost of Ownership is More Than Just Cloud Costs
• Autonomous Administration
• Lack of Platform Features Leads to Increased Configuration and Management
– stored procedures, referential integrity and uniqueness capabilities
– mission critical options for backup and disaster recovery, which typically include a standby database
– full ANSI-SQL compliance
• Performance
31. Pricing Gotchas: Scale Out Impact on Cost
• If an additional identical cluster is deployed to handle the additional user queries, the cost doubles for the time period the additional cluster is up and running
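To see how the doubling plays out over a day, here is a minimal sketch. It assumes the extra cluster is billed at the same hourly rate as the base cluster and only while it is running; the rate and hours are hypothetical.

```python
def daily_cluster_cost(rate_per_hour, base_hours=24, extra_cluster_hours=0):
    """Cost of one always-on cluster plus one identical cluster for part of the day."""
    if not 0 <= extra_cluster_hours <= base_hours:
        raise ValueError("extra_cluster_hours must be between 0 and base_hours")
    base = rate_per_hour * base_hours
    extra = rate_per_hour * extra_cluster_hours
    return base + extra


# Hypothetical: 10 currency units per hour, extra cluster up for 8 busy hours.
without_scale_out = daily_cluster_cost(10)
with_scale_out = daily_cluster_cost(10, extra_cluster_hours=8)
print(without_scale_out, with_scale_out)
```

The hourly spend is doubled only during the busy hours, so the daily bill rises in proportion to how long the second cluster stays up.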
32. Pricing Gotchas: Memory Pressure on Scale Out Compute
• Whenever a data warehouse does not have enough memory to build a join hash table and keep it in memory, it has to spill it to disk
– This is costly in terms of performance, because the DBMS has to do double work writing, sorting, and reading the hash table information all on disk rather than in memory
• If you provision a medium-sized cluster and let it scale out to two medium clusters during the busy hours to handle the higher concurrency, a large JOIN would still spill to disk on one of the clusters
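The point of this slide is that adding clusters adds concurrency, not memory per query. A rough sketch follows, assuming each query runs entirely on one cluster and memory is not pooled across clusters; the sizes in GB are hypothetical.

```python
def memory_per_query_gb(cluster_memory_gb: float, clusters: int) -> float:
    """Memory one query can use. Assumption: a query runs on exactly one
    cluster, so the number of clusters does not change this value."""
    if clusters < 1:
        raise ValueError("need at least one cluster")
    return cluster_memory_gb


def spills_to_disk(hash_table_gb: float, available_memory_gb: float) -> bool:
    """A join spills when its hash table does not fit in available memory."""
    return hash_table_gb > available_memory_gb


MEDIUM_MEMORY_GB = 256    # hypothetical size of a medium cluster
JOIN_HASH_TABLE_GB = 400  # hypothetical hash table for a large JOIN

one_medium = spills_to_disk(JOIN_HASH_TABLE_GB, memory_per_query_gb(MEDIUM_MEMORY_GB, 1))
two_mediums = spills_to_disk(JOIN_HASH_TABLE_GB, memory_per_query_gb(MEDIUM_MEMORY_GB, 2))
one_larger = spills_to_disk(JOIN_HASH_TABLE_GB, 2 * MEDIUM_MEMORY_GB)
print(one_medium, two_mediums, one_larger)
```

Under these assumptions, scaling out leaves the large JOIN spilling, while a single cluster with twice the memory would hold the hash table in memory.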
33. Pricing Gotchas: Appropriate I/O is Expensive
• In the cloud, storage is offered by media type: solid state (SSD) or spinning hard drive (HDD)
• SSD is also offered in tiers
– Azure has Standard and Premium
– AWS has general purpose (gp2) and provisioned IOPS (io1) SSD tiers
35. Data Virtualization
“The right answer is not always to centralize the data. Data Virtualization will be of utmost importance as the ‘perpetual short-term’ solution to the need.”
[Diagram: Enterprise Data Virtualization over Data Warehouses; Marts & Cubes; Operational Data Stores; Transactional Sources; File Systems; Big Data]
36. Capabilities for Data Integration for Enterprise Data
• Comprehensive Native Connectivity
• Multi-Latency Data Ingestion
• Data Integration (in ETL, ELT, Streaming)
• Data Quality and Data Governance
• Data Cataloging and Metadata Management
• Enterprise Trust, Enterprise Scale (or Class)
• AI Intelligence and Automation
• Ecosystem and Multi-cloud
38. Data Integration Options
Vendor | Project Technical Environments Recommended For Consideration | Project Scope
Heterogeneous:
Cloudera | Any | Any
IBM | Any | Any
Informatica | Any | Any
Talend | Any | Any
Specialist:
AWS (Glue) | Environments on AWS with core of Redshift, EMR | Any
Azure (Azure Data Factory) | Environments on Azure with core of Synapse, HDInsight | Any
Fivetran | Any | Contained scope
Google | Environments on GCP with core of BQ, DataProc | Any
Matillion | Any | Contained scope
Oracle | Environments with Oracle database | Any
SAP | SAP-only environments | SAP projects