The document provides an overview of modern data architectures, including data lakes, data warehouses, data lakehouses, and data meshes. It discusses the challenges of big and diverse data, as well as empowering teams through decentralized approaches. The key considerations in determining a data strategy are understanding your use cases and data types, empowering both technology and people, and removing barriers to insights. Starting points may be strategic, focusing on goals, or tactical, focusing on immediate needs.
Got data?… now what? An introduction to modern data platforms
1. Got data… now what?
An introduction to modern data architecture
Susan Pierce
Product Manager, Data Analytics
Google Cloud
2. Topics
● Industry context
● History lesson: data warehouses and data lakes
● Data lakehouse
● Data mesh
● Data vault
● Considerations in determining a data strategy
5. Organizations see data and AI/ML potential, but are struggling to make it a reality
Technology is altering the way organizations operate.
Q: Which of these technologies has the most potential to significantly alter the way your business operates over the next 3 to 5 years? (CIO Magazine, Oct 2021)
● Big data/analytics: 34%
● Artificial intelligence / machine learning: 30%
● Cloud infrastructure: 30%
● Identity and access management (IAM): 19%
● Cloud databases: 17%
● SaaS: 14%
● Internet of Things / M2M: 11%
● Security Orchestration Automation and Response (SOAR): 10%
● Serverless computing: 9%
● Next-generation WiFi: 9%
Yet only 10% of organizations achieve significant financial benefits from AI (Boston Consulting Group, Are You Making the Most of Your Relationship with AI?, Oct 2020)
6. Challenge #1: Data is big and multi-format.
● Structured and unstructured
● Real-time streams and at-rest
● Across clouds and on-premises
181 ZB of data expected worldwide by 2025 (Statista, Feb 2022)
7. Challenge #2: Data requires more than SQL.
● Machine learning & AI
● Stream analytics and events
● Data-driven applications
75% of enterprises will shift from piloting to operationalizing artificial intelligence (Gartner®, Streaming Analytics in the Cloud: A Comparative Analysis of Amazon, Microsoft and Google, Sumit Pal, Shaurya Rana, 14 December 2021)
8. Challenge #3: Data reaches everyone.
● Mission critical
● Accessed by everyone
● Shareable asset
73% of data leaders feel real-time access to data is extremely important (HBR, 2021)
9. Proprietary + Confidential
68% of companies are unable to realize measurable value from data (Accenture, Closing the Data Value Gap, 2019).
Data is big and multi-format. Data requires more than SQL. Data reaches everyone. The result: more data copies, more tech islands, more integrations, more capacity, more data silos, and more security risk, which translate into high costs, low productivity, limited access, constant capacity planning, unavailable data, poor SLAs, and unclear compliance.
10. The data warehouse
● Timeframe: 1980s/1990s
● Purpose: Bridge the gap between operational data and business intelligence
● Data type(s): Structured, cleaned data with a known schema. Operational data is aggregated, cleaned/pre-processed, and inserted into the data warehouse in batches
● Good for: Forcing consistency (quality and integrity), querying known dimensions, large queries
● Typical users: “The Business”
● Other info: Can store data from multiple source databases and doesn’t require a strict 1-1 mapping with transactional databases
● A data mart is a smaller, application-specific data warehouse, usually tied to a specific team or line of business (example: marketing-specific data)
11. Data warehouse: a single integrated repository of atomic data
Two classic architectures (shown as diagrams on the slide):
● Inmon – Enterprise Data Warehouse: data sources → ETL → EDW → physical data marts → BI tools and applications
● Kimball – Dimensional Data Warehouse: data sources → ETL → EDW with logical data marts → BI tools and applications
12. Data warehouses are often painful to manage
● Cost challenges: Legacy systems bust customer budgets, and it’s a hassle to renew the license. Maintaining EDW operations takes too much time.
● Modernization challenges: Doesn’t support machine learning and AI initiatives, as well as streaming use cases.
● Data freshness: Data is not fresh or current enough.
● Scaling challenges: Systems can’t keep up with forecasted usage and data growth. It’s hard to scale compute or storage on demand.
13. BigQuery architecture
Decoupled storage and compute for maximum flexibility:
● Compute (Dremel): a high-availability cluster of stateful workers
● Replicated, distributed storage (high durability), exposed through the Storage API
● Distributed memory shuffle tier, connected over a petabit network
● Ingest: streaming ingest and free bulk loading
● Access: SQL:2011-compliant queries via the REST API, client libraries in 7 languages, Web UI, and CLI
● BI Engine and BQML
14. The data lake
● Timeframe: 2000s/2010s
● Purpose: Storage for data of any type in its native format without pre-processing
● Data type(s): Structured, unstructured, semi-structured in a flat architecture, ingested by
streaming, micro batch, or batch
● Good for: Flexibility across use cases, high volumes of data, data exploration,
granular/low level data
● Typical users: Data scientists
● Other info: Optimized for lower storage cost (cheaper hardware, open source tools), and
considered highly configurable because data is not restricted to a set schema
● Data lakes are usually queried in a programmatic fashion due to the large amounts of
high-variability data they contain
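The programmatic, schema-on-read access described above can be sketched as follows; the in-memory "lake", object paths, and formats are illustrative assumptions standing in for real object storage:

```python
import csv
import io
import json

# Raw objects as they might land in a lake, in their native formats.
lake = {
    "events/2024-01-01.jsonl": '{"user": "a", "clicks": 3}\n{"user": "b", "clicks": 5}\n',
    "exports/users.csv": "user,country\na,DE\nb,US\n",
}

def read_object(path):
    """Parse an object based on its format at read time ("schema on read")."""
    raw = lake[path]
    if path.endswith(".jsonl"):
        return [json.loads(line) for line in raw.splitlines()]
    if path.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(raw)))
    return [{"raw": raw}]  # unknown formats stay stored and queryable as raw text

# Exploration happens in code: no schema was imposed when the data landed.
clicks = sum(rec["clicks"] for rec in read_object("events/2024-01-01.jsonl"))
print(clicks)
```

The key contrast with the warehouse example is that no schema is enforced on write; structure is imposed only when a data scientist interprets the bytes, which is what makes lakes flexible but also hard to govern.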
16. On-premises data lakes are struggling to deliver value
● TCO challenges: Resource utilization and overall TCO of on-premises data lakes becomes unmanageable.
● Governance challenges: Data governance and security issues open up compliance concerns.
● Scaling challenges: Resource-intensive data and analytics processing can lead to missed SLAs.
● Agility challenges: Analytics experimentation is slow due to resource provisioning time.
17. Data Warehouse vs. Data Lake
Data Warehouse:
● Schema on write
● Difficult to change
● Structured data
● Strong business context
● Batch ingestion
● Inherent security
Data Lake:
● Schema on read
● Easy to change
● Raw data
● Geared for exploration
● Supports streaming
● Difficult to govern
18. The big big data decision: data warehouse or data lake?
Data Warehouse (TB scale):
● Use case characteristics: Understanding your business; answer “known” questions, access “known” data
● Data type and access: Structured data; SQL access and manipulation
Data Lake (PB scale):
● Use case characteristics: Exploring your business; answer “unknown” questions, access “unknown” data
● Data type and access: Unstructured (raw) and structured data; code-involved access and exploration
Paper: Build a modern, unified analytics data platform; Blog: Bringing data lakes and data warehouses together
19. With more data comes more responsibility
Data silos: databases, data marts, data lakes, and data warehouses each carry their own security, governance, and metadata. Sources such as transactions, unstructured files, logs, LoB-specific data, and real-time app data feed through ETL/ELT tools into BI, machine learning, and consumer apps. Connecting tools, standardization, and financial governance become cross-cutting concerns.
20. Closing the data/value gap
Silos
● Multiple clouds, on-premises legacy
● No consistent governance
● Duplication: data and definitions
“90% of employees say that their work is slowed by unreliable data sources” (Dimensional Research, 2020)
Complexity
● Volume, velocity, and variety
● Data marts, EDW, data lakes, and lakehouses
● Experimentation and production
“86% of analysts struggle with data that’s out of date” (Dimensional Research, 2020)
Maturity
● Teams: analysts vs. engineers
● Model: self-service vs. centralized
● Technology: BI, AI, and data fabric
“80% of analytics work is still descriptive” (MIT, 2020)
21. Empowering technology and people
Enabling data: the data lakehouse removes the overhead of running separate data lakes and data warehouses. The data warehouse gets the capabilities of the data lake, and the data lake gets the capabilities of the data warehouse.
Provides:
● Multimodal data access at higher volumes of data
● Schema on read
● The governance that data lakes lack but data warehouses provide
Enabling teams: the data mesh removes the organizational barriers that become the bottleneck.
● Emphasizes the data domain/team first, then technology
● Agile teams, more insights
● Teams own their data and technology
● Provides API access to other teams
● Decentralized raw and processed data
Provides:
● Well-defined, governed, and secured data meshes
● Ability to leverage several domains with no data movement
23. Data lakehouse building blocks
● Decoupled data storage: object storage (a low-cost semi-/unstructured data store) and structured storage (a highly optimized analytical store) behind a common data warehouse / data lake layer
● Choice of processing and analytic engines: Spark, streaming, SQL, Beam, batch, and AI/ML over the same data
● Consistent user experience for all users
● Centralized management: automated data discovery, unified permissioning, integrated data catalog
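As a toy illustration of these building blocks, the sketch below models one decoupled storage layer with a shared catalog, queried by two different "engines" without copying data; every name and structure here is invented for illustration:

```python
# Decoupled storage: records addressable by any engine, stored once.
storage = {
    "orders": [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": 25.0},
    ],
}

# Centralized management: one shared catalog describing the data.
catalog = {"orders": {"columns": ["id", "amount"], "owner": "sales"}}

def analytics_engine(table, column):
    """A SQL-style engine: aggregate a column straight from shared storage."""
    return sum(row[column] for row in storage[table])

def ml_engine(table, column):
    """An ML-style engine: derive features from the same storage, no copy made."""
    return [row[column] / 10.0 for row in storage[table]]

# Two engines, one storage layer, one catalog entry.
print(analytics_engine("orders", "amount"))
print(ml_engine("orders", "amount"))
```

The point of the sketch is the decoupling: adding a third engine means adding a function, not a new copy of `storage["orders"]`, which is exactly the overhead the lakehouse removes.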
24. Google Cloud data lakehouse building blocks
● Decoupled data storage: Cloud Storage (low-cost semi-/unstructured data store) and BigQuery Storage (highly optimized analytical store)
● Choice of processing and analytic engines: Dataproc (Spark, Flink, Presto, Hive), BigQuery (SQL), Data Fusion, Dataflow (Beam), Vertex AI, …
● Consistent user experience for all users
● Centralized management: Dataplex (governance, management) and Data Catalog (discovery, search, metadata)
27. Data mesh enables a successful data culture
Through the provision of:
● Distributed ownership: Federated data domain teams are responsible for maintaining their data and making it useful for others, and are given the freedom to choose the best course of action.
● Focus on value of data: Teams are incentivized to maximize the value of their data products while staying within policies and making their data available and useful for others.
● Central support: A central data platform team supports distributed data domains with tooling, processes, standards, guardrails, and automation in policy enforcement.
29. Five targeted outcomes
1. Value of data is measured and recognized
2. Teams empowered to generate value from data
3. No central bottleneck
4. Each domain equipped with relevant skills and knowledge
to be successful
5. Distributed ownership for data governance
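The distributed-ownership and API-access ideas above can be sketched as a tiny "data product" that a domain team exposes through an interface rather than through its raw tables; the class, fields, and sample data are all hypothetical:

```python
class CustomerDataProduct:
    """Owned by a (hypothetical) customer domain team in a data mesh."""

    def __init__(self):
        # Internal storage: the owning team's concern; other domains never
        # touch this directly, so the team is free to change it.
        self._records = [
            {"id": 1, "country": "DE", "ltv": 1200.0},
            {"id": 2, "country": "US", "ltv": 800.0},
        ]

    def customers_by_country(self, country):
        """The public, documented interface other domains consume."""
        return [dict(r) for r in self._records if r["country"] == country]

# Another domain consumes the product via its API, not its raw storage.
product = CustomerDataProduct()
print(product.customers_by_country("DE"))
```

Returning copies (`dict(r)`) rather than internal references is the small-scale analogue of the mesh rule that consumers get governed, read-only access while the owning team keeps the freedom to restructure its storage.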
33. What is Data Vault? An “add-only” model
● The data model can grow by adding new Links and, from there, new Hubs and Satellites
○ Flexible changes with fewer adjustments to ETL and reporting (minimal impact analysis)
○ Parallel loading
○ Minimal regression testing
Rules:
● No direct connections between Hubs
● Hubs don’t reference other entities
● Links reference Hubs
● Satellites reference Hubs or Links
34. Data Vault modelling - how does it work?
A Data Vault model consists of 3 basic entity types
○ The Hub is a business object and separates the business keys from the rest of the model.
No data (!), no relationships -> no reasons to change.
○ The Link stores relationships between hubs (using the business keys).
It is modelled as a many-to-many relationship.
○ Satellites store the context (the attributes of a business key or relationship).
Can be added without changing Hubs and Links.
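The three entity types above can be sketched as tables, here using sqlite3; all table and key names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hubs: only the business key, no descriptive data, no relationships.
    CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_date TEXT);
    CREATE TABLE hub_product  (product_key  TEXT PRIMARY KEY, load_date TEXT);

    -- Link: a many-to-many relationship between hubs, via business keys.
    CREATE TABLE link_purchase (
        customer_key TEXT REFERENCES hub_customer(customer_key),
        product_key  TEXT REFERENCES hub_product(product_key),
        load_date    TEXT
    );

    -- Satellite: descriptive attributes attached to a hub; new satellites
    -- can be added later without changing hubs or links ("add-only").
    CREATE TABLE sat_customer_details (
        customer_key TEXT REFERENCES hub_customer(customer_key),
        name         TEXT,
        load_date    TEXT
    );
""")

conn.execute("INSERT INTO hub_customer VALUES ('C1', '2024-01-01')")
conn.execute("INSERT INTO hub_product  VALUES ('P1', '2024-01-01')")
conn.execute("INSERT INTO link_purchase VALUES ('C1', 'P1', '2024-01-02')")
conn.execute("INSERT INTO sat_customer_details VALUES ('C1', 'Alice', '2024-01-02')")

# Context and relationships are reassembled at query time by joining
# satellites and links back onto the hub's business key.
row = conn.execute("""
    SELECT s.name, l.product_key
    FROM hub_customer h
    JOIN link_purchase l        ON l.customer_key = h.customer_key
    JOIN sat_customer_details s ON s.customer_key = h.customer_key
""").fetchone()
print(row)
```

Note how the rules from the previous slide show up structurally: the hubs reference nothing, the link references only hubs, and the satellite references a hub, so adding a new attribute set means adding a satellite table, not altering an existing one.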
35. A Data Vault reference architecture (diagram)
● Source systems: relational databases and legacy systems, streaming data, NoSQL databases, flat files and Sheets, public data, other clouds, and a data lake (e.g. sales, customer, supplier domains)
● Pipeline: Staging (check datatypes, hard business rules) → Raw Vault → Business Vault (soft business rules, semantic integration) → Information Marts → Presentation Layer (report collections, cubes), driven by business domains and flexible requirements
● Vault and mart variants: operational vault, metrics vault, meta mart, metrics mart, error mart
● Cross-cutting concerns: identity and access management; scheduling, logging & monitoring; version control and continuous integration; metadata; data governance and data life-cycle management; ELT expressed in SQL
37. Further reading
Paper: Build a modern distributed Data Mesh with Google Cloud
Blog: Building a Unified Analytics Data Platform
Paper: Build a modern, unified analytics data platform with Google Cloud
Blog: Data lake and data warehouse convergence
Paper: Converging Architectures: Bringing Data Lakes and Data Warehouses Together
Blog: Data driven transformation using Google's unified analytics platform
Paper: What type of data processing organization are you?
Blog: Open data lakehouse on Google Cloud
Paper: Building a data lakehouse on Google Cloud Platform
Blog: Building the data science driven organization
Blog: Building the data engineering driven organization
Blog: Building the data analyst driven organization from the first principles
Blog: Announcing BigQuery Migration Service