Organizations have been collecting, storing, and accessing data from the beginning of computerization. Insights gained from analyzing the data enable them to identify new opportunities, improve core processes, enable continuous learning and differentiation, remain competitive, and thrive in an increasingly challenging business environment.
The well-established data architecture, consisting of a data warehouse, fed from multiple operational data stores, and fronted by BI tools, has served most organizations well. However, over the last two decades, with the explosion of internet-scale data, and the advent of new approaches to data and computational processing, this tried-and-true data architecture has come under strain, and has created both challenges and opportunities for organizations.
In this green paper, we will discuss modern approaches to data architecture that have evolved to address these challenges and provide a framework for companies to build a data architecture and better adapt to increasing demands of the modern business environment. This discussion of data architecture will be tied to the Data Maturity Journey introduced in EQengineered’s June 2021 green paper on Data Modernization.
2. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
2
Table of Contents
Executive Summary ....................................................................................................................3
Introduction.................................................................................................................................4
An Evolutionary Journey of Enterprise Data Architecture...........................................................5
The Enterprise Data Warehouse ......................................................................................5
Key Technological Trends and Architectural Shifts .........................................................6
NoSQL Databases and Polyglot Persistence...................................................................7
Data Lake and ELT...........................................................................................................9
Patterns for Building a Modern Data Architecture....................................................................11
Architectural Pattern for Descriptive & Diagnostic Stage Analytics...............................13
Architectural Pattern for Predictive Stage Analytics ......................................................14
Architectural Pattern for Prescriptive Stage Analytics ...................................................15
Conclusion................................................................................................................................15
3. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
3
Executive Summary
Organizations have been collecting, storing, and accessing data from the beginning of
computerization. Insights gained from analyzing the data enable them to identify new
opportunities, improve core processes, enable continuous learning and differentiation, remain
competitive, and thrive in an increasingly challenging business environment.
The well-established data architecture, consisting of a data warehouse, fed from multiple
operational data stores, and fronted by BI tools, has served most organizations well.
However, over the last two decades, with the explosion of internet-scale data, and the advent
of new approaches to data and computational processing, this tried-and-true data
architecture has come under strain, and has created both challenges and opportunities for
organizations.
In this green paper, we will discuss modern approaches to data architecture that have
evolved to address these challenges and provide a framework for companies to build a data
architecture and better adapt to increasing demands of the modern business environment.
This discussion of data architecture will be tied to the Data Maturity Journey introduced in
EQengineered’s June 2021 green paper on Data Modernization.
4. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
4
Introduction
Data storage has always been an integral part of computer technology. In the early days of
commercial computing, the technology for storing and retrieving data was clunky and slow,
and required users to become experts. In the 80s, with the introduction of relational
databases, data storage and access started becoming easier for even non-experts to use.
A major transformation in the way organizations use data came about with the data
warehousing paradigm in the early 90s. Business leaders could now access business
intelligence dashboards from a centralized data store, fed from multiple operational data
systems. Up to this point, the managing the various components of the data architecture was
relatively straightforward for most organizations.
Over the last two decades, several transformative technology trends like the following have
emerged in the business landscape causing foundational changes to business norms:
• Internet-scale data generation and its impact on data volume, velocity, and variety,
• The rise of cloud computing and its impact on scale and availability, and
• Adoption of machine learning and artificial intelligence techniques for business analytics.
For companies to succeed in this transformed landscape, adoption of new technologies and
architectures that align with strategic business priorities is imperative.
5. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
5
An Evolutionary Journey of Enterprise Data Architecture
Today, most organizations implement a variation of the following generic data processing flow
and supporting processes and infrastructure:
The Enterprise Data Warehouse
The dominating data processing architecture of the 90s was the one that integrated
operational systems with the enterprise data warehouse (EDW). The EDW forms the core
component of business intelligence (BI), serving as a central repository of data from multiple
sources, both current and historical.
This architecture enables an enterprise-wide capability for analytics and reporting, thus
allowing organizations to respond quickly to changing business opportunities. For traditional
organizational data-processing needs, this architecture can satisfy quality requirements like
Generation
• Business data
• Operational
data
Ingestion
• Extraction
• Transformation
Storage
• Repositories for
query and
analytics
Analytics
• Tools for
running queries
and analytics
Consumption
• Reports
• Dashboards
• APIs
Data Processing Stages
Generation Ingestion Storage Analytics Consumption
Data Processing Stages
Data Flow Pipeline &
Workflow
Quality &
Security
Performance &
Scalability
Disaster Recovery &
Failover
OLTP Databases
Applications, CRMs,
ERPs
APIs, Flat-files
ETL processes Data Warehouse Query Tools
BI Tools
Reports/Dashboards
APIs
6. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
6
data integrity, consistency, availability, and failover. The flexibility of this architecture has also
served well in the migration from on-premise installations to cloud.
Key Technological Trends and Architectural Shifts
The introduction of newer technologies and platforms like internet-scale systems, migration to
cloud computing, ML/AI based analytics, IoT (Internet of Things), and edge devices like smart
phones have necessitated fundamental shifts to the data architecture to meet the enhanced
requirements for scalability, consistency, and disaster recovery.
The table below lists how enterprise data components are impacted due to this shift.
Stage From
(Key Attributes)
To
(Key Attributes)
Rationale
Generation • On-prem mostly
relational
databases
• Structured data
• Batch
• Cloud-based
Relational and
NoSQL databases
• Structured &
unstructured data
• Real-time & batch
• Moving to the cloud to take
advantage of the increased
flexibility, and scalability
• Polyglot persistence (see
below) uses the most suitable
database technology for the
task
Ingestion • On-prem ETL
tools
• Structured data
• Batch
• Cloud-based ELT
tools
• Structured &
unstructured data
• Real-time & batch
• ETL processes are being
replaced by more flexible and
performant ELT ones (see
below)
Storage • On-prem
centralized Data
warehouse
• Cloud-based Data
warehouse
• Distributed
storage
• Data Lake
• Increased flexibility in data
collection from multiple
sources, scale, and analytics
Analytics • Historical &
prescriptive
• Batch
• Data mining
• Prescriptive &
Predictive
• Real-time & batch
• AI/ML models
• Moving from prescriptive to
predictive analytics
• Use of AI/ML approaches
Consumption • Dashboards &
Reports
• Centralized
administration
• AI/ML model
driven insights
• Self-serve
• Analytics now available to non-
technical users through self-
serve portals
7. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
7
NoSQL Databases and Polyglot Persistence
Relational database technologies are well-suited for enterprise transaction processing with
their ACID—atomicity, consistency, isolation, and durability—guarantees, which ensure data
validity despite errors and failures. To implement these ACID properties, relational databases
make some tradeoffs with regards to scalability and availability by assuming a single instance
deployment, with tools for replication and disaster recovery. Handling a larger volume of data
simply means running the database on a larger server.
Concerns about data consistency and availability were not part of mainstream enterprise
architectural thinking before the confluence of several technological shifts like the widespread
adoption of cloud computing to handle the increasing volume and variety of unstructured data
serving a globally distributed user base. Companies quickly realized that simply moving a
relational database to a larger server is not able to meet the scalability and availability
demands of the business. Organizations need to think about distributing the load across
multiple database instances. However, relational databases are not designed to guarantee
availability in the face of network failures that is inherent in a distributed system.
In addition, modern Internet based applications generate a wider variety of unstructured data
that can consist of images, audio and video data, chat transcripts, application logs, social
media interactions, and other content that is difficult for relational technologies to handle.
A new class of database technologies, collectively known as NoSQL databases, better suited
for distributed operations, and storing various forms of unstructured data, has become
popular in the last decade.
8. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
8
The following table lists the different types of data and access patterns for which NoSQL
databases are more appropriate.
Database
Type
Strengths Weaknesses Use cases Tools
Key-Value Scalability,
Availability,
Fast read/write
access
Limited query and
ACID capabilities;
Not able to
update data
partially
User sessions;
Chat data; Caching
Memcached,
Redis, Riak, AWS
DynamoDB,
Azure CosmosDB
Document Flexible data
model,
Distributed
deployment
Schema-less
design may
impact
maintainability;
Not able to
update data
partially; Limited
SQL-like query
capabilities
IoT data capture;
Product catalogs;
Content
management
MongoDB,
CouchDB, AWS
DynamoDB,
Azure CosmosDB
Column Scalability,
Availability,
High read/write
performance,
Distributed
deployment
Some has limited
SQL-like query
capabilities
Catalog searches;
Time series data
Cassandra,
HBase, AWS
Keyspaces,
Azure Cosmos
DB Cassandra
API
Graph Network or
hierarchical
relationships
Limited
scalability;
Limited
transactional
capabilities
Social
relationships;
Hierarchies;
Recommendations;
Fraud detections
Neo4j, AWS
Netpune, Azure
Cosmos DB
Gremlin API
An organization may opt for a mix of different databases, with relational databases providing
the core and critical transactional components of a system, while one or more of the NoSQL
databases providing support for different use cases. This trend of picking the right database
technology for the right use case has become known as polyglot persistence.
9. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
9
Data Lake and ELT
The needs of traditional analytics are well-met by the data warehouse architecture discussed
above. Interactive BI dashboards, visualization tools, and reports can describe what is
happening in a business, and answer questions like “What Happened?” and “Why did
Something Happen?” These tools mostly deal with structured data, well-defined at source,
and ingested through batch processes.
The ETL (Extract-Transform-Load) pattern has been the workhorse of ingesting data from
diverse sources into the data warehouse for a long time.
The ETL pattern includes the following steps:
• Extract: data is copied from different source systems to a staging area, typically in batch
mode.
• Transform: data cleansing, enrichment, and transformation is done in preparation for
loading into the data warehouse
• Load: the transformed data is loaded into the data warehouse tables
Although the ETL approach is good for bulk data movement, it has a few limitations in the
new world of internet scale, real-time data, and AI/ML based analytics, including:
• It is primarily batch-oriented and not suitable for real-time data
• EDW schemas are difficult to change and adapt to changing source data structures
• Data that does not have a schema defined for it cannot be loaded into the EDW and is not
available to the analytics team
10. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
10
• Since in this model the EDW serves as a repository of historical data, the data that cannot
be loaded into the EDW is lost
• For the data that is loaded into the EDW, its lineage information is not preserved
• Because of the dependency on an EDW schema, unstructured and binary data are difficult
to load
The ELT (Extract-Load-Transform) is a variation on the ETL approach, and it consists of the
following steps:
• Extract and Load: data is copied from different source systems to a centralized
repository or staging area in its raw form, sometimes called a data lake
• Transform: data from this central repository is moved after cleansing, enrichment, and
transformation to the most appropriate location for analytics, which may be a data
warehouse, or even a NoSQL database.
The advantages of this approach include:
• Data is always maintained in its raw form in the data lake, preserving history and lineage
information
• Data from diverse sources—real-time, and streaming, structured, unstructured, and
binary—can all be stored in the data lake
• Data in the data lake can be processed after the fact when a need arises
• Analytics tools can access data directly from the data lake even if it is not available in the
data warehouse
11. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
11
Patterns for Building a Modern Data Architecture
In an earlier green paper—Building an Effective Data & Analytics Operating Model—we
presented a data maturity journey:
As an organization plans on a data modernization initiative, selecting the appropriate set of
technologies and tools from the breathtakingly diverse and complex choices may appear
quite daunting. To select an architecture best suited for their specific needs, it helps to think
in terms of the maturity journey, the appropriate use cases, and corresponding patterns of
data architecture.
Maturity Level Target Use Cases
Descriptive & Diagnostic • Reporting, dashboards, ad-hoc analysis
• Multiple data sources, both structured and some unstructured
• Third-party data sources from SaaS vendors
• Monolithic applications with centralized data stores
Predictive • Business intelligence and occasional operational AI/ML
• Streaming data
• Diverse data types (including text, images, and video)
• Microservice based applications with federated data stores
12. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
12
Prescriptive • All the above. and
• AI/ML driven, real-time capabilities for both internal and
external users
The important thing to note here is the recommendation to migrate to cloud-based data
platforms which offers several advantages over using traditional on-prem technologies,
including that it is easier to get started, scale, and mature.
Here is a list of popular cloud-based technologies used in the modern data architecture stack:
Technology Categories Cloud Products
NoSQL databases See above
Data warehouse Snowflake, AWS Redshift, Azure SQL Data Warehouse
Data lake Databricks Delta Lake, AWS S3, Azure Data Lake (Hadoop)
Stream processing Kafka, AWS Kinesis, Azure Event Hub
ETL/ELT Fivetran, dbt, Airflow
BI tools Tableau, Power BI, Looker
Data science and AI/ML
platforms
Databricks Spark ML, AWS SageMaker, Azure ML
MLOps MLFlow, Seldon, Kubeflow, Apache Airflow, AWS SageMaker,
Azure ML
13. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
13
Architectural Pattern for Descriptive & Diagnostic Stage Analytics
The architectural pattern for descriptive and diagnostic stage analytics is appropriate for
organizations of all sizes which are at the initial milestones of their analytics journey. It uses an
architecture quite similar to that of the traditional EDW architecture, with a few
enhancements.
The architecture includes the
• Ability to handle multiple data sources, both structured and some unstructured
• Use of cloud-based technologies: relational databases for storing operational data, data
warehouse, NoSQL data stores
Because of the similarity of this architecture to traditional EDW setups, it is easier to get
started. Most importantly, however, this architecture lays the ground work for moving to the
subsequent maturity levels of the data journey without requiring a major rearchitecting effort.
Generation Ingestion Storage Analytics Consumption
OLTP Databases
Applications, CRMs,
ERPs
APIs, Flat-files
ETL processes Data Warehouse Query Tools
BI Tools
Reports/Dashboards
APIs
NoSQL Stores
Logs
NoSQL Stores
14. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
14
Architectural Pattern for Predictive Stage Analytics
The pattern for predictive analytics is the next milestone in the data maturity journey and is
appropriate for organizations with more complex data needs, including support for more
diverse data types, streaming data, and support for both operational and analytics use cases.
The salient components of this patterns include the:
• Use of the ELT workflow and a cloud-based data lake to handle the diversity of source
data
• Use of prepackaged AutoML platforms for a relatively simpler approach to predictive
analytics
Generation Ingestion Storage Analytics Consumption
OLTP Databases
Applications, CRMs,
ERPs
APIs, Flat-files
ELT workflow Data Warehouse Query Tools
BI Tools
Reports/Dashboards
APIs
NoSQL Stores
Logs
NoSQL Stores
Event Collectors
Stream Processing
Data Lake Data Science Tools Embedded Analytics
15. P.O. Box 211 Hampton Falls, NH 03844 | info@eqengineered.com | eqengineered.com | 617.448.4255
15
Architectural Pattern for Prescriptive Stage Analytics
At the end stage of the data maturity journey, the pattern for prescriptive analytics supports a
robust MLOps workflow, with exploration, testing, deployment, and monitoring, with AI/ML as
a core capability.
This architecture is both powerful and flexible to address the most sophisticated analytics
needs of an organization.
Conclusion
Harnessing the capabilities of a mature data and analytics practice will allow organizations to
create significant value and differentiate themselves for their competitors. There is a wide
choice of tools from which to build a modern data architecture stack.
Though there is no one-size-fits-all approach, the above framework assists an organization to
build a robust and adaptable data architecture aligned to their strategic and business
imperatives.
Generation Ingestion Storage Analytics Consumption
OLTP Databases
Applications, CRMs,
ERPs
APIs, Flat-files
ELT workflow Data Warehouse Query Tools
BI Tools
Reports/Dashboards
APIs
NoSQL Stores
Logs
NoSQL Stores
Event Collectors
Stream Processing
Data Lake Data Science Tools Embedded Analytics
AI/ML Platform
Exploration & Training Deployment Operationalization
Labelling
Feature Engineering
Model Training & Tuning
Scaling
Testing
Model Versioning
Data Versioning
Monitoring & Alerting
Model Tracking
Model Tracking