This EMC perspective provides an overview of the EMC Data Warehouse Modernization offering. It describes four tactics that can be implemented quickly, using an organization's existing skill sets, and that can rapidly show a return on investment.
Use Big Data Technologies to Modernize Your Enterprise Data Warehouse
Most organizations’ enterprise data warehouses were built with online transaction
processing (OLTP)-centric technologies and architectures that are 15-20 years old.
Over the years, more data has been bolted on to these systems, and the query
load being driven by both traditional and mobile business intelligence products has
increased exponentially, resulting in brittle, over-burdened, costly data warehouses
that can take hours to return results. They don’t meet the growing data appetite of
the business, and don’t answer the questions needed to run the business at the
required levels of granularity, or at the necessary speed. Yet too much has been
invested in them to simply throw them out.
Big Data market dynamics have resulted in the creation of new technologies,
products, and approaches that can be used to modernize these stodgy, inflexible
data warehouses, and make them more responsive to the business—without
throwing out what is already in place. This paper describes four tactics that can be
implemented quickly, using an organization’s existing skill sets, and that can
rapidly show a return on investment.
TACTIC #1: ACCELERATE YOUR DATA
WAREHOUSE WITH MPP-BASED
ARCHITECTURES
BENEFITS
Leverage more detailed, more robust dimensional data
• Seasonality to forecast retail sales and energy consumption
• Localization to pinpoint lending or fraud exposure
• Hyper-dimensionality for digital media attribution or health care treatment analysis

Massively Parallel Processing (MPP)-based databases provide a cost-effective, scale-out data warehouse environment that allows organizations to leverage Moore’s Law¹ on performance-to-cost ratio improvements in x86 processors. MPP databases provide a non-intrusive analytical platform/data warehouse for data discovery and exploratory work on massive amounts of data. Built on inexpensive commodity clusters, MPP databases can extend, complement, or replace parts of your existing data warehouse, managing massive volumes of detailed data, while providing agile query, reporting, dashboards, and analytics (see Figure 1).
MPP databases, while offering many of the same benefits as your existing data warehouse, also provide the following advantages:
• Extreme scalability on general purpose systems
• Automatic parallelization
• Ability to load and query like any other database
• Scanning and processing of all nodes in parallel
• Extreme scalability and optimized I/O
• Linear scalability to easily add nodes and storage
• Improved query and loading performance
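
The parallel scanning described in the list above can be illustrated with a small scatter-gather sketch. This is not how any particular MPP product is implemented; it assumes a toy table of (region, amount) rows, uses Python threads as stand-ins for nodes, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Distribute rows round-robin across n_nodes segments (one per 'node')."""
    segments = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        segments[i % n_nodes].append(row)
    return segments


def scan_segment(segment):
    """Each node scans only its own slice and returns a partial aggregate."""
    partial = Counter()
    for region, amount in segment:
        partial[region] += amount
    return partial


def parallel_sum_by_region(rows, n_nodes=4):
    """Scatter the scan to every node, then gather and merge the partial sums."""
    segments = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(scan_segment, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical sample data: (region, sale amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]
result = parallel_sum_by_region(sales, n_nodes=3)
print(result)
```

Adding a node in this sketch only means raising `n_nodes`; the query itself does not change, which is the property the "linear scalability" bullet refers to.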
Figure 1: MPP Data Warehouse Architectures Scale Easily to Speed Results and Process More Data
1 Moore's law is the observation that over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. The result is the doubling of computing power at the same cost every 18 to 24 months. http://en.wikipedia.org/wiki/Moore%27s_law
An MPP data warehouse will enable more granular data for query, reporting, and
dashboard drill-down and drill-across exploration. Analysis can be performed on
detailed data instead of data aggregates.
On the analytics side, once a model has been developed and business insights have
been gleaned from these data sets, simply migrate the model and/or the insights into
the existing data warehouse for integration into the current business intelligence
environment. Alternatively, analytic modeling can also be done on the MPP platform,
making it part of the production process.
TACTIC #2: STOP MOVING DATA TO THE
ANALYTICS; BRING THE ANALYTICS TO THE
DATA
BENEFITS
Leverage low-latency (high-velocity) data access
• Drive realtime customer acquisition, predictive maintenance, or network optimization decisions
• Update analytic models on-demand based upon current market or local weather conditions

One of the most dramatic developments in Big Data is the advent of in-database analytics. In-database analytics addresses one of the biggest shortcomings in performing advanced analytics: the requirement to move large amounts of data around. That requirement has forced many organizations and data scientists to settle for working with aggregate tables, because the data transfer issue is so debilitating to the analytic exploration and discovery process. In-database analytics reverses the process by moving the analytic algorithms to where the data is stored, accelerating the development and deployment of modeling. Elimination of data movement results in substantial benefits:
• Moving a few terabytes can take hours. With in-database analytics, that transfer time drops to zero.
• Because the movement of data is the most time-consuming activity in processing, reducing data movement cuts the processing time to roughly 1/N of the original, where N is the number of processing units. Processing time for 1 TB can be reduced by a factor of 16 with only a five-processor system, going from 193 minutes to 12 minutes (see Figure 2).
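
The difference between moving data to the analytics and moving the analytics to the data can be sketched with any SQL engine. The example below uses Python's built-in `sqlite3` module purely as a stand-in for a data warehouse; the `sales` table and its contents are hypothetical, and the row counts only illustrate how much data crosses the database boundary in each approach.

```python
import sqlite3

# In-memory database standing in for the warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 7.0), ("west", 8.0)],
)

# Approach 1: move the data to the analytics.
# Every detail row leaves the database and is aggregated in client code.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
pulled = {}
for region, amount in rows:
    pulled[region] = pulled.get(region, 0.0) + amount

# Approach 2: bring the analytics to the data.
# The aggregation runs inside the database; only the summary is returned.
in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)

rows_moved_pull = len(rows)    # one row per sale
rows_moved_in_db = len(in_db)  # one row per region
print(pulled, in_db, rows_moved_pull, rows_moved_in_db)
```

Both approaches produce the same totals; what changes is the volume transferred, which grows with the detail data in the first case and with the size of the answer in the second.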
Figure 2: In-database Analytics Dramatically Speeds Processing Time
TACTIC #3: USE ALL OF YOUR DATA WITH A
NEXT GENERATION OPERATIONAL DATA STORE
BENEFITS
Manage a wide variety of structured and unstructured data sources
• Integrate unstructured claims descriptions to reduce fraudulent claims
• Leverage mobile data to create realtime promotions
• Leverage sensor readings to optimize yield and pricing

The Hadoop Distributed File System (HDFS) provides a powerful yet inexpensive option for modernizing Operational Data Store (ODS) and Data Staging areas. HDFS is a cost-effective, large storage system, and Hadoop pairs it with a built-in computing and analytical capability (MapReduce). Built on commodity clusters, HDFS simplifies the acquisition and storage of diverse data sources, whether structured, semi-structured (e.g., web logs and sensor feeds), or unstructured (e.g., social media, image, video, and audio). Once the data is in the Hadoop file system, MapReduce and commercial Hadoop-based tools are available to prepare it for loading into an existing data warehouse. The ability to “define schema on query” versus “define schema on load” simplifies amassing data from a variety of sources, even if you are not sure when and how you might use that data later (see Figure 3).
The result is a single platform for feeding both your data warehouse and analytics environment. This inexpensive, scale-out solution can be used to store ALL of your data.
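
The "define schema on query" idea above can be sketched in a few lines. This is an illustration only, not Hadoop code: a Python list of raw JSON lines stands in for files landed in HDFS, and the `query` function, field names, and sample records are all hypothetical.

```python
import json

# Raw records are stored exactly as they arrive; no schema is enforced on load.
raw_store = [
    '{"ts": "2013-01-05T10:00:00", "sensor": "s1", "temp_c": "21.5"}',
    '{"ts": "2013-01-05T10:01:00", "user": "u42", "url": "/pricing"}',
    '{"ts": "2013-01-05T10:02:00", "sensor": "s2", "temp_c": "19.0", "humidity": 40}',
]


def query(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in the schema and
    casts each value with the function the schema supplies for that field.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two different questions, two different schemas, one unchanged raw store.
readings = list(query(raw_store, {"sensor": str, "temp_c": float}))
clicks = list(query(raw_store, {"user": str, "url": str}))
print(readings)
print(clicks)
```

Because the schema lives in the query rather than in the load step, a new source such as the `humidity` field can be stored immediately and interpreted later, once someone knows how they want to use it.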
Figure 3: Use Hadoop as an Operational Data Store and Analyze ALL of the Data
TACTIC #4: LEVERAGE UNSTRUCTURED DATA
TO ADD NEW METRICS TO AN ENTERPRISE
DATA WAREHOUSE
BENEFITS
Leverage new metrics, dimensions, and dimensional attributes gleaned from unstructured data sources
• Leverage customers’ interests, passions, associations, and affiliations to improve micro-segmentation
• Add sensor-generated performance data into your manufacturing, supply chain, or product predictive maintenance models

An easy way to start building experience with Hadoop and MapReduce is to use these technologies to create new metrics from an unstructured data source that can be fed into the enterprise data warehouse. This will provide the ability to leverage data such as social, mobile, consumer comments, email, doctors’ notes, or claims descriptions to identify new metrics that are better predictors of performance. Most organizations’ existing data warehouses are treasure troves of key performance indicators and metrics used to monitor business performance. Use Hadoop and MapReduce to parse through unstructured data to identify new business performance metrics that can be integrated into the existing data warehouse (see Figure 4).
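
The pattern described above, parsing unstructured text into a metric that can be loaded into the warehouse, follows the map, shuffle, and reduce steps. The sketch below runs those steps in plain Python rather than on Hadoop; the comments, the list of negative terms, and the metric name `negative_mentions_per_comment` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary for the metric; a real project would define its own.
NEGATIVE_TERMS = {"broken", "late", "refund"}

# Hypothetical unstructured input: (product id, free-text consumer comment)
comments = [
    ("p1", "Arrived late and the box was broken"),
    ("p1", "Works great"),
    ("p2", "I want a refund, it is broken"),
    ("p2", "Broken again. Refund please!"),
]


def map_comment(product_id, text):
    """Map step: emit (key, (negative term hits, comment count of 1))."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word in NEGATIVE_TERMS)
    return product_id, (hits, 1)


def shuffle(pairs):
    """Shuffle step: group all mapped values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_metric(product_id, values):
    """Reduce step: collapse one key's values into a warehouse-ready row."""
    hits = sum(h for h, _ in values)
    count = sum(c for _, c in values)
    return {
        "product_id": product_id,
        "comments": count,
        "negative_mentions_per_comment": hits / count,
    }


mapped = [map_comment(pid, text) for pid, text in comments]
metrics = [reduce_metric(pid, vals) for pid, vals in sorted(shuffle(mapped).items())]
print(metrics)
```

Each output row is structured and keyed by `product_id`, so it can be loaded into an existing fact or dimension table alongside the key performance indicators already stored there.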
Figure 4: Parse Unstructured Data Using Hadoop/MapReduce and Incorporate Results into the Enterprise Data Warehouse
Once these new metrics are in the enterprise data warehouse, they can be used to
enhance existing business intelligence queries, reports, dashboards, and analyses
(see Figure 5).
Figure 5: Integrate Social Media Metrics into the Existing BI Environment
Note: implementing this tactic places companies in a good position as Hadoop
continues its assimilation into the relational database market. Being able to create
metrics and process data on Hadoop, leveraging tools like HBase and Hive that are
evolving quickly, and having BI tools connect directly to HDFS, may make data
warehouse professionals question why they need to move data to a relational
database at all.
MODERNIZE YOUR DATA WAREHOUSE TODAY
In the world of revolutionary, game-changing Big Data developments, data
warehouse modernization may sound like an evolutionary development. However, it is
something that can be executed today, with existing data warehouse skills, and
represents a simple first step toward gleaning immediate business value and
organizational agility from Big Data technologies. Why are you waiting?