The document discusses strategies for organizations to better manage big data when resources are limited. It recommends identifying unused data in the data warehouse in order to reduce costs by moving that data to cheaper platforms like Hadoop. Organizations can save millions by offloading data that is not frequently queried but must be retained for regulatory reasons. The document also suggests purging data that is not needed at all to further reduce storage and management costs. Proper classification and placement of data onto platforms suited to its usage level and type, such as Hadoop for less critical datasets, can help organizations get more value from their data with fewer resources.
CONTENTS

Long on Data, Short on Resources 1
Know Your Data 3
    Reducing Data Maintenance Costs 5
Choose Your Data Platform Wisely 8
    Reining in Data Growth Costs 10
Don't Keep What You Don't Need 11
    Overcoming Data Growth and Regulatory Compliance Challenges 12
Getting What You Need to Manage Your Data 14
For More Info 15
Big Data Management: Work Smarter Not Harder

We've been deluged with statistics on data's rapid growth, to the point that the numbers and bytes have become almost meaningless. No one would deny that data growth is an unstoppable trend. But that's not the issue. The real issue is how organizations can make big data meaningful when IT resources are shrinking.

LONG ON DATA, SHORT ON RESOURCES

The good news is that business users want more data, and they're getting it. But in some cases, more data is actually having an adverse effect on business. Fifty-six percent of IT decision makers surveyed by IDG said that their users frequently or occasionally report feeling overwhelmed by incoming data and information, while 53% said the influx of large quantities of data has delayed important decisions because they didn't have the right tools to properly manage it. Leading companies are realizing that having the right technology makes all the difference in ensuring that data can be used as an asset rather than a liability.

[Infographic: 56% of IT decision makers said that their users report feeling overwhelmed by incoming data and information; 53% said the influx of data has delayed decisions because they didn't have the right tools to manage it. Source: IDG Enterprise, 2015]
[Callout: Data management staff as a percentage of IT staff has risen a meager 0.5%. Source: Computer Economics, 2015]
However, despite the perceived value of data, the allocation of resources to manage and leverage big data has not kept pace with its growth. According to research firm Computer Economics, data management staff as a percentage of IT staff has risen a meager 0.5% in four years, and IT spending per user continues to decline. In fact, the same study showed that when adjusted for inflation, spending per user decreased from $10,514 in 2012 to just $6,847 in 2015. But it's not all about the money.
Finding people with the necessary skillsets will only grow more challenging. The McKinsey Global Institute predicts that by 2018 the US could face a shortage of 140,000-190,000 people with deep analytical skills, as well as a deficit of 1.5 million people who can leverage big data analysis to make effective decisions. This drives the need for automation that reduces the skills and training required to manage data.

As is often the case, the best way to address the big data resource and skills shortage is to work smarter, not harder. In this ebook, we look at how IT organizations can manage data smarter, while maintaining or even reducing costs, so that business users can get real value from data, faster and easier.
KNOW YOUR DATA

Moving data, transforming data, and making it available to the business is a very expensive process. Given data's rapid rate of growth, and the amount of waste in the current data management paradigm, it's time to transform the economics of data.

Most enterprises leverage a wide variety of data types in high volumes for big data analytics projects. These include social media data, internal data, log data, mobile device data, sensor data, free public external data, and the list keeps growing. In fact, according to QuinStreet Research, by 2020 the world will generate 50 times as much data as it does today, but the IT staff responsible for managing it will only grow 1.5 times. On top of that challenge, only 40-55% of the data that they load is ever used. When you consider that it costs $2-6 million to support every 50-100 TB of new data, supporting dormant data results in a tremendous amount of inefficiency.

[Infographic: By 2020, the world will generate 50X as much data, but the IT staff who manages it will only grow 1.5X. Source: QuinStreet Enterprise Research, 2014]
Dormant data also slows down performance, since the process of loading data uses up to 60% of the CPU. A lot of data may need to be retained in its original form for compliance, and may undergo ETL and transformation processes on the prospect that it will serve other needs, yet never get used. As a result, it unnecessarily impacts both costs and performance.
But the exorbitant cost of not managing dormant data well isn't just about the storage. In fact, it's less about the storage and more about CPU capacity. Most vendors charge by CPU capacity. As CPU capacity increases, so do your licensing costs.

[Infographic: Data waste: only 40 to 55% of the data companies load will ever be used. Cost of supporting data: every 50-100 TB of new data costs $2-6 million to support. Cost of CPU: loading data uses up to 60% of the CPU, and license costs go up as CPU capacity increases. Source: Based on Attunity customer implementations/input worldwide, 2015]
CUSTOMER SUCCESS STORY

[Callout: By offloading 43% of the EDW into Hadoop, yearly maintenance costs decrease from $21M to $5M in three years. Source: Based on Attunity customer implementations/input worldwide, 2015]
Reducing Data Maintenance Costs
By analyzing EDW use for just one month, an Attunity customer discovered that 37 TB of data (43% of the EDW) didn't receive any kind of analytical query. And yet the CPU consumption to ingest and load that data was over 60%.

By offloading that 43% into Hadoop, the customer dramatically decreased the need for more capacity, reduced the number of EDW nodes, and lowered maintenance costs. In fact, the customer is looking at driving down yearly maintenance costs from $21 million to $5 million in just three years, all by being more strategic about data management.
The data warehouse is a reflection of the business. It grows in response to business needs. It makes sense, then, to analyze data activity and usage accordingly. When you group applications, data, or users in the context of the business (for example, by department or line of business), you can begin to analyze utilization and assign accountability via chargeback or showback. For example, when marketing requests more data from IT, the IT department may need to show them how much data hasn't been used, along with the cost to continue to manage current and new data.

When a business can specify how much it costs to load and maintain data, and demonstrate how much isn't being used, the dataset that seemed so important before may lose some of its significance. The standing request might just lose its urgency, particularly if the cost to keep the data comes out of departmental budgets and ROI is lacking.

To figure out what's used, look at what's been queried.

[Callout: 43% of data in the data warehouse never received a single analytical query in a month. Source: Based on Attunity customer implementations/input worldwide, 2015]
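That first step, checking which tables actually received queries over a window, comes down to comparing the warehouse catalog against the query log. The sketch below is purely illustrative: the table names, sizes, `query_log` entries, and the `dormant_tables` helper are invented stand-ins for a real EDW catalog and audit log, not any particular vendor's API.

```python
from datetime import datetime

# Hypothetical catalog: table name -> size in TB.
tables = {"sales_2012": 9.0, "sales_2015": 4.0,
          "clickstream_raw": 24.0, "customer_dim": 0.5}

# Hypothetical month of query-log entries (table queried, timestamp).
query_log = [
    {"table": "sales_2015", "ts": datetime(2015, 6, 3)},
    {"table": "customer_dim", "ts": datetime(2015, 6, 17)},
]

def dormant_tables(tables, query_log, since):
    """Return the tables that received no analytical query since `since`."""
    recently_queried = {q["table"] for q in query_log if q["ts"] >= since}
    return {t: size for t, size in tables.items() if t not in recently_queried}

dormant = dormant_tables(tables, query_log, since=datetime(2015, 6, 1))
dormant_tb = sum(dormant.values())
total_tb = sum(tables.values())
print(f"Dormant: {sorted(dormant)} = {dormant_tb:.1f} TB "
      f"({dormant_tb / total_tb:.0%} of the warehouse)")
```

Run monthly over real usage data, a report like this produces the "what's dormant, and how big is it" numbers that the chargeback conversation described above depends on.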
The key is to gain visibility into the EDW to learn what data is used and what data is unused. Identifying dormant data recovers storage capacity, but it also helps reduce costs related to loading and transforming the data. If you don't need the data anymore, you can stop loading it, which means you eliminate a portion of the ETL processes that are consuming CPU capacity. If you do need the data, say for regulatory reasons, you can offload the ETL processes that load and transform the data onto a lower-cost Hadoop cluster. You not only recover storage capacity, but you also consume less CPU capacity on the system because of all the data that you're not loading and ingesting into an EDW.
CHOOSE YOUR DATA PLATFORM WISELY

As data grows, the platforms that support it increase in size and multiply, because different platforms optimize different workloads. That's why placing data on the right platform is critical to efficiently managing data as a strategic asset. Enterprises can realize significant benefits by modernizing and optimizing data placement.

Not all data is created equal. Some data is of high value and used for complex analytics, while other data is kept primarily for regulatory purposes, and then there's all the data in between. A dataset should be moved to the most appropriate platform based on its use case.

[Diagram:
Data that's being loaded, but you don't need for the business: archive or throw away.
Data that should be maintained, but not used for analytics: load and maintain in Hadoop.
Datasets that are being utilized, but don't require a high-end data warehouse: load and run batch analytics in Hadoop.]
There are three general types of data platforms:

Enterprise data warehouse
An enterprise data warehouse (EDW) is appropriate for frequently accessed, high-value data used for complex analytics. EDWs are high-end engineered systems designed specifically for complex analytics and many simultaneous users, and they're priced accordingly. An EDW is a great place to leverage high-value data, but it isn't the ideal place to store data that you don't plan to use anytime soon.

Data mart
A data mart is more focused than a data warehouse, consolidating information for a particular subject area (such as sales or finance). Data marts may be fed by data from a data warehouse or from multiple source systems, and they tend to be hosted on typical, run-of-the-mill servers.

Hadoop
Hadoop is suitable for structured, unstructured, and semi-structured data, and can run on premises or in the cloud. Hadoop is a great place to load and maintain high volumes of data that should be kept but is not typically used for frequent, high-end analytics supporting many simultaneous users.

Moving data that's not queried but still needs to be maintained into a lower-cost platform like Hadoop can help to support and balance data growth. As a result, an enterprise can reduce the need for more storage capacity and reduce the number of EDW nodes. This lowers both maintenance costs and costs related to adding more capacity. The key is to figure out what you're loading into each of these systems, and move data as necessary to the most appropriate platform.
CUSTOMER SUCCESS STORY

[Callout: Online travel company. Optimized data and workloads for Hadoop cluster; reduced data footprint on EDW by 30-40%; 10X in cost savings.]
Reining in Data Growth Costs
An online travel company’s 6+ petabyte production IT systems were
growing rapidly within a multi-platform environment that included
Hadoop and several legacy data warehouse systems. The DB2 data
warehouse was already at 300 TB, and adding more capacity was simply
cost prohibitive.
Using Attunity Visibility to balance workloads and data across the data
warehousing environment had a significant impact on costs associated
with data growth. The online travel company reduced its data footprint
on the EDW by 30-40%. Offloading data and associated workloads to
Hadoop saved the company $6 million.
Furthermore, its IT department can ensure that these cost savings are maintained by providing chargeback reports to business lines. By showing business users what data is being used and at what cost, IT can make a case for moving data to lower-cost platforms or for making additional investments in IT.
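A chargeback report of this kind is simple arithmetic over usage metadata. The sketch below is hypothetical: the cost-per-TB figure, department data, and `chargeback_report` helper are invented for illustration, loosely calibrated to the $2-6 million per 50-100 TB support cost cited earlier.

```python
# Assumed support cost: $60k per TB per year (i.e. ~$3M per 50 TB).
COST_PER_TB = 60_000

# Hypothetical usage metadata: dept -> (TB loaded, TB that received queries).
departments = {
    "marketing": (40.0, 15.0),
    "finance": (10.0, 9.0),
}

def chargeback_report(departments, cost_per_tb):
    """Allocate support cost by TB loaded and flag each line's unused share."""
    report = {}
    for dept, (loaded_tb, used_tb) in departments.items():
        report[dept] = {
            "cost": loaded_tb * cost_per_tb,
            "unused_share": 1.0 - used_tb / loaded_tb,
        }
    return report

for dept, row in chargeback_report(departments, COST_PER_TB).items():
    print(f"{dept}: ${row['cost']:,.0f}/yr, {row['unused_share']:.0%} unused")
```

Showing a business line that a large share of the data it pays to maintain never gets queried is exactly the conversation starter the chargeback approach relies on.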
DON'T KEEP WHAT YOU DON'T NEED

Even as you move data to the appropriate platform, it behooves you to consider whether it's necessary to keep specific datasets at all. There's great potential to lower costs by purging unused data. Many Attunity customers report that more than one-third of the data in the data warehouse never receives a single analytical query in a month. That's a huge chunk of data, and potential cost savings.

In order to determine what data is worth keeping, IT must analyze data usage and collaborate across teams to classify data into four categories:

Category 1: Data that doesn't need to be kept at all and can be purged. This data isn't used for analysis, and it doesn't need to be archived.

Category 2: Data that must be kept for regulatory or other reasons but isn't being used for analytical purposes. These datasets do not require a high-end engineered EDW. They can be placed in a Hadoop cluster or something less cost-prohibitive. Hadoop is a perfect option because it's a less expensive system that allows you to continue to do all the data processing and maintenance and still have access to the data, because it's a live platform. So when you do need the data, you can access it directly in Hadoop or move it into the data warehouse for analysis, on premises or in the cloud.
CUSTOMER SUCCESS STORY

[Callout: Large financial institution. Capped IT infrastructure investment at existing capacity; avoided $15M in upgrade costs; ready to handle faster rates of data growth in the future.]
Overcoming Data Growth and Regulatory Compliance Challenges
Data growth made it difficult for a leading national bank to manage data and maintain regulatory compliance. With data growing at 100-150% a year, the bank was quickly running out of capacity. It expected to spend $10-15 million in 12-18 months on hardware upgrades. Meanwhile, IT had no way of tracking who accessed what data at the table and column level, which is necessary to fulfill regulatory compliance and audit requests.

Attunity Visibility enables the IT organization to make informed decisions about the datasets and related workloads that can be rebalanced and optimized with Hadoop. As a result, the institution capped its IT infrastructure investment at existing capacity to avoid $15 million in upgrade costs, while also empowering its teams to handle faster rates of data growth in the future. Attunity Visibility also helps the bank meet regulatory compliance requirements and respond to audit requests in a timely manner. The solution identifies user activity related to specific customer data at a granular level and generates weekly audit reports.
Category 3: Datasets that are analyzed but don’t require an engineered
EDW, such as large-scale data extracts for offline analytics. SAS is a
good example. Many SAS users access data that’s in a data warehouse,
but they don’t do the analytics in the data warehouse. Instead, they
extract huge amounts of data into the SAS server for data mining. This
use case doesn’t require an engineered system like an EDW. Hadoop
does a great job for batch analytics, and it costs less. You can pull huge
streams of data back to the SAS server and analyze it there.
Category 4: Data that’s widely and repeatedly leveraged by the
business, and therefore suitable for storage in your EDW.
In order to categorize data, you need to understand what the datasets are and what users are doing with them. You must then get buy-in from the stakeholders. Show usage patterns to the business and collaborate with them to make decisions in an iterative fashion. Over time, the returns are significant.
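The four-category triage can be expressed as a first-pass rule that stakeholders then refine iteratively. The sketch below is a minimal illustration: the inputs (`queries_per_month`, `must_retain`, `needs_edw`) are hypothetical stand-ins for the usage metrics and policy decisions the text describes.

```python
def classify(queries_per_month, must_retain, needs_edw):
    """First-pass mapping of a dataset onto the four categories."""
    if queries_per_month == 0:
        # Unused data: purge it (Category 1), unless retention rules
        # require keeping it on a cheaper live platform (Category 2).
        return "maintain in Hadoop" if must_retain else "purge"
    # Used data: batch/offline workloads fit Hadoop (Category 3);
    # complex, highly concurrent analytics justify the EDW (Category 4).
    return "keep in EDW" if needs_edw else "batch analytics in Hadoop"

assert classify(0, False, False) == "purge"                       # Category 1
assert classify(0, True, False) == "maintain in Hadoop"           # Category 2
assert classify(40, False, False) == "batch analytics in Hadoop"  # Category 3
assert classify(500, False, True) == "keep in EDW"                # Category 4
```

A rule like this only produces candidates; the buy-in step above decides which moves actually happen.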
GETTING WHAT YOU NEED TO MANAGE YOUR DATA

Effective data management requires two primary capabilities:

- Integrate and move data more easily across all major relational database systems, enterprise data warehouses, and cloud and big data platforms.
- Tune performance, optimize data placement, and reduce costs with metrics on how the business is utilizing data and platform resources.

In addition to getting real value out of data, effective data management enables IT organizations to reduce big data costs. With visibility into how data is used, IT can work with the business to make informed decisions about what data is worth keeping, how it should be stored, and what data can be purged or archived. This practice has even enabled some IT organizations to cap their IT infrastructure investments at existing capacity.

Being called on to do more with less is nothing new for IT. Time and again, IT organizations learn to work smarter and leaner while delivering key services to the business. Big data analytics is no different.

[Diagram: Effective Data Management Capabilities: Prepare Data, Move Data, Analyze Usage]