3. 2 TDWI RESEARCH tdwi.org
TDWI CHECKLIST REPORT: ACTIVE DATA ARCHIVING FOR BIG DATA, COMPLIANCE, AND ANALYTICS
Data archiving presents various problems in the enterprise today.
Many organizations don’t archive at all. Others mistakenly think that
mere data backups can serve as archives, whereas tape is actually
the final burial place of data, from which it rarely returns. Equally off
base, others believe a data warehouse is an archive. Although it’s true
that data archiving processes exist today in some organizations, these
are rarely formalized or policy driven, such that data is archived in an
ad hoc fashion (typically per application or per department) without an
enterprise standard or strategy.
Even when an organization makes an honest attempt at an enterprise
data archive, the result is usually not trustworthy (because
data is easily altered), not auditable (due to poor metadata and
documentation), not compliant (due to inadequate usage monitoring or
the inability to purge data at specified milestones), and not properly
secured (lacking encryption, masking, and security standards).
Furthermore, with most existing data archives, it’s hard to get data in
with integrity and out with speed because the primary platform is not
online, active, and highly available.
Why don’t more organizations invest in formal archiving processes and
technical solutions? Most likely, it is because of a common belief that archives
provide little or no return on investment (ROI) because users rarely
(if ever) access the archive. Without prominent and frequent usage, a
respectable ROI is unlikely.
A data archive can achieve ROI by serving multiple uses and
users from an online, active platform. Yes, organizations do need
to retain data; that’s not in question. However, archived data is not
just insurance for compliance, audit, and legal contingencies. Those
are important goals, but a data archive should also be treated as an
enterprise asset to be leveraged, typically via analytics. Hence, a data
archive can be more than a cost center; it can achieve ROI when it
serves multiple uses (archiving, compliance, and analytics of deep
historical data sets) and it manages data online for active access at
any time by a wide range of users.
Users must start planning today for active data archiving. To help
them prepare, this TDWI Checklist report will drill into the desirable
attributes, use cases, user best practices, and enabling technologies
of active data archiving.
FOREWORD
There are compelling reasons for improving data archives.
Traditional reasons for data archives still apply: namely, supplying
data for compliance, audit, and legal requirements. However, a
modern online data archive brings greater speed, accuracy, and
credibility to these tasks so they are a smaller drain on enterprise
processes and resources.
New reasons have come into play as well: namely, organizations’
voracious hunger for actionable insights discovered through advanced
analysis of raw source data, big data, and a broadening diversity of
data types. One of the most influential changes, however, concerns
the state-of-the-art in data platforms—both hardware and software.
Their speed, scale, and functionality continue to rise even as their
costs fall, which in turn makes the improvement of users’ data
archive solutions feasible for both technical and financial reasons.
Active data archiving can address these problems and
opportunities. Enterprises need to embrace the emerging practice of
active data archiving along with its enabling technologies. A modern
solution for active data archiving will:
• Be built primarily for compliance or data governance but also
serve the archival needs of analytics and sometimes data backup
and disaster recovery.
• Be open to active access by a wide range of users, including
those who need simple lookups and easy data exploration.
• Manage data as an immutable record that cannot be altered so
that data is trustworthy for compliance and legal requirements.
• Be secured like a bank vault, for data security, privacy, and trust,
using role-based permission access, data masking, encryption,
and multiple data security standards.
• Scale up to multi-terabyte and petabyte data volumes using
fast bulk loads and data compression, both to embrace new big
data sources and because archives inevitably grow over time.
• Operate online with high availability around the clock to enable
active data loads and extracts that keep the archive current up
to the minute. Furthermore, data is constantly appended to an
active archive without downtime or performance degradation.
• Support high-performance access based on SQL and other
standards because users expect quick responses as they run
queries and searches against archived data.
NUMBER ONE
EMBRACE MODERN PRACTICES AND PLATFORMS FOR
ACTIVE DATA ARCHIVING
Two broad archive categories—defined by their content and the
primary use of that information—can coexist and overlap in active
data archiving solutions:
• Compliance archives: Data retained with the content, in the format,
and for the timeframes prescribed by legislation and other obligations
(e.g., to partners and lenders, or arising from legal liabilities)
• Analytic archives: Detailed source data from operational
and transactional applications, extracted for general business
intelligence purposes but retained for advanced analytics (as
defined in the next section of this report)
Compliance archives have a number of desirable process and
technical attributes:
Data that’s properly archived is solid evidence of an
organization’s compliance. In legal terms, honest attempts at
archiving constitute proper intent, whereas a lack of archiving may
be construed as malfeasance.
Data archived for compliance must support appropriate
regulations. These vary by industry. For example, in the United
States, the most stringent regulations target banking and the
financial services industry as seen in the Dodd-Frank legislation
or SEC Rule 17a-4. Similarly, the telecommunications industry is
subject to legal hold and lawful intercept requirements that demand
timed data retention.
Archived data must be tamper proof to be trusted. Most archived data is
captured and stored in its original form so it’s a credible representation
of a transaction, report, business process, or other event at a
specific time. If archived data becomes altered, it is no longer
considered credible. For example, stock trades are stored for exact
timeframes, to protect both trader and institution. Transparency is
of the utmost importance to compliance archives, and WORM (write
once, read many times) storage has become key.
Archived data demands a convincingly documented audit trail.
Most audits commence with a request for information, followed
by a request for an audit trail for supplied information. With data
stored properly in an active archive, audits go faster—perhaps more
accurately, too—than with traditional offline, ad hoc archives. The
speedy, documented response builds confidence with auditing bodies
and contributes to favorable outcomes.
An active data archive should have tracking functions so an
organization can monitor and study its own activities to assure
compliance and make improvements. The same tracking functions
can flag data that has aged beyond its compliance requirements
and should be deleted.
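The flagging of aged-out data described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular archiving product: the `ArchivedRecord` type, the `retention_days` field, and the sample dates are all hypothetical, and a real policy would likely be defined per record class and jurisdiction.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ArchivedRecord:
    record_id: str
    archived_on: date
    retention_days: int  # hypothetical policy value, assumed set per record class


def expired_records(records, today):
    """Return the records whose retention period has fully elapsed."""
    return [
        r for r in records
        if r.archived_on + timedelta(days=r.retention_days) < today
    ]


# Sample data invented for illustration only.
records = [
    ArchivedRecord("trade-001", date(2015, 1, 10), retention_days=365),
    ArchivedRecord("trade-002", date(2020, 6, 1), retention_days=3650),
]
flagged = expired_records(records, today=date(2021, 1, 1))
print([r.record_id for r in flagged])
```

The function only flags candidates for deletion. Whether a flagged record is actually purged would presumably remain a governed decision, for example when a legal hold is in force.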
ASSURE AND IMPROVE DATA GOVERNANCE BY USING
A COMPLIANCE DATA ARCHIVE
NUMBER TWO
Archiving operational data for analytic purposes is on the rise.
As more advanced forms of analytics have gained credence over the
last 15 to 20 years, user organizations have been retaining more
detailed source data. The traditional practice was to extract data
from operational applications and other sources, process that data
and load the results into a DW, then delete the extracted source
data. The accepted practice today keeps most source data because
it is also the preferred material for analytics based on data mining,
statistical analyses, natural language processing, and SQL-based
analytics.
An analytic archive and a data warehouse are similar but
different. Because of the stepped-up data retention, the data
staging areas within most data warehouse architectures today
are bigger than their core warehouses. This is tantamount to data
archiving, though few BI/DW professionals call it archiving. All they
know is that they have to do something to improve the content and
accessibility of their analytic data archives. Furthermore, they need
to offload this burden from core warehouses, which have higher
priorities than analytics (namely reporting, OLAP, and performance
management). Hence, as BI/DW professionals ponder where to put
certain classes of analytic data, they should consider a platform for
active data archiving.
An analytic archive easily integrates with multi-platform DW
architectures. DW system architectures have always been multi-
platform, but this trend has accelerated in recent years as users
have extended their DW environments by adding new platforms for
columnar databases, appliances, NoSQL, and Hadoop. An additional
platform—one that specializes in archiving data for advanced
analytics—would wring more value from archived source data and
easily integrate with multi-platform DW architectures.
A data archive can future-proof analytic applications. Most data
warehouses are designed by their users (not vendors) for the data
requirements of reporting, OLAP, and performance management.
These practices need calculated, aggregated, standardized, and
time-series numeric values modeled in multidimensional structures
that don’t exist in source systems. Advanced analytics has different
data requirements. It needs a very large store of unaltered (or lightly
transformed) detailed source data. Other than that, it’s impossible
to anticipate data requirements for future analytic applications (AA).
Accordingly, an analytic archive preserves source data in its original
form, so the source is there for future AAs to explore and repurpose.
CONSIDER AN ANALYTICS ARCHIVE FOR CRITICAL,
HIGH-VALUE, AND AGING ANALYTICS DATA
NUMBER THREE
A data archive has to be more than a dumping ground. For one
thing, there needs to be a strategy based on new and evolving
user requirements for aging, less frequently accessed data and
other metrics for identifying which data should be archived at
what level and on what schedule. Note that not all data should
be archived: some data belongs elsewhere, say, in its original
application database or in a data warehouse. Archive specialists
need to interview a broad range of business users and managers to
determine users’ needs for archived data. If your organization has
a legal department and compliance officers, give priority to their
needs without neglecting the rest of the enterprise.
On a technology level, develop interfaces and integration logic for
getting data into the archive quickly and in lightly transformed
states that are conducive to query and search, without altering
the essential content of archived data. Finally, assume that all the
data in the archive needs an audit trail and documentation (via
metadata, etc.) that is sufficient to satisfy even the most aggressive
users and auditors.
What if data comes from applications that have been upgraded or
customized (which can alter data models)? Look for a data archiving
platform that can manage changing data models. That way, the
platform understands changes to source schemas and adjusts
metadata and pointers accordingly.
What if archived data comes from an application that was
decommissioned (also known as application retirement)? When
the only application that can read a dataset with full integrity is
gone, that application’s data may need to be lightly transformed
before entering an archive (or after it’s in the archive) so it can be
easily accessed by common query and search tools. This practice is
inspired by data warehousing but it does not require the full-blown
time, skills, and expense of the average data warehouse.
Some archived data needs encryption (for security) or compression
(to reduce its storage footprint). Look for a platform that can apply
these and other data operations as data enters the archive or after
data is in the archive. Furthermore, as data growth rates continue to
rise over time and business demands for retaining older data grow,
data should be stored in a compressed state to optimize storage
capacity and scale over time. Similarly, the security classification of
data can change as organizational rules and policies evolve.
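As a rough sketch of applying a data operation as data enters the archive, the following compresses each payload on ingest and keeps a digest of the original bytes so that later retrieval can confirm the content was not altered. The function names `ingest` and `retrieve` and the sample payload are assumptions made for illustration. Encryption is deliberately left out because it would need a vetted cryptography library and key management, which are beyond this sketch.

```python
import hashlib
import zlib


def ingest(payload: bytes) -> dict:
    """Compress a payload and keep a digest of the original bytes."""
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "original_size": len(payload),
        "compressed": zlib.compress(payload, level=9),
    }


def retrieve(entry: dict) -> bytes:
    """Decompress an entry and verify it still matches the stored digest."""
    payload = zlib.decompress(entry["compressed"])
    if hashlib.sha256(payload).hexdigest() != entry["sha256"]:
        raise ValueError("archived payload failed its integrity check")
    return payload


# Highly repetitive sample data, invented for illustration only.
raw = b"2014-03-01,ACME,BUY,100\n" * 1000
entry = ingest(raw)
restored = retrieve(entry)
print(entry["original_size"], len(entry["compressed"]))
```

Storing the digest of the uncompressed bytes means the essential content can be checked independently of whichever compression setting is in use.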
RETHINK HOW DATA IS COMMITTED TO AN ARCHIVE
NUMBER FOUR
Let’s be honest: We’ve all worked in organizations where archives
were purely pro forma, without a credible effort to preserve data in
a state that’s quickly or easily accessed by anyone, much less the
growing number of employees who can benefit from accessing the
information. Luckily, this old “worst practice” is giving way to the
realization that all enterprise datasets—including archived data—
are valuable assets that can contribute to many business goals. The
recent craze for analytics with big data has led many organizations
to seek more business value from their datasets.
With that in mind, active data archiving is a bit of a culture shock
in some organizations. To get past the shock, these organizations
need upper management to define a mandate for modern archiving
based on the following goals:
Archived data must be leveraged. Typical use cases include
fast, documented auditing for compliance, a source for analytic
applications, data exploration, and information lookups.
Some data will come out of the archive to be used elsewhere.
To enable a broad range of users, tools, and purposes, the archive
should support both query and search mechanisms. Furthermore, the
archive should serve as a source for other data platforms, especially
those for business intelligence and analytics.
A growing constituency of users will have access to archived
data. This is a sticky point in organizations that define data
governance and compliance as the process of limiting data access.
The catch is to balance access and control, typically through well-
defined user types controlled via role-based user access and strong
security features in the archival platform.
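To make the balance between access and control concrete, here is a minimal sketch of role-based permissions. The role names and action names are hypothetical, and in practice they would more likely come from an existing directory service than from a hard-coded table.

```python
# Hypothetical roles and actions, invented for illustration only.
ROLE_PERMISSIONS = {
    "auditor": {"search", "query", "view_audit_trail"},
    "analyst": {"search", "query", "extract"},
    "archive_admin": {"search", "query", "extract", "load", "purge_expired"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role is known and explicitly grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("analyst", "extract"))
print(is_allowed("analyst", "purge_expired"))
print(is_allowed("contractor", "search"))
```

Unknown roles get an empty permission set, so the sketch denies by default. That is one reasonable way to widen the user base without loosening control.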
Accessing archived data will be timely. First, to be truly active,
the archive must be online like a database, not offline like magnetic
tapes and optical disks or any media that demand a distracting
and time-consuming restoration process. Second, data access
mechanisms should perform at or near real time for the sake of user
productivity.
RETHINK HOW ARCHIVED DATA IS ACCESSED AND
USED ACTIVELY
NUMBER FIVE
For a data archive to be truly active, its primary tier should be
based on a robust database management system (DBMS). The
DBMS must include traditional relational functions (for query and
data exploration) and functions for multiple security strategies,
scalability, and high availability. The assumptions here are that
most data being archived will be structured and that most users
and applications will need to access data via queries. Even so, some
functions of the DBMS should be controlled; for example, inserting
and updating data can destroy data’s original state, whereas
appending data avoids such integrity problems. In addition to
relational technology, free text search is critical to finding records of
interest and to enabling non-technical users.
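One way to picture a DBMS whose modifying functions are controlled is the sketch below, which uses SQLite from the Python standard library purely as a stand-in for an archive DBMS. It treats a plain `INSERT` of a new row as the append operation and uses triggers to reject in-place `UPDATE` and `DELETE` statements. The table and trigger names are assumptions, and a production archive would rely on its own enforcement mechanisms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE archive (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL
    );
    -- Reject any attempt to change or remove an archived row.
    CREATE TRIGGER archive_no_update BEFORE UPDATE ON archive
    BEGIN
        SELECT RAISE(ABORT, 'archive is append-only');
    END;
    CREATE TRIGGER archive_no_delete BEFORE DELETE ON archive
    BEGIN
        SELECT RAISE(ABORT, 'archive is append-only');
    END;
    """
)

# Appending a new row is permitted.
conn.execute("INSERT INTO archive (payload) VALUES (?)", ("trade 1",))


def try_statement(sql: str) -> bool:
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False


update_ok = try_statement("UPDATE archive SET payload = 'tampered'")
delete_ok = try_statement("DELETE FROM archive")
rows = conn.execute("SELECT payload FROM archive").fetchall()
print(update_ok, delete_ok, rows)
```

Queries remain fully available, so the relational functions needed for lookups and exploration are untouched while the original state of each row is preserved.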
An active data archiving platform can host many archives, each
with its own unique requirements, similar to how a DBMS can
manage several databases (defined as collections of data). Thus,
multi-tenancy is another key assumption for a modern data archive.
In most cases, an archive platform is not a data processing or
analytics platform. Hence, archived data is best extracted, then
moved to a DBMS or other data platform that is more conducive
to in-database analytics, intense SQL-based analytics, and
miscellaneous forms of advanced analytics. For these purposes,
mature organizations already have in place relational data
warehouses, columnar databases, and DW appliances, possibly
NoSQL databases and Hadoop. As an exception, when an active
archive runs atop Hadoop, it may make sense to process and
analyze data on the same platform where it’s archived. Note that
the DBMS in the primary tier of a data archive does not replace
other DBMSs, especially not those deployed for analytics. Instead, it
complements them and (in addition to its archival purpose) serves
as yet another source of data for analytics (largely historical data).
The storage tier of an active archive should be diverse. This is
to accommodate subsystems users already have as well as newer
commodity-priced types such as content-addressable storage (CAS) hardware or the Hadoop
Distributed File System (HDFS). Even a modern active archive might
include systems for magnetic tape and optical disk in the storage
tier. After all, many organizations have pre-existing magnetic tape or
optical disk libraries that they must maintain. Note that these archaic
media are antithetical to an active data archive; if possible, their
data should be migrated into the active archive so it’s online and
available when users need it.
In the case of a compliance archive (for, say, a financial services
institution), the archive must reside in a WORM storage platform.
This, in turn, requires a DBMS that supports WORM devices.
WORM technologies are worth the investment because they keep
compliance and risk officers happy and they avoid fines, penalties,
and damaging publicity.
DEPLOY ARCHIVING SYSTEMS THAT HAVE MULTIPLE
STORAGE AND PROCESSING TIERS
NUMBER SIX
Users should consider Hadoop as both a highly scalable storage
platform for archiving and a low-cost processing platform for
analytics. Note that open-source Hadoop’s poor support for two key
standards—SQL (and other relational technologies) and security
(especially LDAP and Linux PAM)—keeps it unpalatable for mature
IT organizations.
Despite these two limitations, Hadoop has roles to play in multi-
platform archive architectures. Hadoop excels with very large data
volumes, as well as with file-based data, data documents (XML
and JSON), textual content (e-mail and word processing files),
unstructured and non-relational structured data, and schema-free
data. Hadoop’s low price is appropriate to many kinds of lower-value
(but high-volume) historic data, such as Web logs. However, due
to limitations in current releases, purely open-source Hadoop may
not be the best choice for structured data that needs relational
processing (such as intense SQL or multi-way joins) or sensitive
data that demands high security. That’s not a show stopper because
a number of software vendors offer products that integrate with
Hadoop to give it stronger and broader support for security and
relational technologies like standard SQL.
Consider economics as you select platforms, tools, and
features for a new active archiving architecture. For example,
it’s technically possible to include almost any brand of relational
DBMS in an archiving solution. However, the older and more mature
vendor brands are relatively expensive, especially once an archive
scales into multi-terabytes, and they include far more features and
functions than are required for archiving. A more cost-effective
choice is a DBMS designed for archiving or one of the newer
columnar, open-source, or appliance-based DBMSs. In this context,
Hadoop is affordable in terms of dollars per terabyte of storage.
Similarly, data compression is a feature that can reduce storage
costs because it reduces the footprint of archived data in storage.
Put succinctly, if an archive isn’t secure, it won’t meet the
compliance goals that are its primary purpose. Furthermore, if users
don’t trust the security of the archival platform, they won’t use it or
its data, and the archive will fail to demonstrate a positive ROI.
The primary line of defense is the security layer built into the
relational DBMS at the heart of an active data archiving platform.
Most mature IT departments and DBMS teams prefer role-based
approaches to security, and many have LDAP and other directories
they’d like to reuse and apply within the active archiving solution.
If Hadoop is to be part of an active archive’s infrastructure, note
that security in purely open-source Hadoop today is mostly about
general access privileges controlled through Kerberos. However, a
few third parties now offer add-on products that enable LDAP, Active
Directory, and other approaches to security for the Hadoop family of
products.
Almost all modern data archives are loaded with sensitive data
about customers, partners, employees, Social Security numbers,
credit card numbers, transactions, internal financials, and so on.
Encryption or data masking can make this data unreadable in the
eventuality of a hack or other unauthorized access.
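Data masking can be illustrated with a small sketch that hides all but the last four digits of anything that looks like a payment card number. The pattern below is a simplifying assumption (13 to 16 digits, optionally separated by spaces or hyphens) and is not a complete or validated card-number detector. The sample text is invented.

```python
import re

# Assumed pattern: 13 to 16 digits, each optionally followed by a space or hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card_numbers(text: str) -> str:
    """Replace card-like digit runs with asterisks, keeping the last four digits."""
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]

    return CARD_PATTERN.sub(_mask, text)


masked = mask_card_numbers("Paid with 4111 1111 1111 1111 on 2014-03-01")
print(masked)
```

Masking of this kind is irreversible, which suits lookups and analytics on copies of the data. Records that must remain recoverable in original form would need encryption instead.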
Additional layers of data protection may be used to keep data locked
and immutable. This provides evidence that data records and files
have not been altered, which is fundamental to a credible audit.
Likewise, records and files cannot be deleted before their retention
periods expire.
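Evidence that records have not been altered can be produced in several ways. One common technique, shown here only as an illustrative sketch and not as the method of any specific product, is a hash chain in which each entry's hash covers both its own content and the previous entry's hash. The function names and sample records are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry in the chain


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record's content."""
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(chain: list, record: dict) -> None:
    """Add a record whose hash depends on everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "hash": _digest(prev_hash, record)})


def verify(chain: list) -> bool:
    """Recompute every hash in order; any edited record breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append(chain, {"id": 1, "amount": 100})
append(chain, {"id": 2, "amount": 250})
intact = verify(chain)

chain[0]["record"]["amount"] = 999  # simulate tampering with an archived record
tampered_detected = not verify(chain)
print(intact, tampered_detected)
```

A chain like this detects alteration but does not prevent it, which is why it would complement, rather than replace, WORM storage and retention locks.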
MAKE SECURITY A HIGH PRIORITY BECAUSE IT WILL
MAKE OR BREAK AN ARCHIVE
NUMBER SEVEN
ABOUT TDWI RESEARCH
TDWI Research provides research and advice for business
intelligence and data warehousing professionals worldwide. TDWI
Research focuses exclusively on BI/DW issues and teams up with
industry thought leaders and practitioners to deliver both broad
and deep understanding of the business and technical challenges
surrounding the deployment and use of business intelligence
and data warehousing solutions. TDWI Research offers in-depth
research reports, commentary, inquiry services, and topical
conferences as well as strategic planning services to user and
vendor organizations.
ABOUT THE AUTHOR
Philip Russom is the research director for data management
at The Data Warehousing Institute (TDWI), where he oversees
many of TDWI’s research-oriented publications, services, and
events. He’s been an industry analyst at Forrester Research and
Giga Information Group, where he researched, wrote, spoke, and
consulted about BI issues. Before that, Russom worked in technical
and marketing positions for various database vendors. Over the
years, Russom has produced more than 500 publications and speeches.
You can reach him at prussom@tdwi.org.
ABOUT THE TDWI CHECKLIST REPORT SERIES
TDWI Checklist Reports provide an overview of success factors for
a specific project in business intelligence, data warehousing, or
a related data management discipline. Companies may use this
overview to get organized before beginning a project or to identify
goals and areas of improvement for current projects.
ABOUT OUR SPONSOR
www.rainstor.com
RainStor provides the world’s most efficient database solutions
that reduce the cost, complexity, and compliance risk of managing
data. With RainStor’s enterprise solutions, you can quickly deploy
an Analytical Archive or Compliance Archive so you continue to
create business value and stay compliant. RainStor runs anywhere:
on-premises or in the cloud and natively on Hadoop. Among
RainStor’s customers are 20 of the world’s largest communications
providers and 10 of the biggest banks and financial services
organizations, which use RainStor to manage historical data
while saving millions. For more info: www.rainstor.com or join the
conversation: @rainstor.