Building a Big Data Analytics Platform:
Going beyond the Traditional Enterprise Data Warehouse

WHITE PAPER
Abstract
In this white paper, Impetus Technologies focuses on the need
for building a Big Data analytics platform for better business
insights.
It also looks at why organizations need to design an Enterprise
Data Warehouse (EDW) to support the business analytics derived
from the Big Data.
Additionally, it discusses the options and challenges involved in building a
successful EDW architecture that meets the new Big Data business
requirements. It explains why such an architecture may require deep
integration with semi-structured and unstructured data sources, which can be
very large or streaming, and which are accessed through Hadoop as well as
massively parallel databases.
Impetus Technologies, Inc.
www.impetus.com
Table of Contents
Introduction
Limitations of traditional EDWs
The key features of a Big Data Analytics platform
Options available for building the Big Data platform
Using Open Source to build Big Data solutions
Opting for a Hybrid solution
Harnessing existing investments in building a Big Data Analytics platform
Summary
Introduction
In the post-recession world, organizations are under pressure to maximize
profits and reduce expenditure. Business owners need to find the right target
users, figure out the distribution channels, and successfully sell their
offerings, all while keeping stakeholders happy.
Moreover, every time the business comes up with new products or campaigns,
or wishes to evaluate its existing business performance, it has to deal with the
following questions: What kind of products are my customers interested in?
Where should I open my new store next year? What is the most effective
distribution channel?
Traditionally, businesses have used Enterprise Data Warehouse (EDW)
solutions to provide analytics and gain deeper insights that address their
business requirements and expansion plans.
An EDW can play a pivotal role in an enterprise IT strategy. A comprehensive
EDW plan provides companies with the following benefits:
• Enables disciplined data integration within a large enterprise
• Generates output and facilitates effective representations of all
business processes
It’s important to examine how the traditional EDW works. Traditional data
sources include an operational database, old archived data, flat/XML files, or
ERP systems. The data is extracted, cleaned, and transformed into the desired
format, and then loaded into the data warehouse storage system. This data can
be further divided into marts. Once the data is available in the central EDW,
query or reporting tools are used for analytics, while for deeper or forecast-
based analysis, data mining tools are used.
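The extract-clean-transform-load flow described above can be sketched in miniature. This is an illustrative sketch only: the sales table, the field names, and the use of SQLite as the warehouse store are all hypothetical stand-ins, not part of any particular EDW product.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a flat-file source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean rows and reshape them into the warehouse format."""
    cleaned = []
    for r in rows:
        if not r.get("customer_id"):          # drop incomplete records
            continue
        cleaned.append((r["customer_id"].strip(),
                        float(r.get("amount") or 0)))
    return cleaned

def load(rows, db_path=":memory:"):
    """Load: insert transformed rows into the central warehouse store."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    count = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    con.close()
    return count
```

Once the data is loaded, the query and reporting tools mentioned above would run directly against the warehouse tables.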
The question however is whether such data warehouses are ready to deal with
Big Data and more importantly, what is Big Data?
The term Big Data describes data sets that cannot be managed or processed by
traditionally used software tools within an acceptable elapsed time. Big Data
sets are constantly growing, and can range from a few terabytes to many
petabytes; the total volume of digital data worldwide is expected to reach
around 35 zettabytes by the year 2020!
Traditional Enterprise Data Warehouses have fallen short of expectations when
it comes to handling Big Data, in the following areas:
• Handling very large data sizes
• Storing and managing the data
• Gaining insights from the data
• Controlling the costs involved in dealing with Big Data
Limitations of traditional EDWs
Let us examine the limitations of traditional EDWs.
Traditionally, Enterprise Data Warehouses focused only on transactional or
archived data. However, in the last few years, the need to capture additional
data for deeper insights has arisen. This includes real-time data, which may
be low-latency operational data, or customer behavior data that captures
sub-transactional processes. At the same time, additional data sources such
as devices and sensors have also emerged.
Social media also provides valuable information on product preferences and
user sentiment. The large volumes of unstructured data generated by Web
applications are extremely useful for generating business intelligence.
It is clear that traditional EDWs cannot gain meaningful insights from Big Data.
This is largely because they were never meant to handle terabytes and
petabytes of data; most of these systems were designed in the 1990s around
the relational database technologies of that time.
Another difference is that in place of Extract-Transform-Load (ETL), Big Data
warehouses need ELTL, which is Extract-Load-Transform-Load. The new system
needs a staging area where data is loaded before the cleansing and
transformation operations.
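As a rough illustration of the ELTL pattern, the sketch below loads raw records untouched into a staging table first, and runs the cleansing/transformation step afterwards inside the store. The table names are hypothetical, and SQLite stands in for both the staging area and the warehouse.

```python
import sqlite3

def eltl(raw_records):
    """Extract-Load-Transform-Load: load raw data first, transform second."""
    con = sqlite3.connect(":memory:")
    # Extract + Load: dump the raw data as-is into a staging table
    con.execute("CREATE TABLE staging (customer_id TEXT, amount TEXT)")
    con.executemany("INSERT INTO staging VALUES (?, ?)", raw_records)
    # Transform + Load: cleanse inside the store, then build the final table
    con.execute(
        "CREATE TABLE sales AS "
        "SELECT TRIM(customer_id) AS customer_id, "
        "       CAST(amount AS REAL) AS amount "
        "FROM staging WHERE customer_id <> ''"
    )
    return con
```

Deferring the transform this way lets the raw data land quickly, which matters when the incoming volume is too large for an in-flight cleansing step.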
Traditional relational database solutions are not suitable for a majority of
these data sets: the data is too unstructured and/or too voluminous for a
traditional RDBMS to handle, and is often difficult to analyze with SQL alone.
A fixed database schema does not allow complex unstructured formats to be
defined and managed in these data warehouses. Moreover, the costs involved
in handling these new data sets with traditional technologies are also very
high.
Clearly, existing EDW environments, which were designed decades ago, lack the
ability to capture and process the new forms of data within reasonable
processing times. Moreover, these traditional EDWs have limited capabilities
when it comes to analyzing user behavioral data.
Cost is another important factor. Organizations are currently spending
hundreds of thousands of dollars per terabyte per year to produce and
replace data in their existing environments. Additionally, the models in use
tend to require specialized hardware, which in turn drives up the
dollars-per-terabyte cost, making large-scale deployments expensive. It is
also very hard to predict the infrastructure workload involved in managing
this Big Data.
The key features of a Big Data Analytics
platform
To manage the Big Data trend, a new breed of Open Source and proprietary
Big Data technologies that leverage commodity hardware has emerged. A
Big Data Analytics platform helps capture and analyze these new data sets.
The ideal Big Data Analytics platform needs to match up to these key
characteristics:
• It should have the ability to scale easily to support large data, which will
typically be in terabytes or petabytes.
• The system should ideally be distributed across processors in a
location-transparent manner.
• It should enable quick responses to highly complex queries, and
support a wide variety of data types.
• It should be able to incorporate machine learning, provide
recommendations, and execute analytics on real-time incoming data
such as logs, as well as provide domain-specific canned reports.
• It should be able to handle data from heterogeneous data sources,
while providing high loading and analysis rates, as well as the ability
to handle failover.
Options available for building the Big Data
platform
It is important to understand that for building a Big Data analytics platform, any
single vendor technology may not be sufficient. The platform should have
certain capabilities to address specific sets of requirements.
There are two different approaches that are being used to address Big Data
analytics.
The first one is using Massively Parallel Processing (MPP) and columnar
databases. This solution can help address scaling, distribution, load
management, response time
and failover management issues. Additionally, it may also have some domain
specific capabilities to provide a ready-made solution.
The second option is using MapReduce implementations. This programming
model was originally developed at Google for large-scale data processing,
and is now freely available through the Open Source Apache project called
Hadoop.
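The MapReduce model is easy to illustrate with the classic word-count example. The toy, single-process sketch below mimics the map, shuffle, and reduce phases; it is not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Shuffle + Reduce: group pairs by key, then sum each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    """Run the full map-shuffle-reduce pipeline over a document collection."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(pairs)
```

In Hadoop, the map and reduce functions run in parallel across many nodes, and the framework handles the grouping, distribution, and failover.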
Companies, therefore, have the option to choose between Open Source
solutions and commercial options. Alternatively, they can build a hybrid
solution that mixes different capabilities to handle the Big Data paradigm.
The commercial tools of today have strong analytical capabilities, as well as
sophisticated reporting and OLAP cube features. There are a large number of
vendors in the market who are offering solutions for the main components of
the EDWs, which are ETL, query tools and BI.
Some of the commercial options for MPP are Greenplum, Teradata, etc.
Informatica is an example of a commercial ETL tool, while a few commercial
solutions for BI and analytics are Pentaho, Business Objects, and
MicroStrategy, among others. It is possible to build a Big Data warehouse
solution using these commercial products together.
Using Open Source to build Big Data solutions
Every organization, big or small, is now focused on cutting IT expenditure.
Despite this, business analytics remains a major business driver for these
companies. Scaling commercial solutions to really huge data volumes and
deeper BI, however, can result in exorbitant licensing costs.
This is clearly not a viable proposition. Companies can instead choose from the
numerous Open Source implementations that are available. Lower costs,
extensibility, and integration are some of the benefits that organizations realize
from Open Source solutions. The good news is that the community is
continuously making efforts to enhance these features and add new
functionalities to these solutions.
Some of the Open Source solution stacks in the analytics world are JasperSoft
and Pentaho Reporting, while the Open Source ETL tools include CloverETL,
Talend, etc. Pentaho also provides commercial extensions of its solution.
Apache Hadoop provides an Open Source implementation of the MapReduce
framework, while Cassandra provides scalable distributed data storage. These
products solve huge data storage issues and provide ETL and analytics support.
Opting for a Hybrid solution
In this scenario, it is possible to use an Open Source solution for ETL or BI and a
commercial solution for analytics, or vice-versa. Hadoop and MPP solutions, for
instance, can work together as ETL pipes along with a commercial analytics tool.
Alternatively, MPP and columnar databases can be combined with MapReduce to
provide another effective hybrid solution.
When there are larger volumes of data to be analyzed, organizations are better off
using Open Source solutions. Hadoop is one of the best available Open Source
options for handling Big Data in a cost-effective manner. It also makes sense to use
parallel processing or other fast mechanisms when importing data from the source
system or exporting it to the destination system.
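A minimal sketch of such a parallel import follows, where fetch_chunk is a hypothetical stand-in for whatever chunked reader the real source system provides.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_import(chunk_ids, fetch_chunk, workers=4):
    """Fetch source data in parallel chunks instead of one serial stream."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker pulls one chunk at a time from the source system
        results = pool.map(fetch_chunk, chunk_ids)
    merged = []
    for rows in results:
        merged.extend(rows)
    return merged
```

The same pattern applies on the export side, with each worker writing one chunk to the destination system.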
Incidentally, true ‘real time’ is a myth in Big Data: the data warehouse system has to
be carefully designed so that real-time data is limited by size or by time. It is also
possible to re-use some of the existing EDW investments in building a Big Data
platform.
The Impetus solution
Based on its project experiences, Impetus Technologies has built a Big Data
Analytics platform for its clients that can help them roll out their Big Data
Analytics initiatives. The platform is called iLaDaP, which is short for Impetus
Large Data Analytics Platform.
The core of the iLaDaP platform is built using SOA, and incorporates all the key
characteristics of an ideal Big Data Analytics platform discussed earlier. iLaDaP is
designed to derive intelligence and operate on huge datasets collected from
numerous data sources in multiple data formats. It is powered by Hadoop, and
therefore, can linearly scale up to thousands of nodes using commodity
hardware. This spells a significant cost advantage in the long run. iLaDaP also
comes with a set of pre-canned and customized reports.
Recognizing that it is important for businesses to spot and take advantage of
opportunities as they happen, Impetus’ platform enables them to react to events as
they occur. iLaDaP is also capable of collecting data from a range of disparate sources.
This unstructured data can be transformed and utilized for strategic business
decisions.
iLaDaP can be seamlessly integrated with current platforms, without the need for
major changes. The core iLaDaP platform is built using Open Source technologies,
where the components can be replaced with other commercial technologies, in
accordance with requirements.
Harnessing existing investments in building a
Big Data Analytics platform
It is possible to reuse investments made in the traditional data warehouse to
build a Big Data Analytics platform. Most of the hardware can be reused, since
Big Data solutions run on commodity-grade hardware; therefore, an existing
RDBMS-based infrastructure can be repurposed. The existing code logic and
algorithms can also be used after minor modifications that enable them to run
in a stateless architectural environment. In this scenario, tools like MATLAB
can be integrated with Hadoop-like technologies.
Another way of utilizing the data warehouse investments is by extending or
enhancing their capacity by plugging them together with a Big Data warehouse
solution. Hadoop, for example, is a cost-effective option for storing archival
data, performing deeper analytics, and providing summarized reporting data
to an existing data warehouse. This strategy can also help in reusing the reporting
tools. Similarly, ETL tools can be modified to use the Big Data warehouse as
sinks. Tools like Talend or Informatica provide connectors for using Hadoop and
commercial MPPs as data sinks.
The development and testing strategy can also be reused. Most of the new Big
Data warehouse solutions support SQL, Java, or scripting languages, and allow
the re-use of existing development and testing investments.
Organizations can deploy iLaDaP on-premise, as well as in a Cloud-supported
deployment setup.