Hadoop Enterprise Readiness
Dell | Hadoop White Paper
By Aurelian Dumitru




Dell | Hadoop White Paper Series


Table of Contents
Introduction
Audience
The value of big data analytics in the enterprise
Case study: Using big data analytics to optimize/automate IT operations
Big data analytics challenges in the enterprise
The adoption of Hadoop technology
Hadoop technical strengths and weaknesses
Dell | Hadoop solutions
Dell | Hadoop for the enterprise
About the author
Special thanks
About Dell Next Generation Computing Solutions
Hadoop ecosystem component “decoder ring”
Bibliography
To learn more




This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express
or implied warranties of any kind.

© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden.
For more information, contact Dell.



Introduction
This white paper describes the benefits and challenges of leveraging big data analytics in an enterprise environment.

The white paper begins with a holistic view of business process phases and highlights ways in which analytics can improve
business operational efficiency, drive higher returns from existing or new investments, and help business leaders make rapid
adjustments to business strategy in response to shifting market trends and customer demands.

The white paper continues with a case study of how big data analytics helps information technology (IT) departments run
information systems more efficiently and with little or no downtime.

Lastly, the paper introduces the Dell | Hadoop solutions and presents several best practices for deploying and using Hadoop in
the enterprise.


Audience
Dell intends this white paper for anyone in the business or IT community who wants to learn about the advantages and
challenges of implementing and using big data analytics solutions (like Hadoop) in a production environment. Readers should
be familiar with general concepts of business process design and implementation and also with the correlation between
business processes and IT practices.


The value of big data analytics in the enterprise
Business processes define the way business activities are performed, the expected set of inputs, and the desired outcomes.
Business processes often integrate business units, workgroups, infrastructures, business partners, etc. to achieve key
performance goals (e.g. strategy, operations, functionality). Business process adjustments and improvements are expected as
the company attempts to improve its operations or to create a competitive advantage. Business process maturity and
execution excellence are core competencies of any modern company. Switching from last decade’s product-centric
business model to today’s customer-driven model requires reengineering the business processes (e.g. just-in-time business
intelligence) along with deeper collaboration among departments. [2]

Enterprise business processes relate to the cross-functional management of work activities across the boundaries of the
various departments (or business units) within a large enterprise. Controlling the sequence of work activities (and the
corresponding information flow) while delivering on customers’ needs is fundamental to the successful implementation and
execution of a business process. Because of this intrinsic complexity, enterprises are taking a process-centric approach to
designing, planning, monitoring, and automating business operations. One such approach stands out: Business Process
Management (BPM).

“BPM is a holistic management approach focused on aligning all aspects of an organization with the wants and needs of
clients. It promotes business effectiveness and efficiency while striving for innovation, flexibility, and integration with
technology. BPM attempts to improve processes continuously.”[3]





The main BPM phases (Figure 1) and their respective owners are:
    1. Vision—Functional leads create the strategic goals for the organization. The vision can be based
        on market research, internal strategy objectives, etc.

    2.    Design & Simulate—Design leads in the organization work to identify existing processes or to design “to-be”
          processes. The result is a theoretical model that is tested against combinations of variables. Special consideration is
          given to “what if” scenarios. The aim of this step is to ensure that the design delivers the key performance goals
          established in the Vision phase.

    3.   Implement—The theoretical design is adopted within the organization. A high degree of automation and integration
          are two key ingredients for successful implementation. Other key elements may be personnel training, user
          documentation, streamlined support, etc.

    4.   Execute—The process is fully adopted within the organization. Measures and procedures are put in place to enable
          the organization to investigate/monitor the execution of the process and test it against the established performance
          objectives. An example of such measures and procedures is what Gartner defines as Business Activity Monitoring
          (BAM) [4].

    5.   Monitor & Optimize—The process is monitored against performance objectives. Actual performance statistics are
          gathered and analyzed. An example of such a statistic is the measure of how quickly an online order is processed and
          sent for shipping. In addition, these statistics can be used to work with other organizations to improve their
          connected processes. The highest possible degree of automation can help tremendously: automation can cut costs,
          save time, add value, and eventually lead to competitive advantage. Process Mining [5] is a collection of tools and
          methods related to process monitoring.

Figure 1: Business Process Management (BPM) Phases


How can analytics help the business?
In 2005, Gartner released a thought-provoking study about combining business intelligence with a business process platform.
The result is what Gartner calls an environment in which processes are self-configuring and driven by clients or transactions.
The real challenge with such an endeavor is mapping business rules to intelligent processes that, by definition, need to be
self-configuring and transaction-driven.





Recent advancements in high-volume data management technologies and data analysis algorithms make the mapping from
business rules to intelligent processes plausible. First, analytics enable flow automation and monitoring. Second, removal of
manual steps helps improve process reliability and efficiency. Third, analytics can become one of the driving factors for
continuous optimization of business processes in the enterprise.

In conclusion, analytics can be the foundation for the environment that Gartner had envisioned (Figure 2).

Embedding analytics into the process lifecycle has tremendous benefits.

For example, during the Vision phase, functional leads need to understand market trends, customer behavior, internal
business challenges, etc. Being able to comb through treasure troves of data quickly and pick out the right signals impacts
the long-term profitability of the business.

Reliance on analytics during the Design & Simulate phase helps the designers rule out suboptimal designs.

During the Execute and Monitor & Optimize phases, analytics can provide automation, ongoing performance evaluation, and
decision-making.

Why can analytics be the foundation of business processes? Although analytics use cases vary between BPM phases, they all
seem to answer the same basic questions: What happened? Why did it happen? Will it happen again? This convergence
should be expected. In biology, convergent evolution is a powerful explanatory paradigm. [1] “Convergent evolution
describes the acquisition of the same biological trait in unrelated lineages. The wing is a classic example. Although their last
common ancestor did not have wings, birds and bats do.” [7] A similar phenomenon is occurring in the business analytics
world: although different questions demand different answers, the algorithms that generate the answers are fairly similar.

Figure 2: BPM + Analytics Environment

The different use cases are converging into three categories of analytics [6] (Figure 3):
    1.   Reporting Analytics processes historical data for the purpose of reporting statistical information and interpreting
         the insights identified by analyzing the data

    2.   Forecast Analytics begins with a summary constructed from historical data and defines scenarios for better outcomes
         (“Model & Simulate”)

    3.   Predictive Analytics encompasses the previous two categories and adds strategic decision-making.

Reporting Analytics helps analysts characterize the performance of a process instance by aggregating historical data and
presenting it in a human-readable form (e.g. spreadsheets, dashboards, etc.). Business analysts use Reporting Analytics
to compare measured performance against objectives—and only to understand the process. The intelligence gathered from
Reporting Analytics cannot be used to influence process optimizations or to adjust the overall strategy. Process tuning and
strategy adjustments are the subjects of the next two types of analytics, respectively.








Figure 3: Business Analytics Categories

Forecast Analytics uses data mining algorithms to process historical data (“Report“ in Figure 3) and derive insights of statistical
significance. These insights are then used to define a set of rules, called “forecast models,” which are expressed as
mathematical formulas. The models are iterated (“Model & Simulate” in Figure 3) until the model with the best outcome wins.
Forecast Analytics helps analysts optimize the process within prescribed boundaries. Practitioners can tune the process, for
example by adopting automation, which is fundamentally the first step toward intelligent processes. Forecast Analytics’
primary role is to influence the optimizations needed to tune a process; however, it doesn’t provide the analyst with the
insights needed to make strategy adjustments.

Predictive Analytics offers the greatest opportunity to influence the strategy from which business objectives will be born.
Predictive Analytics begins with historical facts, takes data mining into consideration, and fast-tracks the definition and
validation of forecast models. Predictive Analytics looks at the strategy and its derived processes holistically (“Predict” in
Figure 3).

Let’s look at an example. We’ll consider the case of a home improvement company. Historical data indicates that ant killer
sells very well across southern U.S. during summer months. Historical data also indicates that shelf inventory sells very slowly
and at deep discounts after Labor Day. This year the company wants to make sure there is no shelf inventory come Labor Day.
Also the ant killer manufacturer has announced a new product that combines the ant killer with a lawn fertilizer. How can
analytics help?

First, the company needs to start with Reporting Analytics to understand factors like volume of sales per month, geo-
distribution across the region, sales volume for each sales representative, and discounts after Labor Day. Second, the company
needs to consider Forecast Analytics to simulate various sales scenarios and choose the one that meets the strategic
criterion—no inventory left come Labor Day. The results may include: accelerate sales in July and August using coupons, hire
more sales representatives to “push” the inventory quicker, etc. Third, the company needs to use Predictive Analytics to
identify the best strategy for selling the new product. Contributing factors to the new strategy may be not only the ant killer
sales figures but also information like excessive drought zones (in these areas homeowners need both bug killers and
fertilizers to keep their lawns bug free and beautiful during summer months), single-home ownership rates, demographic
characteristics, social networks, etc.

To summarize, the three categories of analytics build on each other. They all attack the same problem, though at different
levels and from different views. It all starts with historical data, which is what Reporting Analytics is concerned with.
Next comes Forecast Analytics, which has the power to influence the outcome of the interaction with the customer. Forecast
Analytics gives us a glimpse into the future, though a narrow one because it is based on limited insight. Predictive Analytics
truly opens the window into the future and lets us decide whether we like what we see.





Great, I understand it now! What about these exponentially growing volumes of data? Would analytics scale?
An emerging trend that is beginning to disrupt traditional analytics is the ever-increasing amount of mostly unstructured data
that organizations need to store and process. Tagged as big data, the term refers to the means by which an organization can
create, manipulate, store, and manage extremely large data sets (think tens of petabytes of data). Difficulties include capture,
storage, search, sharing, analytics, and visualization. This trend continues because of the benefits of working with larger and
larger datasets, which allow analysts to gain insights that were never possible before. [8] [10]

Big data analytics require technologies like MPP (massively parallel processing) to process large quantities of data. Examples of
organizations with large quantities of data are the oil and gas companies that gather huge amounts of geophysical data.

Two chief concerns of big data analytics are linear scalability and efficient data processing.[9] Nobody wants to start down the
big data analytics path and realize that in order to keep up with data growth the solution needs armies of administrators.

In short, leveraging big data analytics in the enterprise presents both benefits and challenges. From the holistic perspective,
big data analytics enable businesses to build processes that encompass a variety of value streams (customers, business
partners, internal operations, etc.). The technology offers a much broader set of data management and analysis capabilities
and data consumption models. For example, the ability to consolidate data at scale and increase visibility into it has been a
desire of the business community for years. Technologies like Hadoop finally make it possible. Businesses no longer need to
skimp on reporting and insights simply because the technology is not capable or is too expensive.


Case study: Using big data analytics to optimize/automate IT operations
                  Steve: “What was wrong with the server that crashed last week?”

                  Bill: “I don’t know. I rebooted it and it’s just fine. Perhaps the software crashed.”

Anyone who has been in IT operations has had the above dialog, some quite often. Today’s data centers generate
immense quantities of data, and the answer to the above question lies in IT’s ability to mine that data and uncover the chain
of events.

IT operations are a crucial aspect of most organizational operations. Companies rely on their information systems to run their
operations. IT must therefore keep high standards for assuring business continuity in spite of hardware or software glitches,
network connectivity disruptions, unreliable power systems, etc.

Effective IT operations require a balanced investment in both system data gathering and data analysis. Most IT operations
nowadays gather up-to-the-minute (or up-to-the-second, in some cases) logs from the servers, storage devices, network
components, the applications running on this infrastructure (e.g. the Linux system log), and even the power and cooling
components.

The data lifecycle (Figure 4) begins with the data being generated and collected. The vast majority of the collected data
consists of plain text files that have very little in common in the way the content is structured. Data can be stored in its original
format or it can be pre-processed and then stored. Pre-processing increases the value of the data by removing less significant
content. The data is then stored and made available for processing.

Processing of the data focuses mainly on two objectives:
       Extract the value (also called insights) from the data through the use of statistical analysis

        Make the results available for presentation in a format that readily communicates that value




Figure 4: Data Lifecycle
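To illustrate the pre-process and extract steps of the lifecycle, here is a minimal sketch in Java. The file names, the
filtering rule, and the “error rate” insight are hypothetical stand-ins for the kind of log processing described above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class LogLifecycle {
    public static void main(String[] args) throws IOException {
        // Generate/collect: read the raw log as it was gathered.
        List<String> raw = Files.readAllLines(Paths.get("server.log"));

        // Pre-process: keep only the significant content (warnings and errors)
        // before the data is stored for processing.
        List<String> significant = raw.stream()
                .filter(line -> line.contains("WARN") || line.contains("ERROR"))
                .collect(Collectors.toList());
        Files.write(Paths.get("server-preprocessed.log"), significant);

        // Process: extract a simple insight—the share of error events.
        long errors = significant.stream()
                .filter(line -> line.contains("ERROR")).count();
        double errorRate = raw.isEmpty() ? 0 : (double) errors / raw.size();

        // Present: report the result in a human-readable form.
        System.out.printf("Error rate: %.2f%%%n", 100 * errorRate);
    }
}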





The last phase of the data lifecycle is the presentation of the insights uncovered along the way. At this phase the data reaches
its maximum potential and has the biggest impact on the decisions derived from the analysis. In the broad spectrum of
options, presentation may mean a graphic rendering of the results (e.g. a pie chart) or simply bundling the results and shipping
them off to an application for further examination.

Big data analytics can help optimize/automate IT operations in several ways:
    Improve the quality of the control processes by embedding big data analytics in the control path
    Keep the system operating within set boundaries by being able to predict the future operational state of the system
    Minimize system downtime by avoiding predictable failures

Figure 5 illustrates an example of embedding analytics in the control loop of the data center management system.

As explained above, system components (hardware or software) generate metering data that is readily available on a system-
wide data bus. The analytics engine grabs the metering data from the data bus, processes it, and examines the results against
historical data (e.g. data that was gathered in a previous iteration). Next, the analytics engine computes the deviation and
compares it with the standard deviation defined in profiles. The analytics engine forwards the comparison results to the
intelligent controller, which, after evaluating the particular condition against pre-defined policies, issues control commands
back to the system.
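A minimal sketch of such a control loop is shown below, in Java. The metering values, the sliding-window history, the profile
threshold, and the “throttle” command are hypothetical placeholders for the profiles and policies described above:

import java.util.ArrayDeque;
import java.util.Deque;

public class AnalyticsControlLoop {
    private final Deque<Double> history = new ArrayDeque<>();
    private static final double PROFILE_DEVIATION = 2.0; // allowed deviation per profile
    private static final int WINDOW = 100;               // sliding window of history

    // Called for every metering sample grabbed from the data bus.
    public void onMeteringSample(double value) {
        // Examine the new sample against the historical data.
        double mean = history.stream()
                .mapToDouble(Double::doubleValue).average().orElse(value);
        double deviation = Math.abs(value - mean);

        // Compare the computed deviation with the profile and, if the
        // condition warrants it, issue a control command back to the system.
        if (deviation > PROFILE_DEVIATION) {
            issueControlCommand("throttle");
        }

        history.addLast(value);
        if (history.size() > WINDOW) {
            history.removeFirst();
        }
    }

    private void issueControlCommand(String command) {
        System.out.println("controller: " + command); // placeholder action
    }

    public static void main(String[] args) {
        AnalyticsControlLoop loop = new AnalyticsControlLoop();
        for (double sample : new double[]{21.0, 21.5, 22.0, 29.5}) {
            loop.onMeteringSample(sample); // the last sample trips the policy
        }
    }
}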




Figure 5: Embedding Analytics in Automated System Control

The control system described above allows IT managers to rethink the operational efficiency of the data center. By harnessing
the power of sophisticated analytics, the system’s response can be correlated in a timely manner with the control stimuli and
external factors over a broad spectrum of conditions and application workloads. IT managers can optimize the system for the
supply side (e.g. utilities), for the demand side (e.g. software applications, business processes, etc.), or for both. The long-term
payoffs should outweigh the cost of analytics.





Big data analytics challenges in the enterprise
The adoption of big data analytics in the enterprise can deliver huge benefits, but it also presents equally important
challenges. A few examples are in order:
      An inability to share/correlate knowledge (data and algorithms) across organizational boundaries impacts the
       bottom line.
       As mentioned above, analytics are converging. Two or more business units may be working on a similar set of
       challenges. With no leveraged knowledge among them, each business unit will duplicate efforts only to discover similar
       solutions. Sharing the value of big data underpins substantial productivity gains and accelerates innovation.
      Data is locked in many disparate data marts.
       This is not necessarily a new challenge; it has existed since the early days of enterprise databases, when two or
        more departments could not agree on a common set of requirements and decided to go their own ways and build
        separate data stores. The advent of big data exacerbates the age-old dispute—the sheer volume of data demands
        even more data marts in which to store it. Big data mitigates this challenge by leveraging technologies that are built from the
       ground up to be scalable and schema-agnostic.
      Traditional enterprise IT processes (e.g. user authentication and authorization) don’t scale with big data.
       Not being able to enforce and audit access controls against huge quantities of data leaves the enterprise open to
       unauthorized access and theft of the intellectual property.


The adoption of Hadoop technology
Hadoop has become the most widely known big data technology implementation. The rise of Hadoop proved to be
unstoppable. There is a very vibrant community around Hadoop. Venture capitalists are pouring money into startups much like
we saw back in 2000 with Internet companies. Many of these startups began as academic research projects. Customer
demand eventually brings them into the mainstream marketplace where they start competing with more established
providers.

On the receiving end of the market, businesses are picking up the pace at which they deploy Hadoop as they realize that
data management, processing, and consumption are emerging as key challenges.

The wide adoption of Hadoop is hindered by both socio-business and technical factors.

Examples of socio-business factors are:
    Hiring
     Just like with any high-end niche technology, the emergence of Hadoop requires bleeding-edge data analytics design,
     processing, and visualization skills. For example, the Hadoop MapReduce API is more complex than SQL. Managing
     Hadoop deployments is equally complex. These skill sets are in short supply, thus slowing down the adoption of the
      technology. Hiring will get easier as the tools and the underlying technology improve.
      Confusion among vendors as well as buyers
       The rapidly changing market landscape makes it difficult for technology innovators to forecast resource allocation and
       maximize their returns on investments. Buyers are equally confused because they need more information about the
       actual business value of the technology and about the costs and the characteristics of successful deployments.
       Companies like Dell are taking a customer-centric approach. They work directly with customers and vendors to ease
       the adoption of the technology by providing end-to-end Hadoop solutions and business value metrics, all wrapped in
       strong services and consulting offerings.

      The “checkbox” mentality and the genesis of a new form of vendor lock-in
        Traditional enterprises demand that their IT organizations secure support contracts for all their software applications. The
        “checkbox” mentality is one in which support is provided so IT can mark off the appropriate checkbox. Yet businesses
        realize that true opportunities to improve the bottom line come from a deeper understanding of their internal
        processes; thus demand for big data is rapidly increasing. That leaves IT with only one option: choose one of
        many competing vendors. Because of the fierce competition, the chosen vendor will try to lock in as
        much functionality as possible. The answer is a leveraged approach: use open source as much as possible and pay only
       for the support that is deemed absolutely necessary. Look for vendors that offer both open-source and commercial
        versions of the technology needed. A different, yet longer-term, answer is standardization (e.g. of the APIs, the data
        models, the algorithms, etc.).



Hadoop technical strengths and weaknesses
Hadoop has been designed from the ground up for seamless scalability and massively parallel compute and storage. Hadoop
has been optimized for high aggregated data throughput (as opposed to query latency). The real power of Hadoop is in the
number of compute nodes in the cluster rather than in the compute and storage capacity of each individual node.

Hadoop’s strengths are:
   It is highly scalable—Yahoo runs Hadoop on thousands of nodes
   It integrates storage and compute—the data is processed right where it is stored
   It supports a broad range of data formats (CSV, XML, XSL, GIF, JPEG, SAM, BAM, TXT, JSON, etc.).
   Data doesn’t have to be “normalized” before it is stored in Hadoop.

Examples of Hadoop’s weaknesses are:
    Security—Hadoop has a fairly incoherent security design. Data access controls are implemented at the lowest level of
     the stack (the file system on each compute node). Also there is no binding between data access and job access models.
    Advanced IT operations and developer skills are required.
    Lack of enterprise hardening—the NameNode is a single point of failure.


Dell | Hadoop solutions
The Dell | Hadoop solutions lower the barrier to adoption for businesses looking to use Hadoop in production. Dell’s
customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions running on
commodity hardware. Dell provides all the hardware and software components and resources to meet the customer’s
requirements and no other supplier need be involved.

The hardware platforms for the Dell | Hadoop solutions (Figure 6) are the Dell™ PowerEdge™ C Series and Dell™
PowerEdge™ R Series. Dell PowerEdge C Series servers are focused on hyperscale and cloud capabilities. Rather than
emphasizing gigahertz and gigabytes, these servers deliver maximum density, memory, and serviceability while minimizing
total cost of ownership. It’s all about getting the processing customers need in the least amount of space and in an energy-
efficient package that slashes operational costs. Dell PowerEdge R Series servers are widely popular with a variety of
customers for their ease of management, virtually tool-less serviceability, power and thermal efficiency, and customer-inspired
designs. Dell PowerEdge R Series servers are multi-purpose platforms designed to support multiple usage models/workloads
for customers who want to minimize differing hardware product types in their environments.

The operating system of choice for the Dell | Hadoop solutions is Linux (e.g. Red Hat Enterprise Linux, CentOS, etc.). The
recommended Java Virtual Machine (JVM) is the Oracle (Sun) JVM.

The hardware platforms, the operating system, and the Java Virtual Machine make up the foundation on which the Hadoop
software stack runs.




Figure 6: Dell | Hadoop Solution Taxonomy

The bottom layer of the Hadoop stack (Figure 6) comprises two frameworks:
    1. The Data Storage Framework (HDFS) is the filesystem that Hadoop uses to store data on the cluster nodes. Hadoop
        Distributed File System (HDFS) is a distributed, scalable, and portable filesystem.
    2. The Data Processing Framework (MapReduce) is a massively parallel compute framework inspired by Google’s
        MapReduce papers; a condensed example follows below.
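To give a flavor of the Data Processing Framework, below is a condensed version of the canonical WordCount example,
written against the org.apache.hadoop.mapreduce API. Mappers run on the nodes where the HDFS blocks reside and emit
(word, 1) pairs; reducers sum the counts per word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the input split stored locally on this node
            // and emit (word, 1) for every token.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the partial counts for each word and emit the total.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}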

The next layer of the stack in the Dell | Hadoop solution design is the network layer. Dell recommends implementing the Hadoop
cluster on a dedicated network for two reasons:
     1. Dell provides network design blueprints that have been tested and qualified.
     2. Network performance predictability—sharing the network with other applications may have a detrimental impact on
         the performance of the Hadoop jobs.

The next two frameworks—the Data Access Framework and the Data Orchestration Framework—comprise utilities that are
part of the Hadoop ecosystem.

Dell listened to its customers and designed a Hadoop solution that stands out in the marketplace. Dell’s end-to-end
solution approach means that the customer can be in production with Hadoop in the shortest time possible. The Dell | Hadoop
solutions embody all the software functions and services needed to run Hadoop in a production environment. The customer
is not left wondering, “What else is missing?” One of Dell’s chief contributions to Hadoop is a method to rapidly deploy and
integrate Hadoop in production. Other major contributions include integrated backup, management, and security functions.
These complementary functions are designed and implemented side by side with the core Hadoop technology.

Installing and configuring Hadoop is non-trivial. There are different roles and configurations that need to be deployed on the
various nodes. Designing, deploying, and optimizing the network layer to match Hadoop’s scalability requires careful planning
and consideration of the type of workloads that will be running on the Hadoop cluster. The deployment mechanism that Dell
designed for Hadoop automates the deployment of the cluster from “bare-metal” (no operating system installed) all the way
to installing and configuring the Hadoop software components to specific customer requirements. Intermediary steps include
system BIOS update and configuration, RAID/SAS configuration, operating system deployment, Hadoop software deployment,
Hadoop software configuration, and integration with the customer’s data center applications (e.g. monitoring and alerting).

Data backup and recovery is another topic that was brought up during customer roundtables. As Hadoop becomes the de
facto platform for business-critical applications, the data that is stored in Hadoop is crucial for ensuring business continuity.
Dell’s approach is to offer several enterprise-grade backup solutions and let the customer choose.

Customers also commented on the current security model of Hadoop. It is a real concern because as a larger number of
business users share access to exponentially increasing volumes of data, the security designs and practices need to evolve to
accommodate the scale and the risks involved. Regulators and standards bodies (HIPAA, Sarbanes-Oxley, SAS 70, the PCI
Security Standards Council) may also take an interest in data stored in Hadoop. Particularly in industries like healthcare and
financial services, access to the data has
to be enforced and monitored across the entire stack. Unfortunately, there is no clear answer on how the security
architecture of Hadoop is going to evolve. Dell’s approach is to educate the customer and also work directly with leading
vendors to deliver a model that suits the enterprise.

Lastly, Dell’s open, integrated approach to enterprise-wide systems management enables customers to build comprehensive
system management solutions based on open standards and integrated with industry-leading partners. Instead of building a
patchwork of solutions leading to systems management sprawl, Dell integrates the management of the Dell hardware running
the Hadoop cluster with the “traditional” Hadoop management consoles (Ganglia, Nagios).

To summarize, Dell is adding Hadoop to its data analytics solutions portfolio. Dell’s end-to-end solution approach means that
Dell will provide readily available software interfaces for integration between the solutions in the portfolio. Dell will provide the
ETL connector (Figure 6) that integrates Hadoop with the Dell | Aster Data solution.





Dell | Hadoop for the enterprise
In this section we introduce several best practices for deploying and running Hadoop in an enterprise environment:
     Hardware selection
     Integrating Hadoop with Enterprise Data Warehouse (data models, data governance, design optimization)
     Data security
     Backup and recovery

The focus of this paper is only an introduction to, and a high-level overview of, these best practices. Our goal is to raise
awareness among enterprise practitioners and help them create successful Hadoop-based designs. The implementation
details will be presented in an upcoming white paper, titled Hadoop Enterprise How-To, published in the same series.

The inherent challenge with recommendations for Hadoop in the enterprise is that there is not a lot of published research to
draw on. Dell, however, has a very strong practice of defining and implementing best practices for its enterprise customers,
so we took a different approach: we began with a gap analysis of Hadoop and drew on our enterprise practice to derive the
recommendations that are likely to have the most profound impact on building Hadoop solutions for the enterprise.

As mentioned above, we intentionally left the details for additional white papers because we did not want to run the risk of
making this high-level outline overly complex and thus failing to meet the original goal, which is to raise awareness.

Let’s now look at what it takes to run Hadoop in the enterprise.

First off, we’ve been using clustering technologies like HPCC in the enterprise for years. How is Hadoop different from
HPCC?
The main difference between high-performance computing (HPC) and Hadoop is in the way the compute nodes in the
cluster access the data that they need to process. Traditional HPC architectures employ a shared-disk setup—all compute
nodes process data loaded in a shared network storage pool. Network latency and disk bandwidth become the critical factors
for HPC job performance. Therefore, low-latency network technologies (like InfiniBand) are commonly deployed in HPC.
Hadoop uses a shared-nothing architecture—data is distributed and copied locally onto each compute node. Hadoop does
not need a low-latency network; therefore, using cheaper Ethernet networks for Hadoop clusters is the common practice in
the vast majority of Hadoop deployments. [11]

Got it! Let’s now look at the hardware. Is there anything I should be concerned with?
The quick answer is YES. First and foremost, standardization is key: using the same server platform for all Hadoop nodes can
save considerable money and allow for faster deployments. Other best practices for hardware selection include:
    Use commodity hardware—Commodity hardware can be re-assigned between applications as needed. Specialized
      hardware cannot be moved that easily.
    Purchase full racks—Hadoop scales very well with the number of racks, so why not let Dell do the rack-and-stack and
       wheel in the full rack?
    Abstract the network and naming—Any IP addressing scheme, no matter how complex or laborious, can scale to only a
      few hundred nodes. Using DNS and CNAMEs scales much better.

Okay, I got the racks in production. How do I exchange data between Hadoop and my data marts?
The answer varies depending on who is asking the question.

To an IT architect, this is a typical system integration challenge. That is, there are two systems (Hadoop and the data mart) that
need to be integrated with each other. For example, the IT architect would have to design the network connectivity between
the two systems. Figure 7 illustrates a possible network connectivity design.








Figure 7: Example of Network Connectivity between a Hadoop Cluster and a Data Mart

To a data analyst, this is a data pipeline design challenge (Figure 8). The analyst’s chief concerns are data formatting,
availability of data for processing and analysis, query performance, etc. The data analyst doesn’t need to know the topology
of the network connectivity between the Hadoop cluster and the particular data mart.

The difference between the two perspectives could hardly be greater.

The solution is a mix of IT best practices and database administration best practices. The details are covered in an upcoming
white paper, titled Integrating Hadoop and Data Warehouse, published in this same series of papers.




Figure 8: Example of Data Pipeline between Hadoop and Data Warehouse





Great, I now have data in Hadoop! How should I secure the access to it?
Out of all the technical challenges that Hadoop exhibits, the security model is likely to be the biggest obstacle for the
adoption of Hadoop in the enterprise. Hadoop relies on Linux user permissions for data access. These user permissions are
enforced only at the lowest level of the stack (the HDFS layer on each compute node) instead of being checked and enforced
at the metadata layer (the NameNode) or higher. Jobs use the same user ID to access the data stored in Hadoop. A person
skilled in the art can mount a man-in-the-middle or denial-of-service attack.

It should be noted that both Yahoo and Cloudera are making intense efforts to bring Hadoop’s security in line with enterprise requirements.
Meanwhile, the security best practices include:
     Ensure strong perimeter security—for example, use strong authentication and encryption for all network access to the
       Hadoop cluster.
      Use Kerberos inside Hadoop for user authentication and authorization (see the sketch after this list).
     If purchasing support from Cloudera is an option, use Cloudera Enterprise to streamline the management of the
       security functions across all the machines in the cluster.
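As an illustration of the Kerberos recommendation in the list above, here is a minimal sketch of a client authenticating to a
Kerberos-secured cluster with Hadoop’s UserGroupInformation API. The principal name and keytab path are hypothetical
and come from your own KDC setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosClient {
    public static void main(String[] args) throws Exception {
        // Tell the Hadoop client libraries that the cluster uses Kerberos.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in with a keytab so jobs can run unattended; the principal
        // and keytab below are hypothetical.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        // All subsequent HDFS access is performed as the authenticated user.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Authenticated; home directory: " + fs.getHomeDirectory());
    }
}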

Great, I’ll pay close attention to security! Last question: how do I back up the data in Hadoop?
Again, it depends on who is asking.

The IT administrator would be concerned about backup policies, media management, etc.

The data analyst wants to make sure that the data has been saved in its entirety, which means that the backup solution needs
to be dataset-aware. A dataset may be composed of more than one file, and any file in Hadoop is broken down into a number
of blocks that are handed off to Hadoop nodes for storage. A file-aware (or, even worse, block-aware) backup solution will not
maintain the dataset metadata (the associations between files), which will render the restored dataset completely useless.

The intersection between the two views is the vision for Hadoop data backup. The best practices include:
    Decide where the data is backed up: NAS, SAN, cloud, or another Hadoop cluster. While using the cloud for backing up
       the data makes perfect sense, most enterprises tend to keep the data private within the corporate firewall. Saving the
       data to another Hadoop cluster also makes sense; however, the destination Hadoop cluster will need a backup solution
       of its own. Realistically, that leaves only two options for backup: NAS and SAN. If the backup needs only capacity and
       average performance is acceptable, then the answer is NAS. For best-in-class performance and uninterrupted-access
       requirements, the answer is SAN.
    Dedupe your data.
    Prioritize your data—back up only the data that is deemed valuable.
    Add dataset metadata awareness to the backup (see the sketch after this list).
    Establish backup policies for both metadata and actual data.
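To show what dataset metadata awareness could look like, here is a minimal sketch that records every file of a dataset,
together with its length and checksum, in a single manifest, so the dataset can be verified and restored as a unit. The dataset
path and manifest name are hypothetical:

import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatasetManifest {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dataset = new Path("/data/sales-2011q2"); // hypothetical dataset directory

        try (PrintWriter manifest = new PrintWriter("sales-2011q2.manifest")) {
            // One line per member file: path, size, and checksum. The backup
            // is valid only if every file listed here is restored together.
            for (FileStatus file : fs.listStatus(dataset)) {
                manifest.printf("%s\t%d\t%s%n",
                        file.getPath(), file.getLen(),
                        fs.getFileChecksum(file.getPath()));
            }
        }
    }
}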

Great, thanks, that makes sense! What do I do if I have questions?
First, please don’t hesitate to contact the author—contact information is provided below. Second, Dell offers a broad variety of
consulting, support, and training services for Hadoop. Your Dell sales representative can put you in touch with the Dell
Services team.





About the author
Aurelian “A.D.” Dumitru is the Dell | Hadoop chief architect. In that role he is responsible for all architecture decisions and
long-term strategy for Hadoop. A.D. has over 20 years of experience. He has been with Dell for more than 11 years in various
engineering, architecture, and management positions. His background is in hyperscale massively parallel compute systems.
His interests are in automated process control, intelligent processes, and machine learning. Over the years he has authored or
made significant contributions to more than 20 patent applications, from RFID and automated process controls to software
security and mathematical algorithms. For similar topics, please visit his personal blog at www.RationalIntelligence.com.


Special thanks
The author wishes to thank Nicholas Wakou, Howard Golden, Thomas Masson, Lee Zaretsky, Joey Jablonski, Scott Jensen,
John Igoe, and Matthew McCarthy for their helpful comments.


About Dell Next Generation Computing Solutions
When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next
Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges—from compute
and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune your company’s
“factory” for maximum performance and efficiency.

Dell’s Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the
needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while allowing
scalability as your company grows.

Deployment and support are tailored to your unique operational requirements. Dell’s Cloud Computing Solutions can help
you minimize the tangible operating costs that have a hyperscale impact on your business results.


Hadoop ecosystem component “decoder ring”
    1.  Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application
        data
    2. MapReduce: a software framework for distributed processing of large data sets on compute clusters
    3. Avro: a data serialization system
    4. Chukwa: a data collection system for managing large distributed systems
    5. HBase: a scalable, distributed database that supports structured data storage for large tables
    6. Hive: a data warehouse infrastructure that provides data summarization and ad-hoc querying
    7. ZooKeeper: a high-performance coordination service for distributed applications
    8. Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis
        programs, coupled with infrastructure for evaluating those programs.
    9. Sqoop (from Cloudera): a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to
        connect to a database.
    10. Flume (from Cloudera): a distributed service for collecting, aggregating and moving large amounts of log data. Its
        architecture is based on streaming data flows.

    (Source: http://hadoop.apache.org/)





Bibliography
[1] Donald F. Ferguson et al., Enterprise Business Process Management—Architecture, Technology and Standards, Lecture
Notes in Computer Science 4102, 1–15, 2006

[2] Andrew Spanyi, Business Process Management (BPM) is a Team Sport: Play it to Win! Meghan-Kiffer Press, June 2003, ISBN
978-0929652023

[3] http://en.wikipedia.org/wiki/Business_process_management

[4] David W. McCoy, Business Activity Monitoring: Calm Before the Storm, Gartner 2002,
http://www.gartner.com/resources/105500/105562/105562.pdf

[5] http://en.wikipedia.org/wiki/Process_mining

[6] http://www.bpminstitute.org/articles/article/article/bringing-analytics-into-processes-using-business-rules.html

[7] http://en.wikipedia.org/wiki/Convergent_evolution

[8] http://en.wikipedia.org/wiki/Big_data

[9] http://www.asterdata.com/blog/2008/05/19/discovering-the-dimensions-of-scalability/

[10] McKinsey Global Institute, Big data: The next frontier for innovation, competition, and productivity, May 2011

[11] S. Krishnan et al., myHadoop—Hadoop-on-demand on Traditional HPC Resources, University of California at San Diego,
2010




     To learn more
     To learn more about Dell cloud solutions, contact your Dell
     representative or visit:
     www.dell.com/hadoop



©2011 Dell Inc. All rights reserved. Trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Specifications are
correct at date of publication but are subject to availability or change without notice at any time. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography.
Dell’s Terms and Conditions of Sales and Service apply and are available on request. Dell service offerings do not affect consumer’s statutory rights.

Dell, the DELL logo, the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.




 
Delilah
DelilahDelilah
Delilah
 
Mother
MotherMother
Mother
 
Compendium of Notified Ceiling Prices of Scheduled Drugs - 2015
Compendium of Notified Ceiling Prices of Scheduled Drugs - 2015 Compendium of Notified Ceiling Prices of Scheduled Drugs - 2015
Compendium of Notified Ceiling Prices of Scheduled Drugs - 2015
 

Semelhante a Hadoop Enterprise Readiness

Building the Agile Enterprise
Building the Agile EnterpriseBuilding the Agile Enterprise
Building the Agile EnterpriseSrini Koushik
 
Hadoop in the Enterprise
Hadoop in the EnterpriseHadoop in the Enterprise
Hadoop in the EnterpriseJoey Jablonski
 
What is DataOps_ - Bahaa Al Zubaidi.pdf
What is DataOps_ - Bahaa Al Zubaidi.pdfWhat is DataOps_ - Bahaa Al Zubaidi.pdf
What is DataOps_ - Bahaa Al Zubaidi.pdfBahaa Al Zubaidi
 
20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...
20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...
20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...Sadalit Van Buren
 
The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...
The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...
The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...FindWhitePapers
 
GRI Conference, 27 May, Peterschmitt - Learn About GRI Certified Software...
GRI Conference, 27 May, Peterschmitt - Learn  About  GRI  Certified  Software...GRI Conference, 27 May, Peterschmitt - Learn  About  GRI  Certified  Software...
GRI Conference, 27 May, Peterschmitt - Learn About GRI Certified Software...Global Reporting Initiative
 
3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final PresentationJames Chi
 
121211 depfac ulb_master_presentation_v5_1
121211 depfac ulb_master_presentation_v5_1121211 depfac ulb_master_presentation_v5_1
121211 depfac ulb_master_presentation_v5_1Thibaut De Vylder
 
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...Cloudera, Inc.
 
DBA Role Shift in a DevOps World
DBA Role Shift in a DevOps WorldDBA Role Shift in a DevOps World
DBA Role Shift in a DevOps WorldDatavail
 
2011 sap inside_track_eim_overview
2011 sap inside_track_eim_overview2011 sap inside_track_eim_overview
2011 sap inside_track_eim_overviewMichelle Crapo
 
3 reach new heights of operational effectiveness while simplifying it with or...
3 reach new heights of operational effectiveness while simplifying it with or...3 reach new heights of operational effectiveness while simplifying it with or...
3 reach new heights of operational effectiveness while simplifying it with or...Dr. Wilfred Lin (Ph.D.)
 
Using Dashboards to Monitor Project Performance - Is there a Practical Approach?
Using Dashboards to Monitor Project Performance - Is there a Practical Approach?Using Dashboards to Monitor Project Performance - Is there a Practical Approach?
Using Dashboards to Monitor Project Performance - Is there a Practical Approach?New Mexico Technology Council
 
Implementation demystification 10 keys to a successful p6 implementation wh...
Implementation demystification   10 keys to a successful p6 implementation wh...Implementation demystification   10 keys to a successful p6 implementation wh...
Implementation demystification 10 keys to a successful p6 implementation wh...p6academy
 
SAP CVN Supply Network Planning - Supply Planning Engine Selection
SAP CVN Supply Network Planning - Supply Planning Engine SelectionSAP CVN Supply Network Planning - Supply Planning Engine Selection
SAP CVN Supply Network Planning - Supply Planning Engine SelectionPlan4Demand
 
Open Source Management Conference -Bolzano
Open Source Management Conference -BolzanoOpen Source Management Conference -Bolzano
Open Source Management Conference -BolzanoJeffrey Hammond
 

Semelhante a Hadoop Enterprise Readiness (20)

Building the Agile Enterprise
Building the Agile EnterpriseBuilding the Agile Enterprise
Building the Agile Enterprise
 
Hadoop in the Enterprise
Hadoop in the EnterpriseHadoop in the Enterprise
Hadoop in the Enterprise
 
What is DataOps_ - Bahaa Al Zubaidi.pdf
What is DataOps_ - Bahaa Al Zubaidi.pdfWhat is DataOps_ - Bahaa Al Zubaidi.pdf
What is DataOps_ - Bahaa Al Zubaidi.pdf
 
Hadoop Business Cases
Hadoop Business CasesHadoop Business Cases
Hadoop Business Cases
 
Erp
ErpErp
Erp
 
20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...
20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...
20121018 The SharePoint Maturity Model - as presented 10/18/12 to the SharePo...
 
The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...
The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...
The Alignment-Focused Organization: Bridging the Gap Between Strategy and Exe...
 
GRI Conference, 27 May, Peterschmitt - Learn About GRI Certified Software...
GRI Conference, 27 May, Peterschmitt - Learn  About  GRI  Certified  Software...GRI Conference, 27 May, Peterschmitt - Learn  About  GRI  Certified  Software...
GRI Conference, 27 May, Peterschmitt - Learn About GRI Certified Software...
 
3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation3 Keys To Successful Master Data Management - Final Presentation
3 Keys To Successful Master Data Management - Final Presentation
 
121211 depfac ulb_master_presentation_v5_1
121211 depfac ulb_master_presentation_v5_1121211 depfac ulb_master_presentation_v5_1
121211 depfac ulb_master_presentation_v5_1
 
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
 
DBA Role Shift in a DevOps World
DBA Role Shift in a DevOps WorldDBA Role Shift in a DevOps World
DBA Role Shift in a DevOps World
 
2011 sap inside_track_eim_overview
2011 sap inside_track_eim_overview2011 sap inside_track_eim_overview
2011 sap inside_track_eim_overview
 
3 reach new heights of operational effectiveness while simplifying it with or...
3 reach new heights of operational effectiveness while simplifying it with or...3 reach new heights of operational effectiveness while simplifying it with or...
3 reach new heights of operational effectiveness while simplifying it with or...
 
Using Dashboards to Monitor Project Performance - Is there a Practical Approach?
Using Dashboards to Monitor Project Performance - Is there a Practical Approach?Using Dashboards to Monitor Project Performance - Is there a Practical Approach?
Using Dashboards to Monitor Project Performance - Is there a Practical Approach?
 
Child Wear Ea Blueprint V0.7
Child Wear Ea Blueprint V0.7Child Wear Ea Blueprint V0.7
Child Wear Ea Blueprint V0.7
 
Implementation demystification 10 keys to a successful p6 implementation wh...
Implementation demystification   10 keys to a successful p6 implementation wh...Implementation demystification   10 keys to a successful p6 implementation wh...
Implementation demystification 10 keys to a successful p6 implementation wh...
 
KeyedIn Solutions Intro
KeyedIn Solutions IntroKeyedIn Solutions Intro
KeyedIn Solutions Intro
 
SAP CVN Supply Network Planning - Supply Planning Engine Selection
SAP CVN Supply Network Planning - Supply Planning Engine SelectionSAP CVN Supply Network Planning - Supply Planning Engine Selection
SAP CVN Supply Network Planning - Supply Planning Engine Selection
 
Open Source Management Conference -Bolzano
Open Source Management Conference -BolzanoOpen Source Management Conference -Bolzano
Open Source Management Conference -Bolzano
 

Último

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Último (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Hadoop Enterprise Readiness

Audience
Dell intends this white paper for anyone in the business or IT community who wants to learn about the advantages and challenges of implementing and using big data analytics solutions (such as Hadoop) in a production environment. Readers should be familiar with general concepts of business process design and implementation, and with the correlation between business processes and IT practices.

The value of big data analytics in the enterprise
Business processes define the way business activities are performed, the expected set of inputs, and the desired outcomes. Business processes often integrate business units, workgroups, infrastructures, business partners, and other parties to achieve key performance goals (i.e., strategy, operations, functionality). Business process adjustments and improvements are expected as the company attempts to improve its operations or to create a competitive advantage. Business process maturity and execution excellence are core competencies of any modern company. Switching from last decade's product-centric business model to today's customer-driven model requires reengineering of the business processes (e.g., just-in-time business intelligence) along with deeper collaboration among departments.[2]

Enterprise business processes relate to cross-functional management of work activities across the boundaries of the various departments (or business units) within a large enterprise. Controlling the sequence of work activities (and the corresponding information flow) while delivering to customers' needs is fundamental to the successful implementation and execution of a business process. Because of this intrinsic complexity, enterprises are taking a process-centric approach to designing, planning, monitoring, and automating business operations. One example of such an approach stands out: Business Process Management (BPM). "BPM is a holistic management approach focused on aligning all aspects of an organization with the wants and needs of clients. It promotes business effectiveness and efficiency while striving for innovation, flexibility, and integration with technology. BPM attempts to improve processes continuously."[3]
The main BPM phases (Figure 1: Business Process Management (BPM) Phases) and their respective owners are:

1. Vision—Functional leads in an organization create the strategic goals for the organization. The vision can be based on market research, internal strategy objectives, etc.
2. Design & Simulate—Design leads in the organization work to identify existing processes or to design "to-be" processes. The result is a theoretical model that is tested against combinations of variables. Special consideration is given to "what if" scenarios. The aim of this step is to ensure that the design delivers the key performance goals established in the Vision phase.
3. Implement—The theoretical design is adopted within the organization. A high degree of automation and integration are two key ingredients for successful implementation. Other key elements may be personnel training, user documentation, streamlined support, etc.
4. Execute—The process is fully adopted within the organization. Special measures and procedures are put in place to enable the organization to investigate and monitor the execution of the process and test it against established performance objectives. An example of such measures and procedures is what Gartner defines as Business Activity Monitoring (BAM).[4]
5. Monitor & Optimize—The process is monitored against performance objectives. Actual performance statistics are gathered and analyzed. An example of such a statistic is a measure of how quickly an online order is processed and sent for shipping. In addition, these statistics can be used to work with other organizations to improve their connected processes. The highest possible degree of automation helps tremendously: automation can cut costs, save time, add value, and eventually lead to competitive advantage. Process Mining [5] is a collection of tools and methods related to process monitoring.

How can analytics help the business?
In 2005 Gartner released a thought-provoking study about combining business intelligence with a business process platform. The result is what Gartner calls an environment in which processes are self-configuring and driven by clients or transactions. The real challenge with such an endeavor is mapping business rules to intelligent processes that, by definition, need to be self-configurable and transaction-driven.
Recent advancements in high-volume data management technologies and data analysis algorithms make the mapping from business rules to intelligent processes plausible. First, analytics enable flow automation and monitoring. Second, removal of manual steps helps improve process reliability and efficiency. Third, analytics can become one of the driving factors for continuous optimization of business processes in the enterprise. In conclusion, analytics can be the foundation for the environment that Gartner envisioned (Figure 2: BPM + Analytics Environment).

Embedding analytics into the process lifecycle has tremendous benefits. For example, during the Vision phase, functional leads need to understand market trends, customer behavior, internal business challenges, etc. Being able to comb through treasure troves of data quickly and pick out the right signals impacts the long-term profitability of the business. Reliance on analytics during the Design & Simulate phase helps the designers rule out suboptimal designs. During the Execute and Monitor & Optimize phases, analytics can provide automation, ongoing performance evaluation, and decision-making.

Why can analytics be the foundation of business processes?
Although analytics use cases vary across BPM phases, they all seem to answer the same basic questions: What happened? Why did it happen? Will it happen again? This convergence should be expected. In biology, convergent evolution is a powerful explanatory paradigm.[1] "Convergent evolution describes the acquisition of the same biological trait in unrelated lineages. The wing is a classic example. Although their last common ancestor did not have wings, birds and bats do." [7] A similar phenomenon is occurring in the business analytics world: although different questions demand different answers, the algorithms that generate the answers are fairly similar. The different use cases are converging into three categories of analytics [6] (Figure 3: Business Analytics Categories):

1. Reporting Analytics processes historic data for purposes of reporting statistical information and interpreting the insights identified by analyzing the data.
2. Forecast Analytics begins with a summary constructed from historic data and defines scenarios for better outcomes ("Model & Simulate").
3. Predictive Analytics encompasses the previous two categories and adds strategic decision-making.

Reporting Analytics helps analysts characterize the performance of a process instance by aggregating historical data and presenting it in a human-readable form (e.g., spreadsheets, dashboards, etc.). Business analysts use Reporting Analytics to compare measured performance against objectives. They use Reporting Analytics only to understand the process. The intelligence gathered from Reporting Analytics cannot, by itself, be used to influence process optimizations or to adjust the overall strategy. Process tuning and strategy adjustments are the subject of the next two types of analytics.
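To make the distinction concrete, the sketch below shows what a small Reporting Analytics step might look like in practice. It is a minimal, hypothetical example (the file name, column names, and revenue target are illustrative assumptions, not part of any Dell solution) that uses Python's pandas library to aggregate historical order data and compare it against an objective:

```python
# A minimal Reporting Analytics sketch (hypothetical data): aggregate
# historical order records into a human-readable monthly summary.
import pandas as pd

# Assumed input: orders.csv with columns order_id, order_date, region, amount
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Aggregate historical data: monthly order volume and revenue per region
report = (
    orders
    .assign(month=orders["order_date"].dt.to_period("M"))
    .groupby(["month", "region"])
    .agg(order_count=("order_id", "count"), revenue=("amount", "sum"))
)

# Compare measured performance against an objective (an assumed target)
report["met_target"] = report["revenue"] >= 100_000
print(report.to_string())  # or export to a spreadsheet/dashboard
```

Note that the output only describes what happened; deciding what to change next is the job of the other two categories.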
Figure 3: Business Analytics Categories

Forecast Analytics uses data mining algorithms to process historical data ("Report" in Figure 3) and derive insights of statistical significance. These insights are then used to define a set of rules (or mathematical models) called "forecast models," which are essentially mathematical formulas. These models are iterated ("Model & Simulate" in Figure 3) until the model with the best outcome wins. Forecast Analytics helps analysts optimize the process within prescribed boundaries. Practitioners can tune the process, for example by adopting automation, which is fundamentally the first step toward intelligent processes. Forecast Analytics' primary role is to influence the optimizations needed to tune a process; however, it doesn't provide the analyst with the insights needed to make strategy adjustments.

Predictive Analytics offers the greatest opportunity to influence the strategy from which business objectives will be born. Predictive Analytics begins with historic facts, takes data mining into consideration, and fast-tracks forecast model definition and validation. Predictive Analytics looks at the strategy and its derived processes holistically ("Predict" in Figure 3).

Let's look at an example. We'll consider the case of a home improvement company. Historical data indicates that ant killer sells very well across the southern U.S. during summer months. Historical data also indicates that shelf inventory sells very slowly and at deep discounts after Labor Day. This year the company wants to make sure there is no shelf inventory come Labor Day. Also, the ant killer manufacturer has announced a new product that combines the ant killer with a lawn fertilizer. How can analytics help? First, the company needs to start with Reporting Analytics to understand factors like volume of sales per month, geo-distribution across the region, sales volume for each sales representative, discounts after Labor Day, etc. Second, the company needs to use Forecast Analytics to simulate various sales scenarios and choose the one that meets the strategic criterion—no inventory left come Labor Day. The results may include: accelerate sales in July and August using coupons, hire more sales representatives to "push" the inventory quicker, etc. Third, the company needs to use Predictive Analytics to identify the best strategy for selling the new product. Contributing factors to the new strategy may include not only the ant killer sales figures but also information like excessive drought zones (in these areas homeowners need both bug killers and fertilizers to keep their lawns bug-free and beautiful during summer months), single-home ownership rates, demographic characteristics, social networks, etc.

To summarize, the three categories of analytics build on each other. They all attack the same problem, though they do it at different levels and take different views. It all starts with historical data, which is what Reporting Analytics is concerned with. Next comes Forecast Analytics, which has the power to influence the outcome of the interaction with the customer. Forecast Analytics shows us a glimpse into the future, though a very narrow one because it is based on limited insight. Predictive Analytics truly opens the window into the future and lets us choose whether we like what we see.
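The iterate-until-the-best-model-wins loop of the "Model & Simulate" step can be sketched in a few lines. The example below is purely illustrative for the ant killer scenario: the sales history, starting inventory, and scenario uplift factors are invented numbers, and a real forecast model would be far richer than a quadratic fit.

```python
# A toy "Model & Simulate" sketch (hypothetical numbers throughout):
# fit a simple demand model to historical monthly sales, then iterate
# over candidate scenarios and keep the one with the best outcome.
import numpy as np

# Assumed history: ant killer units sold in May..August of two prior years
months = np.array([5, 6, 7, 8, 5, 6, 7, 8])
units = np.array([900, 1400, 1600, 1100, 950, 1500, 1700, 1150])

# Forecast model: least-squares fit of units ~ a + b*month + c*month^2
coeffs = np.polyfit(months, units, deg=2)

def forecast(month):
    return np.polyval(coeffs, month)

starting_inventory = 6000
scenarios = {
    "baseline": 1.00,          # demand multiplier with no promotion
    "10%-off coupons": 1.15,   # assumed uplift from couponing
    "extra sales reps": 1.25,  # assumed uplift from added reps
}

# "Model & Simulate": iterate the scenarios; the best outcome wins
for name, uplift in scenarios.items():
    sold = sum(forecast(m) * uplift for m in (5, 6, 7, 8))
    leftover = max(0.0, starting_inventory - sold)
    print(f"{name:16s} -> projected leftover at Labor Day: {leftover:7.0f}")
```

The scenario with the smallest projected leftover meets the strategic criterion; in practice the model, the uplift estimates, and the candidate scenarios would all come from the data mining step described above.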
Great, I understand it now! What about these exponentially growing volumes of data? Will analytics scale?
An emerging trend that is beginning to disrupt traditional analytics is the ever-increasing amount of mostly unstructured data that organizations need to store and process. Tagged as big data, it refers to the means by which an organization can create, manipulate, store, and manage extremely large data sets (think tens of petabytes of data). Difficulties include capture, storage, search, sharing, analytics, and visualization. This trend continues because of the benefits of working with larger and larger datasets, which allow analysts to gain insights never possible before. [8] [10]

Big data analytics requires technologies like MPP (massively parallel processing) to process large quantities of data. Examples of organizations with large quantities of data are the oil and gas companies that gather huge amounts of geophysical data. Two chief concerns of big data analytics are linear scalability and efficient data processing.[9] Nobody wants to start down the big data analytics path and realize that, in order to keep up with data growth, the solution needs armies of administrators.

In short, leveraging big data analytics in the enterprise presents both benefits and challenges. From the holistic perspective, big data analytics enables businesses to build processes that encompass a variety of value streams (customers, business partners, internal operations, etc.). The technology offers a much broader set of data management and analysis capabilities and data consumption models. For example, the ability to consolidate data at scale and increase visibility into it has been a desire of the business community for years. Technologies like Hadoop finally make it possible. Businesses no longer need to skimp on reporting and insights simply because the technology is not capable or is too expensive.

Case study: Using big data analytics to optimize/automate IT operations
Steve: "What was wrong with the server that crashed last week?"
Bill: "I don't know. I rebooted it and it's just fine. Perhaps the software crashed."

Anyone who has been in IT operations has probably had the above dialog, sometimes quite often. Today's data centers generate immense quantities of data, and the answer to the above question lies in IT's ability to mine the data and uncover the chain of events. IT operations are a crucial aspect of most organizational operations. Companies rely on their information systems to run their operations. IT must therefore keep high standards for assuring business continuity in spite of hardware or software glitches, network connectivity disruptions, unreliable power systems, etc. Effective IT operations require a balanced investment in both system data gathering and data analysis. Most IT operations nowadays gather up-to-the-minute (or up-to-the-second, in some cases) logs from the servers, storage devices, network components, the applications running on this infrastructure (e.g., the Linux system log), and even the power and cooling components.

The data lifecycle (Figure 4: Data Lifecycle) begins with the data being generated and collected. The vast majority of the collected data consists of plain text files that have very little in common in the way the content is structured. Data can be stored in its original format, or it can be pre-processed and then stored. Pre-processing increases the value of the data by removing less significant content.
The data is then stored and made available for processing. Processing of the data is mainly focused on two activities:

• Extracting the value (also called insights) from the data through the use of statistical analysis
• Making the results available for presentation in a format that readily communicates the value
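As a concrete (and intentionally simplified) illustration of this kind of processing, the pair of scripts below sketches a Hadoop Streaming job that extracts one simple insight from raw syslog-style text: the count of error events per host. The log format, the field positions, and the HDFS paths are assumptions made for the example, not a prescribed schema.

```python
# mapper.py -- a minimal Hadoop Streaming mapper (illustrative only).
# Assumed input: syslog-style lines such as
#   "Jun 12 03:14:07 web-04 kernel: Out of memory: ..."
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 4 and "error" in line.lower():
        host = fields[3]     # assumed position of the hostname field
        print(f"{host}\t1")  # emit key<TAB>value for the reducer
```

```python
# reducer.py -- sums the per-host error counts emitted by the mapper.
# Hadoop Streaming delivers the mapper output sorted by key.
import sys

current_host, count = None, 0
for line in sys.stdin:
    host, value = line.rstrip("\n").split("\t")
    if host != current_host:
        if current_host is not None:
            print(f"{current_host}\t{count}")
        current_host, count = host, 0
    count += int(value)
if current_host is not None:
    print(f"{current_host}\t{count}")
```

A job like this would typically be launched with the Hadoop Streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input /logs/raw -output /logs/error-counts -mapper mapper.py -reducer reducer.py` (the jar location and input/output paths vary by installation).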
The last phase of the data lifecycle is the presentation of the insights uncovered along the way. At this phase, data reaches its maximum potential and has the biggest impact on the decisions derived from analysis. Across the broad spectrum of options, presentation may mean graphical presentation of the results (e.g., a pie chart) or simply bundling the results and shipping them off to an application for further examination.

Big data analytics can help optimize/automate IT operations in several ways:

• Improve the quality of the control processes by embedding big data analytics in the control path
• Keep the system operating within set boundaries by being able to predict the future operational state of the system
• Minimize system downtime by avoiding predictable failures

Figure 5 (Embedding Analytics in Automated System Control) illustrates an example of embedding analytics in the control loop of a data center management system. As explained above, system components (hardware or software) generate metering data that is readily available on a system-wide data bus. The analytics engine grabs the metering data from the data bus, processes it, and examines the results against historical data (i.e., data that was gathered in a previous iteration). Next, the analytics engine computes the deviation and compares it with the standard deviation defined in profiles. The analytics engine forwards the comparison results to the intelligent controller, which, after evaluating the particular condition against pre-defined policies, issues control commands back to the system.

The control system described above allows IT managers to rethink the operational efficiency of the data center. By harnessing the power of sophisticated analytics, the system's response can be correlated in a timely manner with the control stimuli and external factors over a broad spectrum of conditions and application workloads. IT managers can optimize the system for the supply side (e.g., utilities), for the demand side (e.g., software applications, business processes, etc.), or for both. The long-term payoffs should outweigh the cost of analytics.
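The deviation check at the heart of that loop can be sketched in a few lines. The example below is schematic only: the metric, readings, sigma threshold, and the data-bus/controller interfaces are hypothetical stand-ins for whatever a real analytics engine and intelligent controller would use.

```python
# A schematic sketch of the deviation check described above (all names,
# thresholds, and interfaces are hypothetical).
from statistics import mean, stdev

def check_metric(history, latest, allowed_sigmas=3.0):
    """Compare the latest reading against the profile built from history."""
    mu, sigma = mean(history), stdev(history)
    deviation = abs(latest - mu)
    return deviation > allowed_sigmas * sigma, deviation

# Example: inlet temperature readings (Celsius) taken from the metering bus
history = [21.0, 21.4, 20.9, 21.2, 21.1, 21.3, 21.0, 21.2]
latest = 27.8

out_of_bounds, deviation = check_metric(history, latest)
if out_of_bounds:
    # In a real system the intelligent controller would evaluate policies
    # and issue a control command (e.g., raise fan speed, migrate load).
    print(f"ALERT: deviation {deviation:.1f}C exceeds profile; notifying controller")
else:
    print("Within profile; no action needed")
```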
Big data analytics challenges in the enterprise
The adoption of big data analytics in the enterprise can deliver huge benefits, but it also presents equally important challenges. Examples are in order:

• An inability to share/correlate knowledge (data and algorithms) across organizational boundaries impacts the bottom line. As mentioned above, analytics are converging. Two or more business units may be working on a similar set of challenges. With no leveraged knowledge among them, each business unit will duplicate effort only to discover similar solutions. Sharing the value of big data underpins substantial productivity gains and accelerates innovation.
• Data is locked in many disparate data marts. This is not necessarily a new challenge; it has been seen since the early days of enterprise databases, when two or more departments could not agree on a common set of requirements and decided to go their own ways and build separate data stores. The advent of big data exacerbates the age-old dispute because the sheer volume of data means even more data marts are built to store it. Big data mitigates this challenge by leveraging technologies that are built from the ground up to be scalable and schema-agnostic.
• Traditional enterprise IT processes (e.g., user authentication and authorization) don't scale with big data. Not being able to enforce and audit access controls against huge quantities of data leaves the enterprise open to unauthorized access and theft of intellectual property.

The adoption of Hadoop technology
Hadoop has become the most widely known big data technology implementation, and its rise has proved unstoppable. There is a very vibrant community around Hadoop. Venture capitalists are pouring money into startups, much like we saw back in 2000 with Internet companies. Many of these startups begin as academic research projects; customer demand eventually brings them into the mainstream marketplace, where they start competing with more established providers. On the receiving end of the market, businesses are picking up the pace at which Hadoop is deployed as they realize that data management, processing, and consumption are emerging as key challenges.

The wide adoption of Hadoop is hindered by both socio-business and technical factors. Examples of socio-business factors are:

• Hiring—Just like any high-end niche technology, Hadoop requires bleeding-edge data analytics design, processing, and visualization skills. For example, the Hadoop MapReduce API is more complex than SQL, and managing Hadoop deployments is equally complex. These skill sets are in short supply, which slows the adoption of the technology. Hiring will get easier as the tools and the underlying technology improve.
• Confusion among vendors as well as buyers—The rapidly changing market landscape makes it difficult for technology innovators to forecast resource allocation and maximize their returns on investments. Buyers are equally confused because they need more information about the actual business value of the technology and about the costs and characteristics of successful deployments. Companies like Dell are taking a customer-centric approach.
They work directly with customers and vendors to ease the adoption of the technology by providing end-to-end Hadoop solutions and business value metrics, all wrapped in strong services and consulting offerings.
• The "checkbox" mentality and the genesis of a new form of vendor lock-in—Traditional enterprises require their IT organizations to secure support contracts for all their software applications. The "checkbox" mentality is one in which support is provided so IT can mark off the appropriate checkbox. Yet businesses realize that the true opportunities to improve the bottom line come from a deeper understanding of their internal processes, so demand for big data is rapidly increasing. That leaves IT with only one option: choose one from many competing vendors. Because of fierce competition among vendors, the chosen vendor will try to lock in as much functionality as possible. The answer is a leveraged approach: use open source as much as possible and pay only for the support that is deemed absolutely necessary. Look for vendors that offer both open-source and commercial versions of the technology needed. A different, longer-term answer is standardization (i.e., of the APIs, the data models, the algorithms, etc.).
Hadoop technical strengths and weaknesses
Hadoop has been designed from the ground up for seamless scalability and massively parallel compute and storage, and it has been optimized for high aggregate data throughput (as opposed to low query latency). The real power of Hadoop lies in the number of compute nodes in the cluster rather than in the compute and storage capacity of each individual node.

Hadoop's strengths are:

• It is highly scalable—Yahoo runs Hadoop on thousands of nodes.
• It integrates storage and compute—the data is processed right where it is stored.
• It supports a broad range of data formats (CSV, XML, XSL, GIF, JPEG, SAM, BAM, TXT, JSON, etc.).
• Data doesn't have to be "normalized" before it is stored in Hadoop.

Examples of Hadoop's weaknesses are:

• Security—Hadoop has a fairly incoherent security design. Data access controls are implemented at the lowest level of the stack (the file system on each compute node). Also, there is no binding between the data access and job access models.
• Advanced IT operations and developer skills are required.
• Lack of enterprise hardening—the NameNode is a single point of failure.

Dell | Hadoop solutions
The Dell | Hadoop solutions lower the barrier to adoption for businesses looking to use Hadoop in production. Dell's customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions running on commodity hardware. Dell provides all the hardware and software components and resources to meet the customer's requirements, and no other supplier need be involved.

The hardware platforms for the Dell | Hadoop solutions (Figure 6: Dell | Hadoop Solution Taxonomy) are the Dell™ PowerEdge™ C Series and Dell™ PowerEdge™ R Series. Dell PowerEdge C Series servers are focused on hyperscale and cloud capabilities. Rather than emphasizing gigahertz and gigabytes, these servers deliver maximum density, memory, and serviceability while minimizing total cost of ownership. It's all about getting the processing customers need in the least amount of space and in an energy-efficient package that slashes operational costs. Dell PowerEdge R Series servers are widely popular with a variety of customers for their ease of management, virtually tool-less serviceability, power and thermal efficiency, and customer-inspired designs. They are multi-purpose platforms designed to support multiple usage models/workloads for customers who want to minimize differing hardware product types in their environments.

The operating system of choice for the Dell | Hadoop solutions is Linux (e.g., Red Hat Enterprise Linux, CentOS, etc.). The recommended Java Virtual Machine (JVM) is the Oracle Sun JVM. The hardware platforms, the operating system, and the Java Virtual Machine make up the foundation on which the Hadoop software stack runs.
The bottom layer of the Hadoop stack (Figure 6) comprises two frameworks:

1. The Data Storage Framework (HDFS) is the filesystem that Hadoop uses to store data on the cluster nodes. The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem.
2. The Data Processing Framework (MapReduce) is a massively parallel compute framework inspired by Google's MapReduce papers.

The next layer of the stack in the Dell | Hadoop solution design is the network layer. Dell recommends implementing the Hadoop cluster on a dedicated network for two reasons:

1. Dell provides network design blueprints that have been tested and qualified.
2. Network performance predictability—sharing the network with other applications may have a detrimental impact on the performance of Hadoop jobs.

The next two frameworks—the Data Access Framework and the Data Orchestration Framework—comprise utilities that are part of the Hadoop ecosystem.

Dell listened to its customers and designed a Hadoop solution that is fairly unique in the marketplace. Dell's end-to-end solution approach means that the customer can be in production with Hadoop in the shortest time possible. The Dell | Hadoop solutions embody all the software functions and services needed to run Hadoop in a production environment. The customer is not left wondering, "What else is missing?" One of Dell's chief contributions to Hadoop is a method to rapidly deploy and integrate Hadoop in production. Other major contributions include integrated backup, management, and security functions. These complementary functions are designed and implemented side by side with the core Hadoop technology.

Installing and configuring Hadoop is non-trivial. There are different roles and configurations that need to be deployed on various nodes. Designing, deploying, and optimizing the network layer to match Hadoop's scalability requires a lot of thought, as well as consideration of the type of workloads that will be running on the Hadoop cluster. The deployment mechanism that Dell designed for Hadoop automates the deployment of the cluster from "bare metal" (no operating system installed) all the way to installing and configuring the Hadoop software components to specific customer requirements. Intermediary steps include system BIOS update and configuration, RAID/SAS configuration, operating system deployment, Hadoop software deployment, Hadoop software configuration, and integration with the customer's data center applications (e.g., monitoring and alerting).

Data backup and recovery is another topic that was brought up during customer roundtables. As Hadoop becomes the de facto platform for business-critical applications, the data stored in Hadoop is crucial for ensuring business continuity. Dell's approach is to offer several enterprise-grade backup solutions and let the customer choose.

Customers also commented on the current security model of Hadoop. It is a real concern: as a larger number of business users share access to exponentially increasing volumes of data, security designs and practices need to evolve to accommodate the scale and the risks involved. Regulations and standards such as HIPAA, Sarbanes-Oxley, SAS 70, and the PCI Security Standards Council's requirements may also apply to data stored in Hadoop. Particularly in industries like healthcare and financial services, access to the data has to be enforced and monitored across the entire stack.
Unfortunately, there is no clear answer on how the security architecture of Hadoop is going to evolve. Dell's approach is to educate the customer and to work directly with leading vendors to deliver a model that suits the enterprise.

Lastly, Dell's open, integrated approach to enterprise-wide systems management enables customers to build comprehensive systems management solutions based on open standards and integrated with industry-leading partners. Instead of building a patchwork of solutions leading to systems management sprawl, Dell integrates the management of the Dell hardware running the Hadoop cluster with the "traditional" Hadoop management consoles (Ganglia, Nagios).

To summarize, Dell is adding Hadoop to its data analytics solutions portfolio. Dell's end-to-end solution approach means that Dell will provide readily available software interfaces for integration between the solutions in the portfolio. Dell will provide the ETL connector (Figure 6) that integrates Hadoop with the Dell | Aster Data solution.
Dell | Hadoop for the enterprise
In this section we introduce several best practices for deploying and running Hadoop in an enterprise environment:

• Hardware selection
• Integrating Hadoop with the enterprise data warehouse (data models, data governance, design optimization)
• Data security
• Backup and recovery

The focus in this paper is only on an introduction and high-level overview of these best practices. Our goal is to raise awareness among enterprise practitioners and help them create successful Hadoop-based designs. We leave the implementation details to an upcoming white paper, titled Hadoop Enterprise How-To, published in the same series. The inherent challenge with recommendations for Hadoop in the enterprise is that there is not a lot of published research to draw on. Thus, we took a different approach: we began with a gap analysis of Hadoop and drew on Dell's very strong practice in defining and implementing best practices for its enterprise customers to derive the recommendations that are likely to have the most profound impact on building Hadoop solutions for the enterprise. As mentioned above, we intentionally left the details for additional white papers because we did not want to run the risk of making this high-level outline overly complex and thereby fail to meet the original goal, which was to raise awareness.

Let's now look at what it takes to run Hadoop in the enterprise.

First off, we've been using clustering technologies like HPCC in the enterprise for years. How is Hadoop different from HPCC?
The main difference between high-performance computing (HPC) and Hadoop is in the way the compute nodes in the cluster access the data that they need to process. Traditional HPC architectures employ a shared-disk setup—all compute nodes process data loaded into a shared network storage pool. Network latency and disk bandwidth become the critical factors for HPC job performance; therefore, low-latency network technologies (like InfiniBand) are commonly deployed in HPC. Hadoop uses a shared-nothing architecture—data is distributed and copied locally to each compute node. Hadoop does not need a low-latency network, so using cheaper Ethernet networks for Hadoop clusters is the common practice in the vast majority of Hadoop deployments. [11]

Got it! Let's now look at the hardware. Is there anything I should be concerned with?
The quick answer is yes. First and foremost, standardization is key. Using the same server platform for all Hadoop nodes can save considerable money and allows for faster deployments. Other best practices for hardware selection include:

• Use commodity hardware—Commodity hardware can be re-assigned between applications as needed. Specialized hardware cannot be moved that easily.
• Purchase full racks—Hadoop scales very well with the number of racks, so why not let Dell do the rack-and-stack and wheel in the full rack?
• Abstract the network and naming—Any IP addressing scheme, no matter how complex or laborious, can scale to only a few hundred nodes. Using DNS and CNAMEs scales much better.

Okay, I've got the racks in production. How do I exchange data between Hadoop and my data marts?
The answer varies depending on who is asking the question. To an IT architect, this is a typical system integration challenge: there are two systems (Hadoop and the data mart) that need to be integrated with each other.
Okay, I got the racks in production. How do I exchange data between Hadoop and my data marts?

The answer varies depending on who is asking the question. To an IT architect, this is a typical system integration challenge: there are two systems (Hadoop and the data mart) that need to be integrated with each other. For example, the IT architect would have to design the network connectivity between the two systems. Figure 7 illustrates a possible network connectivity design.
Figure 7: Example of Network Connectivity between a Hadoop Cluster and a Data Mart

To a data analyst, this is a data pipeline design challenge (Figure 8). The analyst’s chief concerns are data formatting, availability of data for processing and analysis, query performance, and so on; the analyst does not need to know the topology of the network connectivity between the Hadoop cluster and the particular data mart. The difference between the two perspectives could hardly be greater. The solution is a mix of IT best practices and database administration best practices; the details are covered in an upcoming white paper, titled Integrating Hadoop and Data Warehouse, published in this same series. A command-line sketch of one step in such a pipeline follows Figure 8.

Figure 8: Example of Data Pipeline between Hadoop and Data Warehouse
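As one hedged illustration of such a pipeline step, Sqoop (described in the component list at the end of this paper) can push a summarized result set from HDFS into a relational data mart over JDBC. The host, database, table, and directory below are hypothetical:

    # Illustrative Sqoop export; host, database, table, and paths are hypothetical.
    # Sqoop maps the files under --export-dir onto rows of the target table via JDBC.
    sqoop export \
      --connect jdbc:mysql://datamart.example.com/sales \
      --table daily_summary \
      --export-dir /user/etl/daily_summary \
      --username etl_user -P

The reverse direction (sqoop import) pulls relational tables into HDFS, which is typically how reference data reaches the Hadoop side of the pipeline.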
Great, I now have data in Hadoop! How should I secure access to it?

Of all the technical challenges that Hadoop exhibits, the security model is likely to be the biggest obstacle to the adoption of Hadoop in the enterprise. Hadoop relies on Linux user permissions for data access, but these permissions are enforced only at the lowest level of the stack (the HDFS layer on each compute node) instead of being checked and enforced at the metadata layer (the NameNode) or higher. Jobs use the same user ID to get access to data stored in Hadoop, and a skilled attacker can mount a man-in-the-middle or denial-of-service attack. It should be noted that both Yahoo and Cloudera are making intense efforts to bring Hadoop’s security in line with enterprise requirements. Meanwhile, security best practices include (a configuration sketch follows this list):

• Ensure strong perimeter security; for example, use strong authentication and encryption for all network access to the Hadoop cluster.
• Use Kerberos inside Hadoop for user authentication and authorization.
• If purchasing support from Cloudera is an option, use Cloudera Enterprise to streamline the management of the security functions across all the machines in the cluster.
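To show what the Kerberos recommendation touches in practice, the fragment below is a minimal core-site.xml sketch that switches Hadoop from the default “simple” (trusted user name) authentication to Kerberos. It assumes a working Kerberos realm and per-service principals and keytabs already exist; your distribution’s documentation should be the authority on the exact steps:

    <!-- Illustrative core-site.xml fragment: switches authentication from
         the default "simple" mode to Kerberos. Assumes a working Kerberos
         realm and per-service principals/keytabs are already provisioned. -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>

With authentication set to kerberos, users must hold a valid ticket (obtained with kinit) before submitting jobs or reading HDFS data.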
Great, I’ll pay close attention to security! Last question: how do I back up the data in Hadoop?

Again, it depends on who is asking. The IT administrator is concerned with backup policies, media management, and the like. The data analyst wants to make sure that the dataset has been saved in its entirety, which means the backup solution needs to be data-aware. A dataset may be composed of more than one file, and any file in Hadoop is broken down into a number of blocks that are handed off to Hadoop nodes for storage. A merely file-aware (or worse, block-aware) backup solution will not preserve the dataset metadata (the association rules between files), which renders the backed-up dataset useless. The intersection of the two views is the vision for Hadoop data backup. The best practices include (a brief copy sketch follows this list):

• Decide where the data is backed up: NAS, SAN, cloud, or another Hadoop cluster. While using the cloud for backup makes perfect sense, most enterprises tend to keep the data private within the corporate firewall. Saving the data to another Hadoop cluster also makes sense; however, the destination Hadoop cluster then needs a backup solution of its own. Realistically, that leaves two options: NAS and SAN. If the backup needs only volume and average performance is acceptable, the answer is NAS; for best-in-class performance and uninterrupted-access requirements, the answer is SAN.
• Dedupe your data.
• Prioritize your data: back up only the data that is deemed valuable.
• Add dataset metadata awareness to the backup.
• Establish backup policies for both the metadata and the actual data.
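As a minimal sketch of the cluster-to-cluster option, Hadoop’s built-in DistCp tool copies HDFS directories in parallel between clusters. The cluster names and paths below are hypothetical, and note that DistCp alone copies files, so the dataset-level metadata called out above still has to be captured separately:

    # Illustrative cluster-to-cluster copy with DistCp; hosts and paths
    # are hypothetical. DistCp runs as a MapReduce job, copying files
    # in parallel across the cluster.
    hadoop distcp \
      hdfs://prod-namenode:8020/user/etl/daily \
      hdfs://backup-namenode:8020/backups/etl/daily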
Great, thanks, that makes sense! What do I do if I have questions?

First, please don’t hesitate to contact the author; contact information is provided below. Second, Dell offers a broad variety of consulting, support, and training services for Hadoop. Your Dell sales representative can put you in touch with the Dell Services team.

About the author

Aurelian “A.D.” Dumitru is the Dell | Hadoop chief architect. In that role he is responsible for all architecture decisions and long-term strategy for Hadoop. A.D. has over 20 years of experience and has been with Dell for more than 11 years in various engineering, architecture, and management positions. His background is in hyperscale, massively parallel compute systems, and his interests are in automated process control, intelligent processes, and machine learning. Over the years he has authored or made significant contributions to more than 20 patent applications, ranging from RFID and automated process controls to software security and mathematical algorithms. For similar topics, please check his personal blog, www.RationalIntelligence.com.

Special thanks

The author wishes to thank Nicholas Wakou, Howard Golden, Thomas Masson, Lee Zaretsky, Joey Jablonski, Scott Jensen, John Igoe, and Matthew McCarthy for their helpful comments.

About Dell Next Generation Computing Solutions

When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges, from compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune your company’s “factory” for maximum performance and efficiency. Dell’s Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while allowing scalability as your company grows. Deployment and support are tailored to your unique operational requirements. Dell’s cloud computing solutions can help you minimize the tangible operating costs that have hyperscale impact on your business results.

Hadoop ecosystem component “decoder ring”

1. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
2. MapReduce: a software framework for distributed processing of large data sets on compute clusters
3. Avro: a data serialization system
4. Chukwa: a data collection system for managing large distributed systems
5. HBase: a scalable, distributed database that supports structured data storage for large tables
6. Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
7. ZooKeeper: a high-performance coordination service for distributed applications
8. Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
9. Sqoop (from Cloudera): a tool designed to import data from relational databases into Hadoop; Sqoop uses JDBC to connect to a database
10. Flume (from Cloudera): a distributed service for collecting, aggregating, and moving large amounts of log data; its architecture is based on streaming data flows

(Source: http://hadoop.apache.org/)

Bibliography

[1] Donald F. Ferguson et al., Enterprise Business Process Management—Architecture, Technology and Standards, Lecture Notes in Computer Science 4102, pp. 1–15, 2006.
[2] Andrew Spanyi, Business Process Management (BPM) is a Team Sport: Play it to Win!, Meghan-Kiffer Press, June 2003, ISBN 978-0929652023.
[3] http://en.wikipedia.org/wiki/Business_process_management
[4] David W. McCoy, Business Activity Monitoring: Calm Before the Storm, Gartner, 2002, http://www.gartner.com/resources/105500/105562/105562.pdf
[5] http://en.wikipedia.org/wiki/Process_mining
[6] http://www.bpminstitute.org/articles/article/article/bringing-analytics-into-processes-using-business-rules.html
[7] http://en.wikipedia.org/wiki/Convergent_evolution
[8] http://en.wikipedia.org/wiki/Big_data
[9] http://www.asterdata.com/blog/2008/05/19/discovering-the-dimensions-of-scalability/
[10] McKinsey Global Institute, Big Data: The Next Frontier for Innovation, Competition, and Productivity, May 2011.
[11] S. Krishnan et al., myHadoop—Hadoop-on-Demand on Traditional HPC Resources, University of California at San Diego, 2010.

To learn more

To learn more about Dell cloud solutions, contact your Dell representative or visit: www.dell.com/hadoop

©2011 Dell Inc. All rights reserved. Trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Specifications are correct at date of publication but are subject to availability or change without notice at any time. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dell’s Terms and Conditions of Sales and Service apply and are available on request. Dell service offerings do not affect consumer’s statutory rights. Dell, the DELL logo, the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.