OSC2012: Big Data Using Open Source: Netapp Project - Technical

Open source Big Data case study: Building a
platform for remote device support at NetApp
(Part II – Technical)

Topics

• Big Data Perspective

• Case Study: NetApp AutoSupport

• Technology Primer

• Design Overview

Copyright © 2012 Accenture. All rights reserved.

Big Data

The concept is disruptive. The technology is disruptive. And, markets and
clients are being impacted.

1 Wordle for Credit Suisse, Does Size Matter Only?, September 2011


Shifts in Data and Analytics
The changing landscape and required winning strategies are creating shifts
within Big Data collection and analytics
Data Explosion
• Unstructured data is doubling every 3 months
• 2011 saw 47% growth overall
• By 2015, number of networked devices will be 2x global population

Monetization
• Growth of enterprise data monetization services
• Large retailers monetizing own data to provide insights to suppliers

Data-led Innovation
• De-coupling data from applications
• Disparate external data shaping context
• Cost effective mobilization of massive scale data

Social Media
• Growing market for scrubbed, aggregate data from social media and blogs
• Greater focus on data that provides insight into a customer’s digital persona

Technology
• Commodity priced storage and compute
• Emergence of open source and big data technologies solving production problems at scale

Data Mobilization
• Novel approaches to analyze unstructured data creating shorter time from data to insight
• Shift towards data consumption in multiple environments (business apps, mobile, social)


The Big Data Approach

Treat data as a strategic asset, seek to
maximize its value to the organization

Invest in common services, data platforms
and tools

Rapidly prototype, deliver, and measure
value-added data services, evolve over time

Culture
• Data-driven decision making
• Experimentation and continuous improvement with academic rigor
• End-to-end ownership of services
• Sharing of platform, tools and code

Client Context

NetApp, Inc.
• Industry: Data storage, data management
• 77% of Fortune 500 companies are customers
• Creator of Data ONTAP: industry-leading storage OS


AutoSupport

• Secure automated “call-home” service
• Catch issues before they become critical
• System monitoring and alerting
• RMA requests without customer action
• Faster incident management

[Figure: Storage Devices → AutoSupport Messages → AutoSupport Data Warehouse]


Business Challenges

• Increase in response times / lower availability for services
• Incoming data volume doubling every 16 months
• Proliferation of ad hoc datamarts and point solutions
• Unable to analyze full AutoSupport contents efficiently

[Figure: existing AutoSupport architecture, layered as Source, Parsers, Stage, Transform, Integrate, Presentation. Multiple parsers, custom ETL jobs, rules modules, data warehouses and ad hoc databases feed tools such as SAP CRM, MyASUP, eBI, STOR, ASUP Tools, and Analytics & Mining.]

[Figure: AutoSupport Flat-File Storage Requirement. Series: Total Usage (TB), Projected Total Usage (TB). X-axis: Jan-05 to Jan-16. Y-axis: 0 to 3500 TB. Annotation: "Doubles".]
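
The "doubling every 16 months" challenge describes exponential growth. A minimal sketch of the arithmetic, assuming a constant doubling period; the 500 TB starting volume is a hypothetical figure chosen for illustration, not a number from the slides:

```python
def projected_volume(start_tb: float, months: float, doubling_months: float = 16.0) -> float:
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return start_tb * 2 ** (months / doubling_months)


# Hypothetical starting point of 500 TB (illustrative only).
for months in (0, 16, 32, 48):
    print(months, "months:", projected_volume(500.0, months), "TB")
```

Under this assumption, four years (48 months) of growth multiplies the storage requirement by eight.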


Solution Design Goals
Improve data access and technology cost effectiveness and performance.

• Improve system response times
and data availability
• Expose common data services for
consumption across business units
• Standardize key business metrics
into common rules repository
• Lower operational costs as
ecosystem continues to scale
• Provide more granular analytical
capabilities


Role of Open Source
Platform is composed of open source technologies purpose-built for large-scale
storage, processing and analysis

1 Actual Big Data Solution Blueprint for a hybrid deployment


Technology Primer – Hadoop
Hadoop Distributed Filesystem (HDFS)
• Divides files into smaller “blocks”, stored across machines
• Automated replication, fault tolerance

Hadoop MapReduce
• Parallel processing for large datasets across machines
• Breaks job into tasks, using a simple map() and reduce() paradigm for data flows
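
The HDFS idea of dividing a file into blocks and replicating each block across machines can be sketched as a toy model. This is not the HDFS API: the function names, the 4-byte block size and the round-robin placement are assumptions for illustration (real HDFS uses much larger blocks and a rack-aware placement policy). The replication factor of 3 matches the HDFS default.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the file contents into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.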


Technology Primer – MapReduce

MapReduce (Simple Example – Word Count)

Map(key, value)
Reduce(key, List<value> values)

Input: One fish, two fish, red fish, blue fish.

Map Phase output:
<one,1> <fish,1>
<two,1> <fish,1>
<red,1> <fish,1>
<blue,1> <fish,1>

Shuffle Phase: pairs are grouped by key and sent to the reducers

Reduce output:
<one,1> <two,1> <red,1> <blue,1> <fish,4>
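
The word count example can be reproduced in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce phases, not Hadoop code; the function names are illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit <word, 1> for every word in one input line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts for one word."""
    return key, sum(values)


lines = ["One fish,", "two fish,", "red fish,", "blue fish."]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)  # fish appears 4 times, every other word once
```

In Hadoop, each input split would go to a separate map task and each key group to a reduce task, so the same logic runs in parallel across machines.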

Technology Primer – NoSQL

• “Not only” SQL
• Catch-all term for various non-relational database systems

• Typical areas of differentiation
  • Data model semantics
    • e.g. Database, Document, Key-Value
  • CAP trade-offs
    • Consistency, Availability, Partition-Tolerance
  • Scale-out architecture
    • e.g. Sharding, Distributed hash
  • Query language

Examples: HBase, Cassandra, mongoDB, Neo4j, etc.
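
As an illustration of the "Sharding, Distributed hash" bullet, here is a minimal consistent-hashing sketch. It is a generic technique, not the partitioning scheme of any specific system named above; the class name, shard names and the choice of 64 virtual nodes are assumptions.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper for illustration)."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring position after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
for key in ("system-1", "system-2", "system-3"):
    print(key, "->", ring.node_for(key))
```

The design choice worth noting: when a shard is added, only the keys that now fall to the new shard move, which keeps rebalancing cheap as the cluster scales out.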


Data Pipeline Overview

[Figure: data pipeline overview. Components shown: Incoming Messages, Ingestion, Core Data Processing, Data Service Interface, Ad hoc analytics, ETL.]


Data Ingestion
Technologies
• Apache Flume, Apache Hadoop, Drools BRMS, JMS
Capabilities
• Handle dynamic data volumes
• Normalization of disparate file formats
• Real-time aggregation of documents
• JMS alerts for critical messages

[Figure: ingestion flow. Components shown: documents from the front end HTTP/SMTP gateway, a Flume client, a routing tier, a parsing tier and an aggregation & sink tier of Flume agents, a rules engine, JMS notifications, and aggregated files written to HDFS.]
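
The ingestion capabilities (normalization, rule-based alerts, aggregation) can be sketched as a toy pipeline. This is plain Python, not the Flume, Drools or JMS APIs; every name, the two input formats and the "severity" field are hypothetical stand-ins chosen for illustration.

```python
import json
from typing import Dict, List


def normalize(raw: str) -> Dict[str, str]:
    """Parse either a JSON object or 'key=value' lines into one common dict shape."""
    text = raw.strip()
    if text.startswith("{"):
        return {str(k): str(v) for k, v in json.loads(text).items()}
    doc: Dict[str, str] = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            doc[key.strip()] = value.strip()
    return doc


def is_critical(doc: Dict[str, str]) -> bool:
    """Stand-in for a rules-engine check (hypothetical rule)."""
    return doc.get("severity", "").lower() == "critical"


class Aggregator:
    """Batches documents; `flushed` stands in for aggregated files written to HDFS."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.pending: List[Dict[str, str]] = []
        self.flushed: List[List[Dict[str, str]]] = []

    def add(self, doc: Dict[str, str]) -> None:
        self.pending.append(doc)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []


def ingest(raw_messages: List[str], aggregator: Aggregator) -> List[Dict[str, str]]:
    """Normalize, check rules, aggregate; returns the alerts (stand-in for JMS notifications)."""
    alerts = []
    for raw in raw_messages:
        doc = normalize(raw)
        if is_critical(doc):
            alerts.append(doc)
        aggregator.add(doc)
    aggregator.flush()
    return alerts


messages = [
    '{"system_id": "sys-1", "severity": "critical"}',
    "system_id = sys-2\nseverity = info",
    "system_id=sys-3\nseverity=info",
]
aggregator = Aggregator(batch_size=2)
alerts = ingest(messages, aggregator)
print(len(alerts), "alert(s);", len(aggregator.flushed), "aggregated batch(es)")
```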


Core Data Processing
Technologies
• MapReduce, HBase, Solr, Avro
Capabilities
• Parallel processing for increased throughput
• Efficient storage of complex data objects in Avro

[Figure: processing flow. Documents gathered from Flume are read from HDFS by Map and Reduce tasks. Steps labeled: parse text contents; transform and derive data objects; write derived objects to data stores. Data stores shown: Solr (search indexes), HBase (primary storage), Hive (data warehouse).]
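
The parse, derive and write steps can be sketched in plain Python. This is a single-process illustration, not the project's MapReduce jobs: the "field: value" message layout, the system_id key and the in-memory dictionaries standing in for HBase and Solr are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

Field = Tuple[str, str]


def map_document(doc: Dict[str, str]) -> Iterator[Tuple[str, Field]]:
    """Parse text contents: emit (system_id, (field, value)) per 'field: value' line."""
    for line in doc["body"].splitlines():
        if ":" in line:
            field, value = line.split(":", 1)
            yield doc["system_id"], (field.strip().lower(), value.strip())


def reduce_system(system_id: str, fields: List[Field]) -> Dict[str, str]:
    """Derive one data object per system from all of its parsed fields."""
    derived = {"system_id": system_id}
    for field, value in fields:
        derived[field] = value
    return derived


def run(documents: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map every document, group by system_id, then reduce each group."""
    grouped: Dict[str, List[Field]] = defaultdict(list)
    for doc in documents:
        for key, value in map_document(doc):
            grouped[key].append(value)
    return [reduce_system(key, values) for key, values in sorted(grouped.items())]


def write_to_stores(objects: List[Dict[str, str]]):
    """Write derived objects to a primary store and a search index (in-memory stand-ins)."""
    primary = {obj["system_id"]: obj for obj in objects}
    index: Dict[Field, Set[str]] = defaultdict(set)
    for obj in objects:
        for field, value in obj.items():
            if field != "system_id":
                index[(field, value)].add(obj["system_id"])
    return primary, index


docs = [
    {"system_id": "sys-1", "body": "Model: A100\nOS: 8.1"},
    {"system_id": "sys-2", "body": "Model: A100"},
    {"system_id": "sys-1", "body": "Serial: 42"},
]
objects = run(docs)
primary, index = write_to_stores(objects)
print(primary["sys-1"])
```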

Data Services
Technologies
• Apache HBase, Solr, Tomcat
Capabilities
• Unified web services API for end
users
• Support for complex queries and
searches across multiple dimensions
with Solr
• Access both raw and derived content
for a given system
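
A unified data service can be pictured as a thin facade over the underlying stores. The sketch below is hypothetical: the class, method names and in-memory dictionaries are stand-ins, and the real services were web services backed by HBase and Solr rather than Python objects.

```python
from typing import Dict, List


class DataService:
    """Hypothetical facade exposing raw and derived content through one interface."""

    def __init__(self, raw: Dict[str, List[str]], derived: Dict[str, Dict[str, str]]) -> None:
        self._raw = raw          # stand-in for raw message storage
        self._derived = derived  # stand-in for derived objects and their index

    def get_system(self, system_id: str) -> Dict[str, object]:
        """Return both raw and derived content for a given system."""
        return {
            "system_id": system_id,
            "raw": self._raw.get(system_id, []),
            "derived": self._derived.get(system_id, {}),
        }

    def search(self, **criteria: str) -> List[str]:
        """Return ids of systems matching every criterion (search across dimensions)."""
        return sorted(
            system_id
            for system_id, fields in self._derived.items()
            if all(fields.get(key) == value for key, value in criteria.items())
        )


service = DataService(
    raw={"sys-1": ["message text 1"], "sys-2": ["message text 2"]},
    derived={
        "sys-1": {"model": "A100", "os": "8.1"},
        "sys-2": {"model": "A100", "os": "8.0"},
    },
)
print(service.search(model="A100", os="8.1"))
```

Keeping callers behind one interface like this is what lets the internal storage change without breaking user tools, the point made under Successes.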


Analytics / ETL
Technologies
• Apache Hive, Pig, Datameer (Ad hoc analytics)
• Pentaho (ETL / Data Integration)
Capabilities
• Analytical environment for both business analysts and “power
users”
• Hive or Pig as higher level query languages
• Datameer for analytics with a spreadsheet UI
• ETL through Pentaho MapReduce
• (runs Pentaho ETL server inside of a MapReduce Job)


Successes and Challenges
Successes
• Web service interface contracts simplified integration with
user tools, allowed for flexibility in internal implementation
• Open source core allowed for rapid iteration
• Met or exceeded all SLAs using commodity hardware,
significantly driving down costs
Challenges
• Monitoring a large distributed system requires discipline and
a strong operations team
• Shared storage systems and Big Data technologies don’t
always play well together
• “Schemaless” systems can become a headache to
maintain, especially with complex data models


Thank you

Jonathan Bender
Consultant, Accenture Technology Labs
jonathan.bender@accenture.com

