OSC2012: Big Data Using Open Source: Netapp Project - Technical
1. Open source Big Data case study: Building a
platform for remote device support at NetApp
(Part II – Technical)
2. Topics
Big Data Perspective
Case Study: NetApp AutoSupport
Technology Primer
Design Overview
Copyright © 2012 Accenture All rights reserved. 2
3. Big Data
The concept is disruptive. The technology is disruptive. And, markets and
clients are being impacted.
1 Wordle for Credit Suisse, Does Size Matter Only?, September 2011
4. Shifts in Data and Analytics
The changing landscape and required winning strategies are creating shifts
within Big Data collection and analytics
Data Explosion
• Unstructured data is doubling every 3 months
• 2011 saw 47% growth overall
• By 2015, number of networked devices will be 2x global population
Monetization
• Growth of enterprise data monetization services
• Large retailers monetizing own data to provide insights to suppliers
Data-led Innovation
• De-coupling data from applications
• Disparate external data shaping context
• Cost effective mobilization of massive scale data
Social Media
• Growing market for scrubbed, aggregate data from social media and blogs
• Greater focus on data that provides insight into a customer’s digital persona
Technology
• Commodity priced storage and compute
• Emergence of open source and big data technologies solving production problems at scale
Data Mobilization
• Novel approaches to analyze unstructured data creating shorter time from data to insight
• Shift towards data consumption in multiple environments (business apps, mobile, social)
5. The Big Data Approach
Treat data as a strategic asset, seek to
maximize its value to the organization
Invest in common services, data platforms
and tools
Rapidly prototype, deliver, and measure
value-added data services, evolve over time
Culture
• Data-driven decision making
• Experimentation and continuous improvement with academic rigor
• End-to-end ownership of services
• Sharing of platform, tools and code
6. Topics
Big Data Perspective
Case Study: NetApp AutoSupport
Technology Primer
Design Overview
7. Client Context
NetApp, Inc.
• Industry: Data storage, data management
• 77% of Fortune 500 companies are customers
• Creator of Data ONTAP: industry-leading storage OS
8. AutoSupport
• Secure automated “call-home” service
• Catch issues before they become critical
• System monitoring and alerting
• RMA requests without customer action
• Faster incident management
[Diagram: Storage Devices → AutoSupport Messages → AutoSupport Data Warehouse]
9. Business Challenges
• Increase in response times / lower availability for services
• Incoming data volume doubling every 16 months
• Proliferation of ad hoc datamarts and point solutions
• Unable to analyze full AutoSupport contents efficiently
[Diagram: existing AutoSupport architecture, layered from Source and Extract through Stage, Transform and Integrate to Presentation. Components shown include parsers and loaders, custom ETL jobs, ODS, DSS, several data warehouses and ad hoc databases, rules modules behind Java, REST and stored-procedure interfaces, and consumers such as SAP CRM, MyASUP, eBI, STOR, ASUP Tools, and Analytics & Mining.]
[Chart: AutoSupport Flat-File Storage Requirement. Y-axis: usage in TB (0 to 3500). X-axis: Jan-05 to Jan-16. Series: Total Usage (tb), Projected Total Usage (tb). Annotation: "Doubles".]
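As a rough illustration of what a 16-month doubling period implies, the arithmetic can be sketched as follows. The function names and the 100 TB starting figure are assumptions chosen for illustration; they are not values read from the chart.

```python
def projected_volume(current_tb: float, months_ahead: float,
                     doubling_months: float = 16.0) -> float:
    """Project data volume assuming steady exponential growth.

    doubling_months is the period over which volume doubles
    (16 months, per the slide).
    """
    return current_tb * 2 ** (months_ahead / doubling_months)


def annual_growth_rate(doubling_months: float = 16.0) -> float:
    """Equivalent year-over-year growth rate for a given doubling period."""
    return 2 ** (12.0 / doubling_months) - 1


# Hypothetical starting point of 100 TB.
in_16_months = projected_volume(100.0, 16)   # one doubling period
in_32_months = projected_volume(100.0, 32)   # two doubling periods
yearly = annual_growth_rate()                # roughly 0.68, i.e. about 68% per year
```

Under this assumption, storage planned for today's volume would need to be roughly quadrupled within three years.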
10. Solution Design Goals
Improve data access and technology cost effectiveness and performance.
• Improve system response times
and data availability
• Expose common data services for
consumption across business units
• Standardize key business metrics
into common rules repository
• Lower operational costs as
ecosystem continues to scale
• Provide more granular analytical
capabilities
11. Role of Open Source
Platform is composed of open source technologies purpose-built for large-scale
storage, processing and analysis
1 Actual Big Data Solution Blueprint for a hybrid deployment
12. Topics
Big Data Perspective
Case Study: NetApp AutoSupport
Technology Primer
Design Overview
13. Technology Primer – Hadoop
Hadoop Distributed Filesystem (HDFS)
• Divides files into smaller “blocks”, stored across machines
• Automated replication, fault tolerance
Hadoop MapReduce
• Parallel processing for large datasets across machines
• Breaks job into tasks, using a simple map() and reduce() paradigm for data flows
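The block-and-replica idea behind HDFS can be sketched in a few lines. This is a toy model only: the function names are hypothetical, and the round-robin placement is a simplification (actual HDFS placement is rack-aware and handled by the NameNode).

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration; not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 300-unit file with 128-unit blocks yields two full blocks and a remainder.
blocks = split_into_blocks(300, 128)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

Because every block lives on several machines, losing one node leaves at least two copies of each of its blocks elsewhere in this model.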
14. Technology Primer – MapReduce
MapReduce (Simple Example – Word Count)
Map(key, value)
Reduce(key, List<value> values)
Input: “One fish, two fish, red fish, blue fish.”
Map Phase: each mapper (m) emits a <word,1> pair per word:
<one,1> <fish,1> <two,1> <fish,1> <red,1> <fish,1> <blue,1> <fish,1>
Shuffle Phase: pairs are grouped by key and routed to reducers (r), which emit:
<one,1> <two,1> <red,1> <blue,1> <fish,4>
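The word-count flow on this slide can be imitated in plain Python to make the three phases concrete. This is a single-process sketch, not Hadoop code; the function names are made up for illustration, and in a real cluster the map and reduce calls would run as parallel tasks on different machines.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(_key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in re.findall(r"[a-z]+", value.lower()):
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    mapped = [pair for i, line in enumerate(lines) for pair in map_words(i, line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, vs) for k, vs in grouped.items())


# The slide's input, one line per mapper.
counts = word_count(["One fish,", "two fish,", "red fish,", "blue fish."])
```

Here `counts` maps "fish" to 4 and each of "one", "two", "red" and "blue" to 1, matching the reducer output on the slide.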
15. Technology Primer – NoSQL
• “Not only” SQL
• Catch-all term for various non-relational database systems
• Typical areas of differentiation
  • Data model semantics
    • e.g. Database, Document, Key-Value
  • CAP trade-offs
    • Consistency, Availability, Partition-Tolerance
  • Scale-out architecture
    • e.g. Sharding, Distributed hash
• Query language
Examples: HBase, Cassandra, mongoDB, Neo4j, etc.
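To make the "sharding / distributed hash" point concrete, here is a minimal sketch of hash-based sharding for a key-value store. The class and function names are hypothetical, and none of the systems listed above is claimed to work exactly this way.

```python
import hashlib
from typing import Any, Dict, List, Optional


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys over in-memory shards."""

    def __init__(self, num_shards: int) -> None:
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        self.shards: List[Dict[str, Any]] = [{} for _ in range(num_shards)]

    def put(self, key: str, value: Any) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.shards[shard_for(key, len(self.shards))].get(key)


# Hypothetical keys: one record per storage system.
store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"system-{i}", {"id": i})
```

Plain modulo sharding forces most keys to move when the shard count changes; consistent hashing is the usual refinement that limits this movement.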
16. Topics
Big Data Perspective
Case Study: NetApp AutoSupport
Technology Primer
Design Overview
17. Data Pipeline Overview
[Diagram: Incoming Messages → Ingestion → Core Data Processing, which feeds the Data Service Interface, Ad hoc analytics, and ETL]
18. Data Ingestion
Technologies
• Apache Flume, Apache Hadoop, Drools BRMS, JMS
Capabilities
• Handle dynamic data volumes
• Normalization of disparate file formats
• Real-time aggregation of documents
• JMS alerts for critical messages
[Diagram: documents from the front-end HTTP/SMTP gateway enter through a Flume client and pass through three tiers of Flume agents: a routing tier, a parsing tier, and an aggregation & sink tier. A rules engine is also shown. Aggregated files are written to HDFS; notifications are sent over JMS.]
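The ingestion capabilities listed above (normalizing formats, alerting on critical messages, aggregating documents before they reach storage) can be sketched in plain Python. Everything here is an assumption made for illustration: the two message formats, the "severity" field, and the callbacks standing in for JMS and HDFS. It does not use the Flume, Drools, or JMS APIs.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

Doc = Dict[str, str]


def normalize(raw: str, fmt: str) -> Doc:
    """Turn a raw message in one of two hypothetical formats into a dict."""
    if fmt == "json":
        return {str(k): str(v) for k, v in json.loads(raw).items()}
    if fmt == "kv":  # one key=value pair per line
        return dict(line.split("=", 1) for line in raw.splitlines() if "=" in line)
    raise ValueError(f"unknown format: {fmt}")


def is_critical(doc: Doc) -> bool:
    """Stand-in for a business rule; a rules engine would evaluate many such rules."""
    return doc.get("severity") == "critical"


class Aggregator:
    """Buffers documents and hands them to a sink in batches ("aggregated files")."""

    def __init__(self, batch_size: int, sink: Callable[[List[Doc]], None]) -> None:
        self.batch_size = batch_size
        self.sink = sink
        self._buffer: List[Doc] = []

    def add(self, doc: Doc) -> None:
        self._buffer.append(doc)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.sink(list(self._buffer))
            self._buffer.clear()


def ingest(messages: Iterable[Tuple[str, str]], aggregator: Aggregator,
           alert: Callable[[Doc], None]) -> None:
    for raw, fmt in messages:
        doc = normalize(raw, fmt)
        if is_critical(doc):
            alert(doc)          # would be a JMS notification in the real system
        aggregator.add(doc)
    aggregator.flush()          # write out any partial batch


files: List[List[Doc]] = []     # stands in for files written to HDFS
alerts: List[Doc] = []          # stands in for JMS notifications
ingest(
    [('{"system": "a", "severity": "info"}', "json"),
     ("system=b\nseverity=critical", "kv"),
     ('{"system": "c", "severity": "info"}', "json")],
    Aggregator(batch_size=2, sink=files.append),
    alerts.append,
)
```

Batching before the write matters because HDFS handles a few large files far better than many small ones, which is one reason to aggregate documents ahead of the sink.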
19. Core Data Processing
Technologies
• MapReduce, HBase, Solr, Avro
Capabilities
• Parallel processing for increased throughput
• Efficient storage of complex data objects in Avro
[Diagram: documents gathered from Flume are processed by MapReduce jobs. Map tasks parse text contents, reduce tasks transform and derive data objects, and the derived objects are written to data stores: Solr (search indexes), HBase (primary storage), HDFS, and Hive (data warehouse).]
20. Data Services
Technologies
• Apache HBase, Solr, Tomcat
Capabilities
• Unified web services API for end
users
• Support for complex queries and
searches across multiple dimensions
with Solr
• Access both raw and derived content
for a given system
21. Analytics / ETL
Technologies
• Apache Hive, Pig, Datameer (Ad hoc analytics)
• Pentaho (ETL / Data Integration)
Capabilities
• Analytical environment for both business analysts and “power
users”
• Hive or Pig as higher level query languages
• Datameer for analytics with a spreadsheet UI
• ETL through Pentaho MapReduce
• (runs Pentaho ETL server inside of a MapReduce Job)
22. Successes and Challenges
Successes
• Web service interface contracts simplified integration with
user tools, allowed for flexibility in internal implementation
• Open source core allowed for rapid iteration
• Met or exceeded all SLAs using commodity hardware,
significantly driving down costs
Challenges
• Monitoring a large distributed system requires discipline and
a strong operations team
• Shared storage systems and Big Data technologies don’t
always play well together
• “Schemaless” systems can become a headache to
maintain, especially with complex data models
23. Thank you
Jonathan Bender
Consultant, Accenture Technology Labs
jonathan.bender@accenture.com