Join us to learn how Hortonworks Data Platform and Nimble Storage provide an enterprise-ready data platform for multi-workload data processing. HDP supports an array of processing methods — from batch through interactive to real-time, with key capabilities required of an enterprise data platform — spanning Governance, Security and Operations. Nimble Storage provides the performance, capacity, and availability for HDP and allows you to take advantage of Hadoop with minimal changes to existing data architectures and skillsets.
Enterprise Hadoop with Hortonworks and Nimble Storage
1. Page 1 Hortonworks Confidential 2014
Enterprise Hadoop with Hortonworks and Nimble Storage
Ajay Singh
Director of Technical Alliance - Hortonworks
Ibrahim “Ibby” Rahmani
Product and Solutions Marketing - Nimble Storage
2. Agenda
• Hortonworks Overview
• Big Data Use Cases
• Hadoop Journey and Phases of Adoption
• Requirements of Enterprise Hadoop
• Key Trends
4. Our Mission: Power your Modern Data Architecture with HDP and Enterprise Apache Hadoop
Who we are
June 2011: Original 24 architects, developers, operators of Hadoop from Yahoo!
June 2014: An enterprise software company with 420+ Employees
Key Partners
Our model
Innovate and deliver Apache Hadoop as a complete enterprise data platform, completely in the open, backed by a world-class support organization
6. HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
[Figure: Hortonworks Data Platform component versions by release (HDP 2.0, October 2013; HDP 2.1, April 2014; HDP 2.2, December 2014), grouped into Data Management, Data Access, Governance & Integration, Security, and Operations. Components shown: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr]
* Version numbers are targets and subject to change at time of general availability in accordance with ASF release process
7. Community Leadership
Leading Hadoop Innovations; 100% Open Source
[Diagram: HDP 2.1 stack (Governance & Integration, Security, Operations, Data Access, Data Management). YARN: Data Operating System, with Script (Pig), Search (Solr), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), and Batch (MapReduce) on HDFS (Hadoop Distributed File System)]
Contributes more to the Apache Hadoop ecosystem in the ASF than any other vendor
Hadoop is a platform decision
• Open Source: fastest path to innovation for a platform technology
• Eliminate vendor lock-in, no proprietary software
• Data center leaders have committed to the open source approach
Apache Project    Committers    PMC Members
Hadoop            27            20
Accumulo          2             2
Ambari            33            27
Falcon            5             3
Flume             1             0
HBase             6             4
Hive              17            4
Knox              12            3
Oozie             3             2
Pig               5             5
Sqoop             1             1
Storm             3             2
Tez               15            15
Zookeeper         2             1
TOTAL             132           89
8. Proven By Customer Success
Customer Momentum
• 300+ customers in seven quarters, growing at 75+/quarter
• 30+ customers migrated from other distributions
• Two thirds of customers come from F1000
• 100% Renewal Rate
Largest Cluster in North America: 32,000 Nodes
Largest Cluster in Europe: 1,000 Nodes
Largest Known Cluster in APAC: 400 Nodes
Experience at Scale: 80,000 nodes under contract
Market Leadership: Fastest growing Fortune 1000 customer base
9. Big Data Trends & The Modern Data Architecture
10. Traditional systems under pressure
• Silos of Data
• Costly to Scale
• Constrained Schemas
…and difficult to manage new data
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP
LIMITATIONS: Silos & Expensive, Single Purpose
SOURCES
Existing Sources (CRM, ERP,…)
New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs
11. 1. Unlock New Applications from New Types of Data
INDUSTRY USE CASE by data type (Sentiment & Web; Clickstream & Behavior; Machine & Sensor; Geographic; Server Logs; Structured & Unstructured)
Financial Services
New Account Risk Screens ✔ ✔
Trading Risk ✔
Insurance Underwriting ✔ ✔ ✔
Telecom
Call Detail Records (CDR) ✔ ✔
Infrastructure Investment ✔ ✔
Real-time Bandwidth Allocation ✔ ✔ ✔
Retail
360° View of the Customer ✔ ✔ ✔
Localized, Personalized Promotions ✔
Website Optimization ✔
Manufacturing
Supply Chain and Logistics ✔
Assembly Line Quality Assurance ✔
Crowd-sourced Quality Assurance ✔
Healthcare
Use Genomic Data in Medical Trials ✔ ✔ ✔
Monitor Patient Vitals in Real-Time ✔ ✔
Pharmaceuticals
Recruit and Retain Patients for Drug Trials ✔ ✔
Improve Prescription Adherence ✔ ✔ ✔ ✔
Oil & Gas
Unify Exploration & Production Data ✔ ✔ ✔ ✔
Monitor Rig Safety in Real-Time ✔ ✔ ✔
Government
ETL Offload/Federal Budgetary Pressures ✔ ✔
Sentiment Analysis for Government Programs ✔
12. 2. Or to realize a dramatic cost savings…
EDW Optimization
Current Reality (OPERATIONS 50%, ETL PROCESS 30%, ANALYTICS 20%)
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
Augment w/ Hadoop (OPERATIONS 50%, ANALYTICS 50%)
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
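The schema-on-read point above can be illustrated with a minimal sketch. It assumes raw records are kept as JSON lines and that structure is applied only when the data is read; the record fields, the `read_with_schema` helper, and both schemas are hypothetical and are not part of HDP or of the original slides.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"user": "a17", "page": "/home", "ms": 120}',
    '{"user": "b42", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    '{"user": "a17", "page": "/cart"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, coerce types, default missing values."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }

# Two different views over the same stored data, each defined after loading.
latency_schema = {"page": (str, ""), "ms": (int, 0)}
promo_schema = {"user": (str, ""), "coupon": (str, None)}

latency_rows = list(read_with_schema(RAW_LINES, latency_schema))
promo_rows = list(read_with_schema(RAW_LINES, promo_schema))
```

Because nothing is discarded at load time, a new question (here, coupon usage) only needs a new schema, not a reload of the source data.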
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on a $0 to $180,000 scale, comparing MPP, SAN, Engineered System, NAS, HADOOP, and Cloud Storage]
Commodity Compute & Storage
Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
Hadoop: Parse, Cleanse, Apply Structure, Transform
Storage Costs/Compute Costs from $19/GB to $0.23/GB
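As a rough check on what the quoted per-GB figures imply, the arithmetic can be sketched as follows. Only the two per-GB costs come from the slide; the 100 TB dataset size and the decimal TB-to-GB conversion are assumptions made for illustration.

```python
# Per-GB figures quoted on the slide.
COST_BEFORE_PER_GB = 19.00
COST_AFTER_PER_GB = 0.23

def cost_for(terabytes, cost_per_gb, gb_per_tb=1000):
    """Total cost in dollars for a dataset of the given size (decimal TB assumed)."""
    return terabytes * gb_per_tb * cost_per_gb

# How many times cheaper the lower figure is, and the fractional saving.
reduction_factor = COST_BEFORE_PER_GB / COST_AFTER_PER_GB
savings_fraction = 1 - COST_AFTER_PER_GB / COST_BEFORE_PER_GB

# Hypothetical 100 TB dataset, chosen only for illustration.
before = cost_for(100, COST_BEFORE_PER_GB)
after = cost_for(100, COST_AFTER_PER_GB)
```

Under these assumptions the lower figure is roughly 83 times cheaper per GB, a saving of about 98.8%.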
13. 3. Data Lake: An architectural shift
Unlocking the Data Lake
[Diagram: SCALE and SCOPE axes; RDBMS, MPP, EDW; New Analytic Apps or IT Optimization; Data Lake on HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN)]
Data Lake Enabled by YARN
• Single data repository, shared infrastructure
• Multiple biz apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
15. Business Value from Hadoop
Flight Plan for a Journey in Four Phases
1 Potential Value
2 Operational Value
3 Strategic Value
4 Data-Driven Organization
Milestones: Awareness & Interest; Evaluation – Business Value; Evaluation – Technical; Point Deployment; Point Production; Enterprise Deployment; Enterprise Production; Industry Leadership
Flight plan – typical elapsed time* from start of phase 1 in months: 2-6, 9-15, 18-36
* Timeline varies by company size. Often smaller or focused online businesses achieve milestones at the shorter end of the range.
16. What Would You Like to Accomplish?
Levels of Success with Hadoop
1 Potential Value | 2 Operational Value | 3 Strategic Value | 4 Data-Driven Organization
CXO
• Recognition of potential
• Mandate to explore
• Recognition of value realized
• Sponsorship to expand use
• Recognition of material value realized
• Sponsorship to transform organization
• Competitive advantage
• CDO part of Exec Team
Line of Business
• Basic understanding of the value of Hadoop to the business
• Value realized in 1 area
‒ Customer intimacy
‒ Operational excellence
‒ Risk, security, compliance
‒ New business
• Value realized and tracked in many areas
‒ Customer intimacy
‒ Operational excellence
‒ Risk, security, compliance
‒ New business
• Data managed like capital
• Intelligence at the front line
• JIT decision making
• Widespread value creation
Analytics & Applications
• Basic understanding how Hadoop fits into existing landscape
• BI and EDW access to Hadoop
• Some new analytic apps, often batch
• Few use cases and processing engines
• Many sources and time periods
• Mostly departmental silos
• 10-50 enterprise users
• Hadoop consumable by any department, both technically and process-wise
• New apps natively on Hadoop, often transactional or real-time
• Many use cases and processing engines
• Multiple lenses into common data pool
• Emerging data science team
• 50-500 enterprise users
• Data-driven culture
• High-performing data science team
• Use cases build on each other
• 500-5000 enterprise users
Data Mgt. & Security
• Basic understanding how Hadoop fits
• Benefitting from schema on read
• Professionalizing data definitions and models
• Collaboration and granular security controls governing use of shared data
• Incentives and process to encourage consumption of shared data
Infrastructure
• Basic fluency with core technical concepts of Hadoop
• 1 or more production environments
• Multi-tenant shared service worldwide
• Data Lake
• Service Desk / CoE
• Hadoop community participation and contribution
18. The 1st Generation of Hadoop: Batch
HADOOP 1.0
Built for Web-Scale Batch Apps
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
[Diagram: Single App stacks labeled BATCH, INTERACTIVE, and ONLINE on HDFS]
19. Hadoop 2 & YARN based Architecture
Architected & led development of YARN to enable the Modern Data Architecture
[Timeline: 2006, 2009, October 23, 2013. Milestones: Hadoop w/ MapReduce; MR-279: YARN; Hadoop 2 & YARN]
Hadoop w/ MapReduce: MapReduce on HDFS (Hadoop Distributed File System), largely batch processing
• Silo’d clusters
• Largely batch system
• Difficult to integrate
Hadoop 2 & YARN: YARN: Data Operating System on HDFS (Hadoop Distributed File System), supporting Batch, Interactive, and Real-Time
20. A Blueprint for Enterprise Hadoop
GOVERNANCE & INTEGRATION: Load data and manage according to policy
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
YARN Data Operating System
21. HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines); Tez and Slider shown beneath the engines
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
SECURITY: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered Completely in the OPEN
23. Modern Data Architecture
Hadoop As Enterprise Data Lake
• Enterprise Hadoop as single consolidated Data Lake
• Deep Integration with existing systems
• Accelerated Interactive & Real-Time Capabilities
• Central services for security, governance and operation
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP; YARN: Data Operating System (Batch, Interactive, Real-Time) on HDFS (Hadoop Distributed File System)
SOURCES: EXISTING Systems (CRM, ERP, Other); Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured
24. Multiple Deployment Choices
On-Premise & Cloud Deployments
Physical & Virtual Clusters
Deployment Choice
• Linux, Windows
• On-Premises, Public/Private Cloud, Hybrid
“Tethered” Clusters
• Compatible services
• An explicit “connection”
Synchronized Datasets
• Efficient sharing & access
• Governance & lineage
[Diagram: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster; label: Learn]
25. Cloud Backup & Storage Tiering
Dataset Backup / Archival
• Deliver business continuity through replication across on-premises and cloud-based storage targets: Microsoft Azure and Amazon S3
• Lineage as a GA feature with supporting documentation and examples
Storage Tiers in HDFS
• HDFS Heterogeneous storage tiering feature
• Allow for the definition of hot/cold storage tiers within a cluster with all data remaining in cluster for data lake
• Higher density storage, lower CPU and memory footprint machines further drive costs down for the hardware used in the cold storage tier
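The hot/cold tiering idea above can be sketched as a simple placement rule. This is an illustration only: the 90-day threshold, the dataset paths, and the `choose_tier` and `plan_tiers` helpers are hypothetical, and the sketch does not call any HDFS API; it only shows how a tier might be chosen per dataset while all data stays in one cluster.

```python
from datetime import date

# Hypothetical threshold: datasets untouched for this many days are treated as cold.
COLD_AFTER_DAYS = 90

def choose_tier(last_accessed, today, cold_after_days=COLD_AFTER_DAYS):
    """Return "hot" or "cold" for a dataset based on days since last access."""
    idle_days = (today - last_accessed).days
    return "cold" if idle_days >= cold_after_days else "hot"

def plan_tiers(datasets, today):
    """Map each dataset path to a tier; every path stays in the same cluster."""
    return {path: choose_tier(last, today) for path, last in datasets.items()}

# Hypothetical dataset paths and access dates, for illustration only.
datasets = {
    "/data/clickstream/2014-11": date(2014, 11, 28),
    "/data/clickstream/2013-01": date(2013, 2, 1),
}
plan = plan_tiers(datasets, today=date(2014, 12, 1))
```

In practice the rule could be driven by access statistics or business policy rather than a fixed age cutoff; the point is that tier assignment is per dataset and does not move data out of the cluster.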
[Diagram: Production Cluster and Backup & Archive Cluster]
26. Expanded Infrastructure Choices
Servers with Internal Storage
• High performance
• Low upfront cost
• Limited data movement
Servers with Shared Storage
• Ease of administration
• Independent scale out of compute & storage
• Shared storage infrastructure for Big Data and Legacy applications
Key Technology Trends
• Fast & cost effective networks
• SSD storage
• High memory servers
• Scale out shared storage sub systems