Solving Big Data Problems
using Hortonworks
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hortonworks Company Profile
• The only 100% open source Apache Hadoop data platform
• Founded in 2011
• 1st Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
• 800+ employees across 17 countries
• 1,350 technology partners
• Fastest company to reach $100M in revenue
Let’s talk about Big Data
(September 2014 survey of 100 CIOs from the US and Europe)
What problems and opportunities does Big Data
create?
New data: data that traditional platforms cannot handle
The Opportunity
Unlock transformational business value from full-fidelity data and analytics across all of your data.
Geolocation
Server logs
Files & emails
ERP, CRM, SCM
Traditional Data Sources
New Data Sources
Sensors
and machines
Clickstream
Social media
The Future of Data: Actionable Intelligence
(Diagram: Data in Motion flowing from the Internet of Anything into storage, alongside Data at Rest)
Hortonworks Data Platform
H O R T O N W O R K S D ATA P L AT F O R M
Batch Interactive Search Streaming Machine Learning
YARN Resource Management System
Sources: Clickstream, Sensor, Social, Mobile, Geolocation, Server Logs, Existing Systems
HDP is a collection of Apache Projects
HORTONWORKS DATA PLATFORM
Hadoop & YARN
Flume
Oozie
Pig
Hive
Tez
Sqoop
Cloudbreak
Ambari
Slider
Kafka
Knox
Solr
Zookeeper
Spark
Falcon
Ranger
HBase
Atlas
Accumulo
Storm
Phoenix
DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY
Ongoing Innovation in Apache
• HDP 2.0: Oct 2013
• HDP 2.1: April 2014
• HDP 2.2: Dec 2014
• HDP 2.3: July 2015
(Release matrix: each HDP release ships updated, certified versions of the component projects above.)
Hortonworks Data Flow
Visual User Interface
Drag and drop for efficient, agile operations
Immediate Feedback
Start, stop, tune, replay dataflows in real-time
Adaptive to Volume and Bandwidth
Any data, big or small
Event Level Data Provenance
Governance, compliance & data evaluation
Secure Data Acquisition & Transport
Fine grained encryption for controlled data
sharing and selective data democratization
Powered by
Apache NiFi
HDF and HDP Deliver a Complete Big Data Solution
• HDF dynamically connects HDP to
data at the edge
• HDF secures and encrypts the
movement of data into HDP
• HDF includes mature IoAT data
protocols that improve device
extensibility
• HDF supports easily adjustable bi-directional IoAT dataflows
• HDF offers traceability of IoAT data
with lineage and audit trails
• HDF brings a real-time, visual user
interface to manipulate live dataflows
Hortonworks Revenue Model
HDP and HDF are 100% free and
Open Source – no license.
Our customers subscribe to
support, consulting experts and
training programs
Annual Subscriptions
align your success with ours
Expert Consulting & Training
help your team get to actionable intelligence
as efficiently as possible
(Lifecycle: Architect & Develop → Deploy → Operate → Expand, repeated across Projects 1–6)
Sales Plays
Hadoop Driver: Cost optimization
Archive Data off EDW
Move rarely used data to Hadoop as active
archive, store more data longer
Offload costly ETL process
Free your EDW to perform high-value functions like
analytics & operations, not ETL
Enrich the value of your EDW
Use Hadoop to refine new data sources, such as
web and machine data for new analytical context
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
HDP helps you reduce costs and optimize the value associated with your EDW
ANALYTICS / DATA SYSTEMS
Data
Marts
Business
Analytics
Visualization
& Dashboards
HDP 2.3
ELT
Cold Data,
Deeper Archive
& New Sources
Enterprise Data
Warehouse
Hot
MPP
In-Memory
Clickstream Web	
&	Social
Geolocation Sensor	
& Machine
Server	
Logs
Unstructured
Existing Systems
ERP CRM SCM
SOURCES
Single View
Improve acquisition and retention
Predictive Analytics
Identify your next best action
Data Discovery
Uncover new findings
Financial Services
New Account Risk Screens Trading Risk Insurance Underwriting
Improved Customer Service Insurance Underwriting Aggregate Banking Data as a Service
Cross-sell & Upsell of Financial Products Risk Analysis for Usage-Based Car Insurance Identify Claims Errors for Reimbursement
Telecom
Unified Household View of the Customer Searchable Data for NPTB Recommendations Protect Customer Data from Employee Misuse
Analyze Call Center Contacts Records Network Infrastructure Capacity Planning Call Detail Records (CDR) Analysis
Inferred Demographics for Improved Targeting Proactive Maintenance on Transmission Equipment Tiered Service for High-Value Customers
Retail
360° View of the Customer Supply Chain Optimization Website Optimization for Path to Purchase
Localized, Personalized Promotions A/B Testing for Online Advertisements Data-Driven Pricing, improved loyalty programs
Customer Segmentation Personalized, Real-time Offers In-Store Shopper Behavior
Manufacturing
Supply Chain and Logistics Optimize Warehouse Inventory Levels Product Insight from Electronic Usage Data
Assembly Line Quality Assurance Proactive Equipment Maintenance Crowdsource Quality Assurance
Single View of a Product Throughout Lifecycle Connected Car Data for Ongoing Innovation Improve Manufacturing Yields
Healthcare
Electronic Medical Records Monitor Patient Vitals in Real-Time Use Genomic Data in Medical Trials
Improving Lifelong Care for Epilepsy Rapid Stroke Detection and Intervention Monitor Medical Supply Chain to Reduce Waste
Reduce Patient Re-Admittance Rates Video Analysis for Surgical Decision Support Healthcare Analytics as a Service
Oil & Gas
Unify Exploration & Production Data Monitor Rig Safety in Real-Time Geographic exploration
DCA to Slow Well Declines Curves Proactive Maintenance for Oil Field Equipment Define Operational Set Points for Wells
Government
Single View of Entity CBM & Autonomic Logistic Analysis Sentiment Analysis on Program Effectiveness
Prevent Fraud, Waste and Abuse Proactive Maintenance for Public Infrastructure Meet Deadlines for Government Reporting
Hadoop Driver: Advanced analytic applications
NiFi and HDF Drivers
Optimize Splunk:
Reduce costs by pre-filtering data so that
only relevant content is forwarded into Splunk
Ingest Logs for Cyber Security:
Integrated and secure log collection for real-
time data analytics and threat detection
Feed Data to Streaming Analytics:
Accelerate big data ROI by streaming data
into analytics systems such as Apache Storm
or Apache Spark Streaming
Move Data Internally:
Optimize resource utilization by moving
data between data centers or between
on-premises infrastructure and cloud
infrastructure
Capture IoT Data:
Transport disparate and often remote IoT
data in real time, despite any limitations
in device footprint, power or
connectivity—avoiding data loss
Hadoop Driver: Enabling the data lake
(Chart axes: SCALE and SCOPE)
Data Lake Definition
• Centralized Architecture
Multiple applications on a shared data set
with consistent levels of service
• Any App, Any Data
Multiple applications accessing all data
affording new insights and opportunities.
• Unlocks ‘Systems of Insight’
Advanced algorithms and applications
used to derive new value and optimize
existing value.
Drivers:
1. Cost Optimization
2. Advanced Analytic Apps
Goal:
• Centralized Architecture
• Data-driven Business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight
Case Study: 12 month Hadoop evolution at TrueCar
Data platform capabilities built over a 12 month execution plan:
• June 2013: Begin Hadoop execution
• July 2013: Hortonworks partnership
• Aug 2013: Training & dev begins
• Nov 2013: Production cluster (60 nodes, 2 PB)
• Dec 2013: Three production apps (3 total)
• Jan 2014: 40% of dev staff proficient
• Feb 2014: Three more production apps (6 total)
• May 2014: IPO
12 Month Results at TrueCar
• Six production Hadoop applications
• Sixty nodes / 2 PB of data
• Storage/compute costs reduced from $19/GB to $0.12/GB
“We addressed our data platform capabilities
strategically as a pre-cursor to IPO.”
Hortonworks Data Platform
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for
managing large volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
ü Manages new data paradigm
ü Handles data at scale
ü Cost effective
ü Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce
(2006 – 2009)
HDFS	
(Hadoop	Distributed	File	System)
MapReduce
Largely	Batch	Processing
Hadoop w/	
MapReduce
YARN: Data Operating System
HDFS
(Hadoop Distributed File System)
Hadoop2 & YARN based Architecture
Silo’d clusters
Largely batch system
Difficult to integrate
MR-279:	YARN
Hadoop 2 & YARN
Batch, Interactive, Real-Time
Architected &
led development
of YARN to enable
the Modern Data
Architecture
October 23, 2013
Apache Hadoop – Data Operating System
Shared Compute & Workload Management
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
Common & Shared Scale Out Storage
• Shared data assets
• Flexible schema
• Cross workload access
YARN: Data Operating System
(Cluster Resource Management)
Script
Pig
SQL
Hive
Tez Tez
Java
Scala
Cascading
Tez
Others
ISV
Engines
HDFS
(Hadoop Distributed File System)
Stream
Storm
Search
Solr
NoSQL
HBase
Accumulo
Slider Slider
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
In-Memory
Spark
Enterprise Hadoop
Core Capabilities of Enterprise Hadoop
Load data and
manage according
to policy
Deploy and
effectively
manage the
platform
Store and process all of your Corporate Data Assets
Access your data simultaneously in multiple ways
(batch, interactive, real-time)
Provide a layered approach to security through
Authentication, Authorization, Accounting, and Data Protection
DATA		MANAGEMENT
SECURITYDATA		ACCESS
GOVERNANCE	&	
INTEGRATION
OPERATIONS
Enable both existing and new applications to
provide value to the organization
PRESENTATION	&	APPLICATION
Empower existing operations and
security tools to manage Hadoop
ENTERPRISE	MGMT	&	SECURITY
Provide deployment choice across physical, virtual, cloud
DEPLOYMENT	OPTIONS
Hortonworks Data Platform 2.3
YARN : Data Operating System
DATA ACCESS SECURITY
GOVERNANCE &
INTEGRATION
OPERATIONS
Administration
Authentication
Authorization
Auditing
Data Protection
Ranger
Knox
Atlas
HDFS Encryption
Data Workflow
Sqoop
Flume
Kafka
NFS
WebHDFS
Provisioning,
Managing, &
Monitoring
Ambari
Cloudbreak
Zookeeper
Scheduling
Oozie
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory
Spark
Others
ISV Engines
Tez Tez Tez Slider Slider
HDFS Hadoop Distributed File System
DATA MANAGEMENT
Hortonworks Data Platform 2.3
Deployment Choice: Linux, Windows, On-Premises, Cloud
Data Lifecycle &
Governance
Falcon
Atlas
Architectures
Basic EDW Cost Optimization Architecture
Batch
Sqoop
Transform
Processed
Hive
Raw
HDFS
Interactive
HiveServer
Reporting
BI Tools
Load
EDW
Existing Analytics
Fetch
1
2
3
4
External
Tables
More than cost savings: enrich with new data
Batch
Sqoop
Transform
Processed
Hive
Raw
HDFS
Interactive
HiveServer
Reporting
BI Tools
Load
EDW
New Sources
Streaming
NiFi
Load
Existing Analytics
Fetch
New Analytics
1
2
3
4
5
6
External
Tables
Streaming Solution Architecture
HDP 2.x Data Lake
YARN
HDFS
APACHE	
KAFKA
Search
Solr
Slider
Online	Data	
Processing
HBase
Accumulo
Real	Time	Stream	
Processing
Storm SQL
HiveStreaming
Ingest
HDFS
HDP 2.x
Real-time
data feeds
Key Tenets of Lambda Architecture
§ Batch Layer
§ Manages master data
§ Immutable, append-only set of raw data
§ Cleanse, Normalize & Pre-Compute
Batch Views
§ Advanced Statistical Calculations
§ Speed layer
§ Real Time Event Stream Processing
§ Computes Real-Time Views
§ Serving Layer
§ Low-latency, ad-hoc query
§ Reporting, BI & Dashboard
New Data
Stream
Store Pre-Compute Views
Process
Streams
Incremental
Views
Business
View
Business
View
Query
SPEED LAYER
BATCH LAYER
SERVING LAYER
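The three layers above can be sketched in a few lines of plain Python. This is an illustrative toy, not HDP code: the batch layer pre-computes a view over the immutable master dataset, the speed layer maintains an incremental view over newly arrived events, and the serving layer merges both at query time. The event names are invented for the example.

```python
from collections import Counter

# Immutable, append-only master dataset (batch layer input)
# and events that arrived after the last batch run (speed layer input).
raw_events = ["login", "click", "click", "purchase"]
new_events = ["click", "purchase"]

# Batch layer: pre-compute a view over all master data.
batch_view = Counter(raw_events)

# Speed layer: incrementally maintained real-time view.
realtime_view = Counter(new_events)

# Serving layer: answer low-latency queries by merging the two views.
def query(event_type):
    return batch_view[event_type] + realtime_view[event_type]

print(query("click"))     # 2 from batch + 1 from speed = 3
print(query("purchase"))  # 1 + 1 = 2
```

When the next batch run absorbs `new_events` into the master dataset, the real-time view is discarded and rebuilt, which is what keeps the speed layer small.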
HDP and HDF
High Level Big Data IoT Architecture
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Page 28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Project Cost & ROI
www.hortonworks.com
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them.
Ms. Brady sets the goal of
reducing incidents by 5%
within 90 days.
Incidents involving maintenance vehicles have continued to increase under COO Brady’s watch
(Chart: insurance premiums rising year over year, 2012 to 2015, reaching $17.5M)
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of the incidents and reduce them.
Business Analyst
Tam
Mega Corp has a problem
Given the current premium cost of $3,500 per
truck on 5,000 trucks, a 10% reduction in
incidents will move the company from the high
risk insurance category they are currently in
and save the company $1,000 on its
insurance premium per truck per year,
or $5,000,000 annually.
Business Analyst
Tam
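Tam's savings estimate can be sanity-checked in a few lines, using only the figures stated above:

```python
# Sanity check of Tam's savings estimate for Mega Corp.
TRUCKS = 5000
SAVING_PER_TRUCK = 1000   # premium reduction per truck per year, USD

def annual_savings(trucks=TRUCKS, saving_per_truck=SAVING_PER_TRUCK):
    """Total premium reduction per year across the fleet."""
    return trucks * saving_per_truck

print(annual_savings())  # 5000 trucks x $1,000 = 5000000
```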
Tam considers four questions she must answer to better understand and mitigate incidents. They are:
1) Is there a correlation of driver
training to incidents?
2) Is there a correlation of weather
to incidents?
3) Is there a correlation between
certain driving behavior and
incidents?
4) Is it possible to predict incidents
before they occur?
Business Analyst
Tam
Shift from Reactive …to… Proactive & Prescriptive
• From reacting to human activity …to Behavioral Insight
• From static resource planning …to Resource Optimization
• From break-then-fix …to Preventative Maintenance
Initially, Tam’s team is concerned that they may not be
able to capture all the necessary data to answer the
questions Tam has posed and help her mitigate
incidents. They know that the data is not all
structured and some of it is created in real-time and
transmitted over the Internet. In addition, some data
will have to be captured from external sources.
Vehicle Data
Route Data
Weather Data
Structured Driver
Data
Semi-Structured
Maintenance Data
Sue, Varun, Jeff
DATA SYSTEMS
Enterprise Data
Warehouse
Hot
MPP
In-Memory
1
2
Clickstream Web	
&	Social
Geolocation Sensor	
& Machine
Server	
Logs
Unstructured
RDBMS, ERP, CRM
Systems of Record
The Team Recognizes That the Current Data Architecture
Limits Predictive Capabilities
1. Data Silos: difficult to find
predictive correlations
2. Data Volumes: cannot
store enough data to find
patterns
3. New Data Sources:
unable to capture and use
new data for real-time
analysis
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
3
DATA SYSTEMS
Enterprise Data
Warehouse
Hot
MPP
In-Memory
RDBMS, ERP, CRM
Systems of Record
The Team Leverages HDF & HDP to Expand The
Capabilities of Their Existing Data Platform
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
• Business Analyst + HDP Data Analyst Training = HDP Data Analyst
• Developer + Developer Training = HDP Developer
• System Admin + HDP System Admin Training = HDP Sys Admin
• SME + Data Science Training = HDP Data Scientist
Developer, System Admin, SME: Sue, Varun, Jeff
Business Analyst
Tam
The team engages their favorite SI and attends Hortonworks University training to get the project under way
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Project Cost & ROI
Stream	Processing	&	Modeling
(Kafka,		Storm	&	Spark)
Solution Architecture
Distributed	Storage:	HDFS
Many	Workloads:	YARN
Real-time Serving & Searching (HBase)
Alerts & Events
Real-Time Web App
Interactive Query (Hive on Tez)
SQL
Single	cluster	with	
consistent	security,	
governance	&	
operations
Collect,	Conduct	&	Curate	
(HDF	– Bidirectional	Data	Flow)
Truck	Sensors
The chosen solution provides Mega Corp with the
foundation to capture all the required data, analyze
correlations, and ultimately create a model that allows them
to predict and mitigate incidents before they happen.
Weather	Data
EDW
Sqoop
Tam and Varun build the
application
HDP Analyst
Tam Varun
DeveloperAnalyst
Ms. Brady is happy with the
results. She is able to
determine that a subset of
drivers are responsible for the
increased cost. But like most
managers she is not happy for
long. Now she wants to be able
to predict future incidents.
Data Scientist
Machine Learning
Jeff points out that HDP has a tremendous statistical algorithm library, and he can use these libraries to predict which drivers are likely to have an event before it occurs.
Jeff
Jeff implements the predicted-violations logic using machine learning on HDP and is able to predict events before they happen
Ms. Brady is happy now that
she can isolate where problems
exist, identify causal events
and build models that help
predict events before they
occur.
< TODO: Show St. Louis Case
Study >
http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Project Cost & ROI
Big Data Functional Architecture
Storm/Spark Streaming
Storm
Detailed Reference Architecture for IoT Applications
HDF
Flume
Sink to
HDFS
Transform
Interactive
UI Framework
Hive
Hive
HDFS
HDFS
SOURCE DATA
Server logs
Application Logs
Firewall Logs
CRM/ERP
Sensor
Kafka
Kafka
Stream to
HDF
Forward to
Storm
Real Time Storage
Spark-ML
Pig
Alerts
Bolt to
HDFS
Dashboard
Silk
JMS
Alerts
Hive Server
HiveServer
Reporting
BI Tools
High Speed
Ingest
Real-Time
Batch Interactive
Machine Learning
Models
Spark
Pig
Alerts SQOOP
Flume
Iterative ML
HBase/Phoenix
HBase
Event Enrichment
Spark-Thrift
Pig
Sample Ingest: NiFi
Apache Storm – Key Attributes
Open source, real-time event stream processing platform that provides continuous, low-latency processing for very high frequency streaming data
• Highly scalable: horizontally scalable like Hadoop; e.g., a 10-node cluster can process 1M tuples per second
• Fault-tolerant: automatically reassigns tasks on failed nodes
• Guaranteed processing: supports at-least-once & exactly-once processing semantics
• Language agnostic: processing logic can be defined in any language
• Apache project: brand, governance & a large, active community
Storm - Basic Concepts
Spouts: Generate streams.
Tuple: The most fundamental data structure; a named
list of values that can be of any datatype
Streams: Groups of tuples
Bolts: Contain data processing, persistence and alerting
logic. Can also emit tuples for downstream bolts
Tuple Tree: First spout tuple and all the tuples that were
emitted by the bolts that processed it
Topology: Group of spouts and bolts wired together into a
workflow
Topology
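The spout/bolt/topology vocabulary can be illustrated with a pure-Python toy. This is not the Storm API, just the dataflow concept: a spout emits tuples, a bolt applies processing logic and emits tuples for downstream bolts, and wiring them together forms a topology. The truck-speed data is invented for the example.

```python
# Toy illustration of Storm's spout -> bolt dataflow (not the real Storm API).

def sensor_spout():
    """Spout: generates a stream of tuples (named lists of values)."""
    readings = [("truck-1", 62), ("truck-2", 87), ("truck-3", 55)]
    for truck_id, speed in readings:
        yield {"truck": truck_id, "speed": speed}

def speeding_bolt(stream, limit=80):
    """Bolt: processing logic; emits new tuples for downstream bolts."""
    for tup in stream:
        if tup["speed"] > limit:
            yield {"truck": tup["truck"], "alert": "speeding"}

def alert_bolt(stream):
    """Terminal bolt: persistence/alerting logic."""
    return [tup["truck"] for tup in stream]

# Topology: spout and bolts wired together into a workflow.
alerts = alert_bolt(speeding_bolt(sensor_spout()))
print(alerts)  # ['truck-2']
```

In real Storm the same wiring is declared with a `TopologyBuilder`, and the framework distributes spout and bolt instances across the cluster.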
Distributed Database With Apache HBase
100%	Open	Source
Store	and	Process	Petabytes	of	Data
Flexible	Schema
Scale	out	on	Commodity	Servers
High	Performance,	High	Availability
Integrated	with	YARN
SQL	and	NoSQL Interfaces
YARN	:	Data	Operating	System
HBase
RegionServer
HDFS
(Permanent	Data	Storage)
HBase
RegionServer
HBase
RegionServer
Dynamic Schema
Scales Horizontally to PB of Data
Directly Integrated with Hadoop
HDP
Apache Phoenix – Relational Database Layer Over HBase
A SQL Skin for HBase
• Provides a SQL interface for managing data in HBase.
• Large subset of the SQL:1999 mandatory feature set.
• Create tables, insert and update data and perform low-latency point lookups through JDBC.
• Phoenix JDBC driver easily embeddable in any app that supports JDBC.
Phoenix Makes HBase Better
• Oriented toward online / transactional apps.
• If HBase is a good fit for your app, Phoenix makes it even better.
• Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
In-Memory With Spark
Spark
SQL
Spark
Streaming
MLlib GraphX
§ A data access engine for fast, large-scale data processing
§ Designed for iterative in-memory computations and interactive data mining
§ Provides expressive multi-language APIs for Scala, Java and Python
Spark ML for machine learning
Democratizes Machine Learning
Unsupervised tasks
• Clustering (K-means)
• Recommendation
• Collaborative Filtering: alternating least squares
• Dimensionality reduction: PCA, SVD
Supervised tasks
• Classification
• Naïve Bayes, Decision Tree, Random Forest, Gradient boosted trees
• Regression
• Linear models (SVM, linear regression, logistic regression)
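To make the clustering idea concrete, here is a minimal 1-D k-means in plain Python. This is a conceptual sketch only; Spark ML's k-means implements the same assign-then-recenter loop, but distributed across the cluster. The data points and starting centroids are invented for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centroids)

# Two well-separated groups of readings converge to their group means.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(points, centroids=[1.0, 12.0]))  # [2.0, 11.0]
```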
Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop
• Quickly analyze data in raw data files
• Proven at petabyte scale
• Compatible with all major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, and others
(Diagram: sensor, mobile and weblog data flowing into Hadoop for SQL queries, alongside operational/MPP systems)
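The style of aggregation Hive runs over raw files can be illustrated with standard SQL. The example below uses Python's built-in SQLite purely so it is runnable; Hive's DDL and dialect differ (e.g. external tables over HDFS files), and the `sensor_events` table is hypothetical.

```python
import sqlite3

# Hypothetical sensor table, standing in for a Hive table over raw files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_events (truck_id TEXT, speed INTEGER)")
conn.executemany(
    "INSERT INTO sensor_events VALUES (?, ?)",
    [("truck-1", 60), ("truck-1", 70), ("truck-2", 90)],
)

# A typical analytical aggregation: average speed per truck.
rows = conn.execute(
    "SELECT truck_id, AVG(speed) FROM sensor_events "
    "GROUP BY truck_id ORDER BY truck_id"
).fetchall()
print(rows)  # [('truck-1', 65.0), ('truck-2', 90.0)]
```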
Comparing SQL Options In HDP
Project Strengths Use	Cases Unique	Capabilities
Apache	Hive Most	comprehensive	SQL
Scale
Maturity
ETL	Offload
Reporting
Large-scale	aggregations
Robust	cost-based	optimizer
Mature	ecosystem	(BI,	
backup,	security	and	
replication)
SparkSQL In-memory
Low	latency
Exploratory	analytics
Dashboards
Language-integrated	Query
Apache	Phoenix Real-time	read	/	write
Transactions
High	concurrency
Dashboards
System-of-engagement
Drill-down	/	Drill-up
Real-time	read	/	write
Comparing Streaming Options In HDP
Apache Storm Spark	Streaming
One	At	A Time Micro	Batch	(minimum	 batch latency	=	500	ms)
Low	Latency Higher	Throughput
Operates	on	Tuple	Stream Operates	on	Streams	of	Tuple Batches
At	Least	Once	
(Trident	For	Exactly	Once)
Exactly	Once
Multiple	Language	Support Multiple	Language Support
Sizing
HDF Sizing & Best Practices Sustained Throughput
For Sustained
Throughput of 50MB/sec
and thousands of events
per second
• 1-2 nodes
• 8+ cores per node
(more is better)
• 6+ disks per node
(SSD or Spinning)
• 2 GB of mem per node
• 1 Gb bonded NICs
ideally
For Sustained
Throughput of
100MB/sec and tens of
thousands of events per
second
• 3-4 nodes
• 8+ cores per node
(more is better)
• 6+ disks per node
(SSD or Spinning)
• 2 GB of mem per node
• 1 Gb bonded NICs
ideally
For Sustained
Throughput of
200MB/sec and
hundreds of thousands
of events per second
• 5-7 nodes
• 24+ cores per node
(effective cpus)
• 12+ disks per node
(SSD or spinning)
• 4GB of mem per node
• 10 Gb bonded NICs
For Sustained
Throughput of 400-
500MB/sec and
hundreds of thousands
of events per second
• 7-10 nodes
• 24+ cores per node
(effective cpus)
• 12+ disks per node
(SSD or spinning)
• 6GB of mem per node
• 10 Gb bonded NICs
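The four sizing tiers above can be captured as a small lookup. The tier boundaries between the stated throughput points are an interpolation assumption, not Hortonworks guidance:

```python
def hdf_nodes(sustained_mb_s):
    """Map sustained throughput (MB/s) to the HDF node-count guidance
    above. Returns (min_nodes, max_nodes). Boundaries between the
    slide's stated tiers are assumed, not official."""
    if sustained_mb_s <= 50:
        return (1, 2)
    if sustained_mb_s <= 100:
        return (3, 4)
    if sustained_mb_s <= 200:
        return (5, 7)
    return (7, 10)   # the 400-500 MB/s tier

print(hdf_nodes(50))   # (1, 2)
print(hdf_nodes(450))  # (7, 10)
```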
Kafka - Sizing & Best Practices
§ Cluster Sizing – Rule of Thumb
– 10 MB/sec/node or 100,000 events/sec/node
• Higher throughput for large batch size
§ Configuration Best Practices
– Num Of Partitions = max (Total Producer Throughput / Throughput per partition, Total Consumer
Throughput / Throughput per partition)
• Over-estimate number of partitions per topic. Cannot increase partition count without breaking
message ordering guarantees
– Co-locate Kafka and Storm processes
• Storm is CPU bound while Kafka is throughput bound
• In high throughput scenarios, separate Kafka and Storm into independent nodes.
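The partition-count rule of thumb above translates directly to code. The throughput figures in the example call are illustrative assumptions, not measurements:

```python
import math

def num_partitions(producer_tp, consumer_tp, partition_tp):
    """Rule of thumb from above: partitions = max(total producer
    throughput, total consumer throughput) / throughput per
    partition, rounded up."""
    return math.ceil(max(producer_tp, consumer_tp) / partition_tp)

# Illustrative: 100 MB/s produced, 150 MB/s consumed, 10 MB/s per partition.
print(num_partitions(producer_tp=100, consumer_tp=150, partition_tp=10))  # 15
```

Over-estimating here is deliberate: as the slide notes, the partition count cannot be raised later without breaking per-key message ordering.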
Storm - Sizing & Best Practices
§ Cluster Sizing – Rule of Thumb
– 100,000 events per second per supervisor node
• Predicated on work being performed by Bolt’s execute method
• Mileage will vary by project
• Testing is critical
§ Configuration Best Practices
– 1 Worker / Machine / Topology
– 1 Executor per CPU Core
– Topology Parallelism = Num of Machines x (Num of Cores Per Machine -1 )
• Distribute total parallelism among spout and bolts to maximize topology throughput
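The parallelism rule of thumb above as code; the node and core counts in the example are illustrative assumptions:

```python
def topology_parallelism(machines, cores_per_machine):
    """Rule of thumb from above: one worker per machine per topology,
    one executor per core, leaving one core free per machine."""
    return machines * (cores_per_machine - 1)

# Illustrative: 3 supervisor nodes with 8 cores each.
print(topology_parallelism(3, 8))  # 21
```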
HBase - Sizing & Best Practices
§ Cluster Sizing – Rule of Thumb
– 10 MB/sec/node of Write Throughput
– 1-3 TB per node of compressed data (non replicated)
• HDFS volume of 6-12 TB
– Sizing = max(required ingestion rate / Write Throughput per node, Total data size/ Data Per Node)
§ Configuration Best Practices
– Region Server Size ~ 10G
– Number of Regions Per Region Server ~ 100-200
– Cluster/Pre-Split tables
– For IoT scenarios
• Consider using Hive to store raw data while using Phoenix to store aggregates
• Batch insert data to Phoenix using MapReduce
– Tailor Batch interval to application SLAs
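The HBase sizing formula above as code; the ingest rate and data volume in the example call are illustrative assumptions, while the 10 MB/s and 1-3 TB/node defaults come from the rule of thumb:

```python
import math

def hbase_nodes(ingest_mb_s, total_tb, write_tp_mb_s=10, tb_per_node=1):
    """Rule of thumb from above: nodes = max(required ingest rate /
    write throughput per node, total data size / data per node)."""
    return math.ceil(max(ingest_mb_s / write_tp_mb_s, total_tb / tb_per_node))

# Illustrative: 50 MB/s sustained writes, 12 TB compressed data, 2 TB/node.
print(hbase_nodes(ingest_mb_s=50, total_tb=12, tb_per_node=2))  # 6
```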
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them.
Ms. Brady sets the goal of
reducing incidents by 5%
within 90 days.
Incidents of maintenance vehicles have continued to increase under
COO Brady’s watch. The Department of Transportation has contacted
Mega Corporation.
(Chart: insurance premiums rising year over year, 2012 to 2015, reaching $17.5M)
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of the incidents and reduce them.
Business Analyst
Tam
Problem statement recap
Given the current premium cost of $3,500 per
truck on 5,000 trucks, a 10% reduction in
incidents will move the company from the high
risk insurance category they are currently in
and save the company $1,000 on its
insurance premium per truck per year,
or $5,000,000 annually.
Business Analyst
Tam
Problem statement recap
Sizing - Cluster Storage Requirement

Cluster Storage Required = Effective Capacity × Intermediate Size × Replication Count × Temp Space ÷ Compression Ratio

Rule of thumb
§ Replication Count: 3
§ Temp Space: ×1.2
Vary greatly
§ Intermediate/Materialized: 30-50%
§ Compression Ratio: 2-4
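The storage formula above as code. The 100 TB input and the +40% intermediate-data factor are illustrative picks from the stated ranges; replication, temp space and compression use the rule-of-thumb values:

```python
def cluster_storage(effective_tb, intermediate=1.4, replication=3,
                    temp=1.2, compression=3):
    """Raw cluster storage needed for a given amount of effective data:
    effective x intermediate (+40%) x 3x replication x 1.2x temp space,
    divided by a 3x compression ratio."""
    return effective_tb * intermediate * replication * temp / compression

# Illustrative: 100 TB of effective data -> 168 TB of raw cluster storage.
print(cluster_storage(100))
```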
Data Volume for Mega Corp
§ Number of Trucks = 5000
§ Events per second per truck = 10
§ Size of each event = 128 Bytes
§ 1 year raw sensor data storage requirements: 5000 x 10 x 128 x 60 x 60 x 24 x 365 = 200 TB
§ 5 year sensor data storage: 200TB X 5 X 1.5 (processing overhead) = 1.5 PB
§ Q: How many nodes are needed for storing 1.5PB? (answered later)
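The Mega Corp volume arithmetic above checks out in code (decimal TB/PB, i.e. 1 TB = 10^12 bytes):

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def raw_tb_per_year(trucks=5000, events_per_s=10, event_bytes=128):
    """One year of raw sensor data, in decimal terabytes."""
    return trucks * events_per_s * event_bytes * SECONDS_PER_YEAR / 1e12

one_year = raw_tb_per_year()
five_year_pb = one_year * 5 * 1.5 / 1000  # 5 years + 1.5x processing overhead

print(round(one_year))         # ~202 TB (the slide rounds to 200 TB)
print(round(five_year_pb, 1))  # ~1.5 PB
```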
HBase, Kafka, Storm and NiFi Requirements
Ingest rate = 128 Bytes × 5000 trucks × 10 events/s = 6.4 MB/s
Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed?
We will store the last 15 days of data in HBase.
HBase storage needed: 5000 × 10 × 60 × 60 × 24 × 15 × 128 = 8.2 TB
Q: How many HBase nodes are needed for 8.2 TB of storage?
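Both figures are easy to verify (note 128 B x 5000 x 10/s is megabytes per second, not kilobytes):

```python
def ingest_mb_per_s(trucks=5000, events_per_s=10, event_bytes=128):
    """Sustained fleet ingest rate, in MB/s."""
    return trucks * events_per_s * event_bytes / 1e6

def hbase_tb(days=15, trucks=5000, events_per_s=10, event_bytes=128):
    """Hot data kept in HBase, in decimal TB."""
    return trucks * events_per_s * event_bytes * 60 * 60 * 24 * days / 1e12

print(ingest_mb_per_s())     # 6.4 MB/s
print(round(hbase_tb(), 1))  # ~8.3 TB (the slide rounds down to 8.2)
```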
Sizing - Number Of Worker Nodes for Sensor Data
§ # of Worker Nodes = Total Cluster Storage ÷ Storage Per Server = 1.5 PB ÷ 48 TB = 32
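The worker-node calculation above as code, rounding up to whole servers:

```python
import math

def worker_nodes(total_storage_tb, storage_per_server_tb=48):
    """Worker node count = total cluster storage / usable storage per server."""
    return math.ceil(total_storage_tb / storage_per_server_tb)

print(worker_nodes(1500))  # 1.5 PB / 48 TB per server -> 32 nodes
```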
Sizing – NiFi, Kafka, Hbase and Storm Nodes
DataNodes & HBase: 32
NiFi: 2
Kafka & Storm Ingest Nodes: 3
Client Nodes: 2
Master Nodes: 5
Total: 44
§ Recall that:
§ NiFi can collect @ 50 MB/s/node
§ Kafka can ingest @10MB/s/node or 100,000 events/s/node
§ Storm can process @ 100,000 events/s/node
§ Each HBase Region Server can store 1TB
§ So for a 6.4 MB/s ingest rate: 1 NiFi, 1 Kafka and 1 Storm node are sufficient.
§ We will use 2 NiFi & 3 Kafka nodes for HA.
§ HBase nodes needed = 8.2 TB ÷ 1 TB per Region Server ≈ 8 nodes
§ Co-locate Kafka and Storm.
§ Co-locate DataNode and HBase.
NiFi 1
NiFi 2
Storm 1
Kafka 1
Storm 2
Kafka 2
Storm 3
Kafka 3
DataNode 1
HBase 1
Truck 1
Truck 2
Truck 3
Truck
5000
NiFi Nodes
Edge Nodes
Master NodesClients 1
Clients 2
DataNode 2
Hbase 2
DataNode 3
Hbase 3
DataNode 4
Hbase 4
DataNode 5
Hbase 5
DataNode 6
Hbase 6
DataNode 7
Hbase 7
DataNode 8
Hbase 8
DataNode 9 DataNode 10
DataNode 31 DataNode 32
Master 1
Master 2
Master 3
Master 4
Master 5
Worker Nodes
HDF
HDP
World
Megacorp
Datacenter
Ingest Node 1Master Node 4
StormHiveserver
WebHCat
Falcon
Worker Node 1
Node
Manager
Datanode
hBase
Region
Worker Node 2
Node
Manager
Datanode
hBase
Region
Worker Node 3
Node
Manager
Datanode
hBase
Region
Worker Node 4
Node
Manager
Datanode
hBase
Region
Worker Node 5
Node
Manager
Datanode
hBase
Region
hBase
Master 1
Master Node 3Master Node 2Master Node 1
Namenode
1
Zookeeper
Oozie
Zookeeper
Namenode
2
Resource
Manager 1
Zookeeper
History
Server
Timeline
Server
Hiveserver
2
JournalNode
JournalNode
JournalNode
Resource
Manager 2
hBase
Master 2
Kafka
Master Node 5
Zookeeper
History
Server
Ambari
Monitoring
& Metrics
Worker Node 32
Node
Manager
Datanode
hBase
Region
Ingest Node 2
Storm
Kafka
Ingest Node 3
Storm
Kafka
Edge Node 1
Clients
Knox
Edge Node 2
Clients
Knox
HDP Service Layout
Master Node Specs
12 + Cores
128 - 256 GB RAM
(1 X 256GB SSD Drive for OS)
(2 X 1TB Drives)
2 X 1 – 10 Gb Switch
Approximate Cost Per Node $8,000 - $18,000
NiFi Nodes Specs
8+ Cores
16 GB RAM
(1 X 256GB SSD Drive for OS)
(2 X 1TB Drives)
2 X 1 – 10 Gb Switch
Approximate Cost Per Node $5,000 - $8,000
Slave (Worker) Node Specs
12+ cores
32–64 GB RAM
Drives, depending on workload profile:
12 × 1 TB SATA drives (processing/IOPS optimized), or
12 × 2 TB SATA drives (balanced), or
12 × 4 TB SATA drives (storage optimized)
1 × 1–10 Gb switch connection
Approximate cost per node: $5,000–$12,000
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Project Cost & ROI
Project Plan
Strategy
10 days
Training
10 days
Design & Build
60 days
Test
30 days
Promote
10 days
Use Case Workshop
Cluster Build-out
Solution Build-out
Prove-out
Promote Solution
Tam puts together a quick project plan and estimates it will take 120 days to deliver Ms. Brady her solution.
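The five phase durations above do sum to the quoted 120 days, which a one-line check confirms:

```python
# Phase durations (days) from the project plan above.
phases = {"Strategy": 10, "Training": 10, "Design & Build": 60,
          "Test": 30, "Promote": 10}
total_days = sum(phases.values())
print(total_days)  # 120
```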
Resource Plan
§ Data Scientist (consultant): Tam
§ Data Flow (consultant): Varun
§ Architect (consultant): Jeff
§ Developer (consultant): Sue
§ Project Manager: Jen
§ Engagement Manager (consultant): Jim
§ Enterprise Architect: Frank
§ Business Analyst: Sue
§ Developer: Jim
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Page 76 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Project Cost & ROI
Project Cost

Component | Quantity | Unit Cost | Total Cost
Hardware | 44 | $10,000 | $440K
Software – HDP | 11 SKUs | $18,000/SKU | $198K
Software – HDF | 2 SKUs | $36,000/SKU | $72K
Dev & Test Consulting | 3,040 hrs* | $300/hr | $912K
Engagement Consulting | 360 hrs* | $300/hr | $108K
Training | 30** | $2,500 | $75K
Travel & Expense | – | – | $100K
Total | | | $1.905M

* 4 resources × 8 hrs/day × 95 days; engagement manager for 45 days
** Admin, Analyst & Data Science training for 30 associates
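Rolling up the line items programmatically (figures taken from the cost table above) gives a project cost of $1,905,000, and subtracting it from the $5M insurance saving yields roughly $3.1M in first-year savings:

```python
# Line items from the project cost table.
line_items = {
    "Hardware (44 @ $10,000)":                      44 * 10_000,
    "Software - HDP (11 SKUs @ $18,000)":           11 * 18_000,
    "Software - HDF (2 SKUs @ $36,000)":             2 * 36_000,
    "Dev & test consulting (3,040 hrs @ $300/hr)": 3_040 * 300,
    "Engagement consulting (360 hrs @ $300/hr)":     360 * 300,
    "Training (30 seats @ $2,500)":                 30 * 2_500,
    "Travel & expense":                            100_000,
}
project_cost = sum(line_items.values())          # 1,905,000
first_year_savings = 5_000_000 - project_cost    # 3,095,000 (~$3.1M)
print(f"${project_cost:,}", f"${first_year_savings:,}")
```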
Project ROI
§ Insurance cost reduction: $5M
§ Project cost: $1.905M
§ First-year savings: ≈ $3.1M
Thank You
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Solving Big Data Problems using Hortonworks

  • 1. Solving Big Data Problems using Hortonworks © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  • 2. Hortonworks Company Profile: the ONLY 100% open source Apache Hadoop data platform. Founded in 2011. 1st Hadoop provider to go public (IPO 4Q14, NASDAQ: HDP). 800+ employees across 17 countries. 1,350 technology partners. Fastest company to reach $100M in revenue.
  • 3. Let’s talk about Big Data (September 2014 survey of 100 CIOs from the US and Europe)
  • 4. What problems and opportunities does Big Data create? NEW: data that traditional platforms cannot handle. TRADITIONAL: existing enterprise data. The Opportunity: unlock transformational business value from a full fidelity of data and analytics for all data. Traditional Data Sources: ERP, CRM, SCM; files & emails; server logs. New Data Sources: geolocation, sensors and machines, clickstream, social media.
  • 5. The Future of Data: Actionable Intelligence. Data in motion from the Internet of Anything flows into storage, where it becomes data at rest shared across groups.
  • 6. Hortonworks Data Platform H O R T O N W O R K S D ATA P L AT F O R M Batch Interactive Search Streaming Machine Learning YARN Resource Management System CLICKSTREAM SENSOR SOCIAL MOBILE GEOLOCATIONS SERVER LOG EXISTING
  • 7. HDP is a collection of Apache Projects. Hortonworks Data Platform components: Hadoop & YARN, Flume, Oozie, Pig, Hive, Tez, Sqoop, Cloudbreak, Ambari, Slider, Kafka, Knox, Solr, ZooKeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm and Phoenix, spanning Data Mgmt, Data Access, Governance & Integration, Operations and Security. Component versions advance with each release: HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), HDP 2.3 (July 2015). Ongoing innovation in Apache.
  • 8. Hortonworks Data Flow Visual User Interface Drag and drop for efficient, agile operations Immediate Feedback Start, stop, tune, replay dataflows in real-time Adaptive to Volume and Bandwidth Any data, big or small Event Level Data Provenance Governance, compliance & data evaluation Secure Data Acquisition & Transport Fine grained encryption for controlled data sharing and selective data democratization Powered by Apache NiFi
  • 9. HDF and HDP Deliver a Complete Big Data Solution • HDF dynamically connects HDP to data at the edge • HDF secures and encrypts the movement of data into HDP • HDF includes mature IoAT data protocols that improve device extensibility • HDF supports easily adjustable bi-directional IoAT dataflows • HDF offers traceability of IoAT data with lineage and audit trails • HDF brings a real-time, visual user interface to manipulate live dataflows
  • 10. Hortonworks Revenue Model. HDP and HDF are 100% free and Open Source, with no license. Our customers subscribe to support, consulting experts and training programs. Annual Subscriptions align your success with ours. Expert Consulting & Training help your team get to actionable intelligence as efficiently as possible. ARCHITECT & DEVELOP, DEPLOY, OPERATE, then EXPAND across Projects 1 through 6.
  • 12. Hadoop Driver: Cost optimization. Archive data off EDW: move rarely used data to Hadoop as active archive, store more data longer. Offload costly ETL process: free your EDW to perform high-value functions like analytics & operations, not ETL. Enrich the value of your EDW: use Hadoop to refine new data sources, such as web and machine data, for new analytical context. HDP helps you reduce costs and optimize the value associated with your EDW. ANALYTICS: data marts, business analytics, visualization & dashboards. DATA SYSTEMS: Enterprise Data Warehouse (hot), MPP, in-memory, plus HDP 2.3 for cold data, deeper archive & new sources. SOURCES: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured.
  • 13. Single View Improve acquisition and retention Predictive Analytics Identify your next best action Data Discovery Uncover new findings Financial Services New Account Risk Screens Trading Risk Insurance Underwriting Improved Customer Service Insurance Underwriting Aggregate Banking Data as a Service Cross-sell & Upsell of Financial Products Risk Analysis for Usage-Based Car Insurance Identify Claims Errors for Reimbursement Telecom Unified Household View of the Customer Searchable Data for NPTB Recommendations Protect Customer Data from Employee Misuse Analyze Call Center Contacts Records Network Infrastructure Capacity Planning Call Detail Records (CDR) Analysis Inferred Demographics for Improved Targeting Proactive Maintenance on Transmission Equipment Tiered Service for High-Value Customers Retail 360° View of the Customer Supply Chain Optimization Website Optimization for Path to Purchase Localized, Personalized Promotions A/B Testing for Online Advertisements Data-Driven Pricing, improved loyalty programs Customer Segmentation Personalized, Real-time Offers In-Store Shopper Behavior Manufacturing Supply Chain and Logistics Optimize Warehouse Inventory Levels Product Insight from Electronic Usage Data Assembly Line Quality Assurance Proactive Equipment Maintenance Crowdsource Quality Assurance Single View of a Product Throughout Lifecycle Connected Car Data for Ongoing Innovation Improve Manufacturing Yields Healthcare Electronic Medical Records Monitor Patient Vitals in Real-Time Use Genomic Data in Medical Trials Improving Lifelong Care for Epilepsy Rapid Stroke Detection and Intervention Monitor Medical Supply Chain to Reduce Waste Reduce Patient Re-Admittance Rates Video Analysis for Surgical Decision Support Healthcare Analytics as a Service Oil & Gas Unify Exploration & Production Data Monitor Rig Safety in Real-Time Geographic exploration DCA to Slow Well Declines Curves Proactive Maintenance for Oil Field Equipment Define Operational Set Points 
for Wells. Government: Single View of Entity; CBM & Autonomic Logistics Analysis; Sentiment Analysis on Program Effectiveness; Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting. Hadoop Driver: Advanced analytic applications
  • 14. NiFi and HDF Drivers Optimize Splunk: Reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk Ingest Logs for Cyber Security: Integrated and secure log collection for real-time data analytics and threat detection Feed Data to Streaming Analytics: Accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming Move Data Internally: Optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure Capture IoT Data: Transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power or connectivity, avoiding data loss
  • 15. Hadoop Driver: Enabling the data lake (growing in both scale and scope). Data Lake Definition: • Centralized Architecture: multiple applications on a shared data set with consistent levels of service • Any App, Any Data: multiple applications accessing all data, affording new insights and opportunities • Unlocks ‘Systems of Insight’: advanced algorithms and applications used to derive new value and optimize existing value. Drivers: 1. Cost Optimization 2. Advanced Analytic Apps. Goal: • Centralized Architecture • Data-driven Business. Journey to the Data Lake with Hadoop: Systems of Insight.
  • 16. Case Study: 12-month Hadoop evolution at TrueCar. Data platform capabilities, 12-month execution plan: June 2013 Begin Hadoop Execution; July 2013 Hortonworks Partnership; Aug 2013 Training & Dev Begins; Nov 2013 Production Cluster, 60 Nodes / 2 PB; Dec 2013 Three Production Apps (3 total); Jan 2014 40% Dev Staff Perficient; Feb 2014 Three More Production Apps (6 total); May 2014 IPO. 12-Month Results at TrueCar: • Six Production Hadoop Applications • Sixty nodes / 2 PB data • Storage/Compute Costs from $19/GB to $0.12/GB. “We addressed our data platform capabilities strategically as a precursor to IPO.”
  • 18. Hadoop emerged as foundation of new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data • Built by Yahoo! to be the heartbeat of its ad & search business • Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises • Incredibly disruptive to current platform economics. Traditional Hadoop Advantages: ✓ Manages new data paradigm ✓ Handles data at scale ✓ Cost effective ✓ Open source. Traditional Hadoop Had Limitations: batch-only architecture; single-purpose clusters tied to specific data sets; difficult to integrate with existing investments; not enterprise-grade. Application storage: HDFS; batch processing: MapReduce.
  • 19. 2006: Hadoop with MapReduce: HDFS (Hadoop Distributed File System) plus MapReduce, largely batch processing; siloed clusters, largely a batch system, difficult to integrate. 2009: MR-279: YARN. Hadoop 2 & YARN-based architecture: YARN (Data Operating System) over HDFS, supporting batch, interactive and real-time workloads. Architected & led development of YARN to enable the Modern Data Architecture (October 23, 2013).
  • 20. Apache Hadoop – Data Operating System Shared Compute & Workload Management • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases Common & Shared Scale Out Storage • Shared data assets • Flexible schema • Cross workload access YARN: Data Operating System (Cluster Resource Management) 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° Script Pig SQL Hive TezTez Java Scala Cascading Tez ° ° ° ° ° ° ° ° ° ° ° ° ° ° Others ISV Engines HDFS (Hadoop Distributed File System) Stream Storm Search Solr NoSQL HBase Accumulo Slider Slider BATCH, INTERACTIVE & REAL-TIME DATA ACCESS In-Memory Spark Enterprise Hadoop
  • 21. Core Capabilities of Enterprise Hadoop. GOVERNANCE & INTEGRATION: load data and manage according to policy. OPERATIONS: deploy and effectively manage the platform. DATA MANAGEMENT: store and process all of your corporate data assets. DATA ACCESS: access your data simultaneously in multiple ways (batch, interactive, real-time). SECURITY: provide a layered approach to security through authentication, authorization, accounting, and data protection. PRESENTATION & APPLICATION: enable both existing and new applications to provide value to the organization. ENTERPRISE MGMT & SECURITY: empower existing operations and security tools to manage Hadoop. DEPLOYMENT OPTIONS: provide deployment choice across physical, virtual, and cloud.
  • 22. Hortonworks Data Platform 2.3. YARN: Data Operating System. Security: administration, authentication, authorization, auditing, data protection (Ranger, Knox, Atlas, HDFS encryption). Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS. Operations: provisioning, managing & monitoring (Ambari, Cloudbreak, ZooKeeper); scheduling (Oozie). Data Access: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV engines), running on Tez and Slider. Data Management: HDFS (Hadoop Distributed File System). Data Lifecycle & Governance: Falcon, Atlas. Deployment Choice: Linux, Windows, On-Premise, Cloud.
  • 24. Basic EDW Cost Optimization Architecture Batch Sqoop Transform Processed Hive Raw HDFS Interactive HiveServer Reporting BI Tools Load EDW Existing Analytics Fetch 1 2 3 4 External Tables
  • 25. More than saving cost: Enrich With New Data Batch Sqoop Transform Processed Hive Raw HDFS Interactive HiveServer Reporting BI Tools Load EDW New Sources Streaming NiFi Load Existing Analytics Fetch New Analytics 1 2 3 4 5 6 External Tables
  • 26. Streaming Solution Architecture: real-time data feeds enter via Apache Kafka with streaming ingest into the HDP 2.x data lake (YARN, HDFS). Real-Time Stream Processing: Storm. Online Data Processing: HBase, Accumulo. Search: Solr on Slider. SQL: Hive.
  • 27. Key Tenets of Lambda Architecture § Batch Layer: manages master data; immutable, append-only set of raw data; cleanse, normalize & pre-compute batch views; advanced statistical calculations § Speed Layer: real-time event stream processing; computes real-time views § Serving Layer: low-latency, ad-hoc query; reporting, BI & dashboards. New data flows to both the batch layer (store, pre-compute views) and the speed layer (process streams, incremental views); queries hit business views in the serving layer. HDP and HDF High Level Big Data IoT Architecture
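The three layers above can be sketched as a toy in Python: the batch layer pre-computes views over the immutable master dataset, the speed layer keeps incremental views over events not yet absorbed by a batch run, and the serving layer merges both at query time. The class and event shape are purely illustrative, not part of any Hortonworks API.

```python
from collections import Counter

class LambdaToy:
    """Minimal sketch of the Lambda pattern: batch + speed + serving layers."""

    def __init__(self):
        self.master = []              # batch layer: immutable, append-only raw data
        self.batch_view = Counter()   # pre-computed batch view (counts per key)
        self.speed_view = Counter()   # real-time view of events since the last batch run

    def ingest(self, key):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from all raw data; reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch view and the real-time view."""
        return self.batch_view[key] + self.speed_view[key]

lake = LambdaToy()
for k in ["truck-7", "truck-7", "truck-9"]:
    lake.ingest(k)
print(lake.query("truck-7"))  # 2, served from the speed layer alone
lake.run_batch()
lake.ingest("truck-7")
print(lake.query("truck-7"))  # 3 = 2 (batch view) + 1 (speed view)
```

The key property the toy preserves is that raw data is never mutated: a bug in a view can always be fixed by recomputing from the master dataset.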
  • 28. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  • 29. www.hortonworks.com Mega Corp has a problem: incidents of maintenance vehicles have continued to increase under COO Brady’s watch, and insurance premiums have climbed year over year from 2012 to 2015 (reaching $17.5M). Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days and tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.
  • 30. www.hortonworks.com Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company from the high risk insurance category they are currently in and save the company $1000 on their insurance premium per truck per year or $5,000,000 annually. Business Analyst Tam
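Tam's estimate is straightforward arithmetic; the sketch below makes the slide's figures explicit, including that total premiums line up with the $17.5M shown on the premiums chart:

```python
trucks = 5000
premium_per_truck = 3500   # current annual premium per truck, USD
saving_per_truck = 1000    # expected premium reduction after leaving the high-risk category

total_premiums = trucks * premium_per_truck   # $17,500,000: the "17.5M" on the premiums chart
annual_saving = trucks * saving_per_truck
print(f"${annual_saving:,} saved per year")   # $5,000,000 saved per year
```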
  • 31. www.hortonworks.com Tam considers four questions she must answer to better understand and mitigate incidents. They are: 1) Is there a correlation of driver training to incidents? 2) Is there a correlation of weather to incidents? 3) Is there a correlation between certain driving behavior and incidents? 4) Is it possible to predict incidents before they occur? Shift from Reactive to Proactive & Prescriptive: from reaction to human activity, to behavioral insight; from static resource planning, to resource optimization; from break-then-fix, to preventative maintenance.
  • 32. www.hortonworks.com Initially, Tam’s team is concerned that they may not be able to capture all the necessary data to answer the questions Tam has posed and help her mitigate incidents. They know that the data is not all structured and some of it is created in real-time and transmitted over the Internet. In addition, some data will have to be captured from external sources. Vehicle Data Route Data Weather Data Structured Driver Data Semi-Structured Maintenance Data SueVarun Jeff
  • 33. The Team Recognizes the Current Data Architecture Limits Predictive Capabilities: 1. Data Silos: difficult to find predictive correlations 2. Data Volumes: cannot store enough data to find patterns 3. New Data Sources: unable to capture and use new data for real-time analysis. DATA SYSTEMS: Enterprise Data Warehouse (hot), MPP, in-memory; Systems of Record: RDBMS, ERP, CRM. New sources: clickstream, web & social, geolocation, sensor & machine, server logs, unstructured. ANALYTICS: data marts, business analytics, visualization & dashboards.
  • 34. Page 34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved DATASYSTEMS Enterprise Data Warehouse Hot MPP In-Memory RDBMS ERPCRM Systems of Record The Team Leverages HDF & HDP to Expand The Capabilities of Their Existing Data Platform ANALYTICS Data Marts Business Analytics Visualization & Dashboards
  • 35. www.hortonworks.com The team then engages their favorite SI and attends Hortonworks University training to get the project under way: Business Analyst (Tam) + HDP Data Analyst Training = HDP Data Analyst; Developer (Varun) + Developer Training = HDP Developer; System Admin (Sue) + HDP System Admin Training = HDP Sys Admin; SME (Jeff) + Data Science Training = HDP Data Scientist.
  • 36. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  • 37. Solution Architecture: Stream Processing & Modeling (Kafka, Storm & Spark); Distributed Storage: HDFS; Many Workloads: YARN; Real-time Serving & Searching (HBase); Alerts & Events; Real-Time Web App; Interactive Query: SQL (Hive on Tez); a single cluster with consistent security, governance & operations; Collect, Conduct & Curate (HDF: bidirectional data flow) from truck sensors, weather data, and the EDW via Sqoop. The chosen solution provides Mega Corp with the foundation to capture all the required data, analyze correlations, and ultimately create a model that allows them to predict and mitigate incidents before they happen.
  • 38. www.hortonworks.com Tam (Analyst) and Varun (Developer) build the application on HDP.
  • 39. www.hortonworks.com Ms. Brady is happy with the results. She is able to determine that a subset of drivers is responsible for the increased cost. But like most managers she is not happy for long. Now she wants to be able to predict future incidents. Data Scientist Jeff points out that HDP has a tremendous machine learning and statistical algorithm library, and he can use these libraries to predict which drivers are likely to have an event before it occurs.
  • 40. www.hortonworks.com Jeff implements predicted violations logic using HDP Machine Learning and is able to predict events before they happen
  • 41. www.hortonworks.com Ms. Brady is happy now that she can isolate where problems exist, identify causal events and build models that help predict events before they occur.
  • 42. www.hortonworks.com < TODO: Show St. Louis Case Study > http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
  • 43. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  • 44. Big Data Functional Architecture: Key Tenets of Lambda Architecture § Batch Layer: manages master data; immutable, append-only set of raw data; cleanse, normalize & pre-compute batch views; advanced statistical calculations § Speed Layer: real-time event stream processing; computes real-time views § Serving Layer: low-latency, ad-hoc query; reporting, BI & dashboards. New data flows to both the batch layer (store, pre-compute views) and the speed layer (process streams, incremental views); queries hit business views in the serving layer. HDP and HDF High Level Big Data IoT Architecture
  • 45. Detailed Reference Architecture for IoT Applications. SOURCE DATA: server logs, application logs, firewall logs, CRM/ERP, sensors. High-speed ingest: stream to HDF, forward to Storm via Kafka; Sqoop and Flume for batch sources. Real-Time: Storm/Spark Streaming with event enrichment, alerts (JMS), bolts sinking to HDFS, and real-time storage in HBase/Phoenix. Batch: Flume sink to HDFS, transform with Pig, iterative ML with Spark-ML, machine learning models. Interactive: Hive on HDFS, HiveServer and Spark-Thrift serving reporting, BI tools, dashboards (Silk) and an interactive UI framework.
  • 46. Page 46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sample Ingest: NiFi
  • 47. Apache Storm – Key Attributes. Open source, real-time event stream processing platform that provides fast, continuous, low-latency processing for very high frequency streaming data. Highly scalable: horizontally scalable like Hadoop, e.g. a 10-node cluster can process 1M tuples per second. Fault-tolerant: automatically reassigns tasks on failed nodes. Guarantees processing: supports at-least-once & exactly-once processing semantics. Language agnostic: processing logic can be defined in any language. Apache project: brand, governance & a large active community.
  • 48. Page 48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Storm - Basic Concepts Spouts: Generate streams. Tuple: Most fundamental data structure and is a named list of values that can be of any datatype Streams: Groups of tuples Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts Tuple Tree: First spout tuple and all the tuples that were emitted by the bolts that processed it Topology: Group of spouts and bolts wired together into a workflow Topology
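The spout/bolt/topology vocabulary above can be illustrated with a small Python simulation. Real Storm topologies are written against the Storm APIs (typically Java, with multi-language adapters); the generator-based spout and bolt names here are stand-ins chosen only to show how tuples flow from a spout through chained bolts.

```python
def sensor_spout(readings):
    """Spout: generates a stream of tuples (named lists of values)."""
    for truck_id, speed in readings:
        yield {"truck": truck_id, "speed": speed}

def speeding_bolt(stream, limit=65):
    """Bolt: processing logic; emits tuples for downstream bolts."""
    for t in stream:
        if t["speed"] > limit:
            yield {"truck": t["truck"], "violation": "speeding"}

def alert_bolt(stream):
    """Terminal bolt: persistence/alerting logic (here, just collect)."""
    return [f"ALERT {t['truck']}: {t['violation']}" for t in stream]

# Topology: spout and bolts wired together into a workflow
readings = [("truck-7", 72), ("truck-9", 55), ("truck-3", 80)]
alerts = alert_bolt(speeding_bolt(sensor_spout(readings)))
print(alerts)  # ['ALERT truck-7: speeding', 'ALERT truck-3: speeding']
```

In Storm proper, each emitted tuple is tracked in a tuple tree anchored at the spout tuple, which is how the at-least-once guarantee is enforced; the simulation above skips acking entirely.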
  • 49. Distributed Database With Apache HBase: 100% open source; store and process petabytes of data; flexible schema; scale out on commodity servers; high performance, high availability; integrated with YARN; SQL and NoSQL interfaces. HBase RegionServers run under YARN (the data operating system) over HDFS for permanent data storage: dynamic schema, scales horizontally to PBs of data, directly integrated with Hadoop (HDP).
  • 50. Apache Phoenix – Relational Database Layer Over HBase. A SQL Skin for HBase: • Provides a SQL interface for managing data in HBase. • Supports a large subset of the SQL:1999 mandatory feature set. • Create tables, insert and update data and perform low-latency point lookups through JDBC. • The Phoenix JDBC driver is easily embeddable in any app that supports JDBC. Phoenix Makes HBase Better: • Oriented toward online / transactional apps. • If HBase is a good fit for your app, Phoenix makes it even better. • Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
  • 51. In-Memory With Spark: Spark SQL, Spark Streaming, MLlib, GraphX. § A data access engine for fast, large-scale data processing § Designed for iterative in-memory computations and interactive data mining § Provides expressive multi-language APIs for Scala, Java and Python
  • 52. Page 52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Spark ML for machine learning Democratizes Machine Learning Unsupervised tasks • Clustering (K-means) • Recommendation • Collaborative Filtering: alternating least squares • Dimensionality reduction: PCA, SVD Supervised tasks • Classification • Naïve Bayes, Decision Tree, Random Forest, Gradient boosted trees • Regression • Linear models (SVM, linear regression, logistic regression)
  • 53. Page 53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache Hive: SQL in Hadoop • Created by a team at Facebook • Provides a standard SQL interface to data stored in Hadoop • Quickly analyze data in raw data files • Proven at petabyte scale • Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc… SensorMobile Weblog Operational / MPP SQL Queries
  • 54. Page 54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Comparing SQL Options In HDP Project Strengths Use Cases Unique Capabilities Apache Hive Most comprehensive SQL Scale Maturity ETL Offload Reporting Large-scale aggregations Robust cost-based optimizer Mature ecosystem (BI, backup, security and replication) SparkSQL In-memory Low latency Exploratory analytics Dashboards Language-integrated Query Apache Phoenix Real-time read / write Transactions High concurrency Dashboards System-of-engagement Drill-down / Drill-up Real-time read / write
  • 55. Page 55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Comparing Streaming Options In HDP Apache Storm Spark Streaming One At A Time Micro Batch (minimum batch latency = 500 ms) Low Latency Higher Throughput Operates on Tuple Stream Operates on Streams of Tuple Batches At Least Once (Trident For Exactly Once) Exactly Once Multiple Language Support Multiple Language Support
  • 56. Page 56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing
  • 57. HDF Sizing & Best Practices (sustained throughput). For 50 MB/sec and thousands of events per second: 1-2 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; 1 Gb bonded NICs ideally. For 100 MB/sec and tens of thousands of events per second: 3-4 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; 1 Gb bonded NICs ideally. For 200 MB/sec and hundreds of thousands of events per second: 5-7 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 4 GB of memory per node; 10 Gb bonded NICs. For 400-500 MB/sec and hundreds of thousands of events per second: 7-10 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 6 GB of memory per node; 10 Gb bonded NICs.
  • 58. Page 58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Kafka - Sizing & Best Practices § Cluster Sizing – Rule of Thumb – 10 MB/sec/Node or 100,000/sec/Node • Higher throughput for large batch size § Configuration Best Practices – Num Of Partitions = max (Total Producer Throughput / Throughput per partition, Total Consumer Throughput / Throughput per partition) • Over-estimate number of partitions per topic. Cannot increase partition count without breaking message ordering guarantees – Collocate Kafka and Storm process • Storm is CPU bound while Kafka is throughput bound • In high throughput scenarios, separate Kafka and Storm into independent nodes.
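The partition-count rule of thumb above can be written directly. The per-partition throughput figures are measurements you would take on your own hardware; the numbers in the example are invented for illustration.

```python
import math

def num_partitions(total_producer_mbps, total_consumer_mbps,
                   producer_mbps_per_partition, consumer_mbps_per_partition):
    """Num partitions = max(producer need, consumer need), each rounded up.
    Over-estimating is cheap; growing partitions later breaks ordering guarantees."""
    by_producer = math.ceil(total_producer_mbps / producer_mbps_per_partition)
    by_consumer = math.ceil(total_consumer_mbps / consumer_mbps_per_partition)
    return max(by_producer, by_consumer)

# e.g. 120 MB/s produced, 240 MB/s consumed (multiple consumer groups),
# with one partition sustaining 10 MB/s write and 20 MB/s read:
print(num_partitions(120, 240, 10, 20))  # 12
```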
  • 59. Page 59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Storm - Sizing & Best Practices § Cluster Sizing – Rule of Thumb – 100,000 events per second per supervisor node • Predicated on work being performed by Bolt’s execute method • Mileage will vary by project • Testing is critical § Configuration Best Practices – 1 Worker / Machine / Topology – 1 Executor per CPU Core – Topology Parallelism = Num of Machines x (Num of Cores Per Machine -1 ) • Distribute total parallelism among spout and bolts to maximize topology throughput
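The topology-parallelism rule of thumb from this slide, as a one-line helper; the cluster shape in the example is made up, and as the slide notes, real numbers must come from testing.

```python
def topology_parallelism(machines, cores_per_machine):
    """Rule of thumb: one executor per CPU core, reserving one core
    per machine, so parallelism = machines x (cores per machine - 1)."""
    return machines * (cores_per_machine - 1)

# e.g. a 5-node cluster of 16-core supervisor machines:
print(topology_parallelism(5, 16))  # 75 executors to distribute among spouts and bolts
```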
  • 60. Page 60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HBase - Sizing & Best Practices § Cluster Sizing – Rule of Thumb – 10 MB/s/node of write throughput – 1-3 TB per node of compressed, non-replicated data • HDFS volume of 6-12 TB – Sizing = max(required ingest rate / write throughput per node, total data size / data per node) § Configuration Best Practices – Region size ~ 10 GB – Number of regions per region server ~ 100-200 – Create pre-split tables – For IoT scenarios • Consider using Hive to store raw data while using Phoenix to store aggregates • Batch-insert data into Phoenix using MapReduce – Tailor the batch interval to application SLAs
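The HBase sizing formula is a simple max over two constraints: ingest-bound and storage-bound node counts. A minimal sketch, with defaults taken from the deck's rules of thumb (the function itself is illustrative, not a Hortonworks tool):

```python
from math import ceil

def hbase_nodes(ingest_mb_per_sec, total_data_tb,
                write_tput_mb_per_node=10, data_tb_per_node=1):
    """Slide formula: max(required ingest rate / write throughput per node,
    total data size / data per node), rounded up to whole nodes. Defaults
    follow the deck: 10 MB/s writes per node, 1 TB per region server."""
    return max(ceil(ingest_mb_per_sec / write_tput_mb_per_node),
               ceil(total_data_tb / data_tb_per_node))

# Mega Corp's numbers from later slides: 6.4 MB/s ingest, ~8.2 TB retained.
print(hbase_nodes(6.4, 8.2))  # -> 9 (the deck rounds this to 8 nodes)
```

Storage, not ingest rate, is the binding constraint in this case study.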
  • 61. www.hortonworks.com Incidents involving maintenance vehicles have continued to increase under COO Brady's watch, and the Department of Transportation has contacted Mega Corporation. Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days. [Chart: insurance premiums rising year over year, 2012-2015; $17.5M shown.] Ms. Brady tasks her Business Analyst, Tam, with gathering the data needed to understand the cause of the incidents and reduce them. Business Analyst Tam Problem statement recap
  • 62. www.hortonworks.com Given the current premium cost of $3,500 per truck across 5,000 trucks, a 10% reduction in incidents will move the company out of its current high-risk insurance category, saving $1,000 per truck per year in premiums, or $5,000,000 annually. Business Analyst Tam Problem statement recap
  • 63. Page 63 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing - Cluster Storage Requirement = (Effective Capacity × Intermediate Size × Replication Count × Temp Space) / Compression Ratio § Rule of thumb: Replication Count = 3; Temp Space = ×1.2 § Vary greatly: Intermediate/Materialized Size = 30-50%; Compression Ratio = 2-4×
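The storage formula can be checked numerically. In the sketch below the defaults are mid-range assumptions within the slide's stated ranges (40% intermediate growth, 3× compression), not values the deck mandates.

```python
def cluster_storage_tb(raw_data_tb, intermediate=0.4, replication=3,
                       temp=1.2, compression=3):
    """Slide formula: (raw x intermediate growth x replication x temp space)
    / compression ratio. Intermediate share (30-50%) and compression ratio
    (2-4x) vary greatly; the defaults here are mid-range assumptions."""
    return raw_data_tb * (1 + intermediate) * replication * temp / compression

# 1 PB of raw data under these assumptions:
print(round(cluster_storage_tb(1000), 1))  # -> 1680.0 TB
```

Note that with a good compression ratio, replicated cluster capacity can end up not much larger than the raw uncompressed data.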
  • 64. Page 64 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Data Volume for Mega Corp § Number of trucks = 5000 § Events per second per truck = 10 § Size of each event = 128 bytes § 1-year raw sensor data storage requirement: 5000 × 10 × 128 × 60 × 60 × 24 × 365 ≈ 200 TB § 5-year sensor data storage: 200 TB × 5 × 1.5 (processing overhead) = 1.5 PB § Q: How many nodes are needed to store 1.5 PB? (answered later)
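The arithmetic above can be verified in a few lines; this simply replays the slide's calculation.

```python
TRUCKS = 5000
EVENTS_PER_SEC = 10
EVENT_BYTES = 128
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# One year of raw sensor data, in TB (decimal units).
raw_year_tb = TRUCKS * EVENTS_PER_SEC * EVENT_BYTES * SECONDS_PER_YEAR / 1e12
# Five years, with the slide's 1.5x processing overhead, in PB.
five_year_pb = raw_year_tb * 5 * 1.5 / 1000

print(round(raw_year_tb))      # -> 202, the slide's "~200 TB"
print(round(five_year_pb, 2))  # -> 1.51, the slide's "1.5 PB"
```

The slide's rounded figures check out against the exact calculation.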
  • 65. Page 65 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HBase, Kafka, Storm and NiFi Requirements Ingest rate = 128 bytes × 5000 trucks × 10 events/s = 6.4 MB/s Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed? We will store the last 15 days of data in HBase. HBase storage needed: 5000 × 10 × 60 × 60 × 24 × 15 × 128 bytes ≈ 8.2 TB Q: How many HBase nodes are needed for 8.2 TB of storage?
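A quick check of both figures (note the ingest rate works out to megabytes per second, not kilobytes):

```python
# Ingest rate: bytes/s -> MB/s.
ingest_mb_s = 128 * 5000 * 10 / 1e6
# 15 days of events retained in HBase, in TB.
hbase_tb = 5000 * 10 * 60 * 60 * 24 * 15 * 128 / 1e12

print(ingest_mb_s)         # -> 6.4 (MB/s)
print(round(hbase_tb, 1))  # -> 8.3 (the slide rounds to 8.2 TB)
```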
  • 66. Page 66 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing - Number of Worker Nodes for Sensor Data § # of Worker Nodes = Total Cluster Storage / Storage per Server = 1.5 PB / 48 TB ≈ 32
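In code, using the storage-optimized worker spec from a later slide (12 × 4 TB drives per node, i.e. 48 TB raw per server):

```python
from math import ceil

total_cluster_tb = 1500       # 1.5 PB of sensor data over 5 years
storage_per_server_tb = 48    # 12 x 4 TB SATA drives per worker node

workers = ceil(total_cluster_tb / storage_per_server_tb)
print(workers)  # -> 32
```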
  • 67. Page 67 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing – NiFi, Kafka, HBase and Storm Nodes § Recall that: NiFi can collect @ 50 MB/s/node; Kafka can ingest @ 10 MB/s/node or 100,000 events/s/node; Storm can process @ 100,000 events/s/node; each HBase region server can store 1 TB § So for the 6.4 MB/s ingest rate, 1 NiFi, 1 Kafka and 1 Storm node are sufficient § We will use 2 NiFi and 3 Kafka nodes for HA § HBase nodes needed = 8.2 TB / 1 TB ≈ 8 nodes § Co-locate Kafka and Storm; co-locate DataNode and HBase § Totals: DataNodes & HBase 32, NiFi 2, Kafka & Storm ingest nodes 3, client nodes 2, master nodes 5 = 44 nodes
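Putting the node counts together (this simply tabulates the slide's totals):

```python
# Node counts from the sizing exercise above.
nodes = {
    "DataNode + HBase workers": 32,   # sized by 1.5 PB / 48 TB per server
    "NiFi": 2,                        # 1 would suffice; 2 for HA
    "Kafka + Storm ingest": 3,        # 1 would suffice; 3 for HA
    "Client/edge": 2,
    "Master": 5,
}
total = sum(nodes.values())
print(total)  # -> 44
```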
  • 68. www.hortonworks.com Megacorp Datacenter [Diagram] Trucks 1-5000 → NiFi Nodes (NiFi 1-2, running HDF) → Edge/Ingest Nodes (Storm 1-3 co-located with Kafka 1-3) → HDP Worker Nodes (DataNode 1-8 co-located with HBase 1-8; DataNode 9-32), Master Nodes (Master 1-5), Clients 1-2
  • 69. Page 69 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HDP Service Layout [Diagram] Master Nodes 1-5 host the control services: Namenode 1 & 2, Resource Manager 1 & 2, Zookeeper (×4), Journal Keeper (×3), HBase Master 1 & 2, Hiveserver, Hiveserver 2, WebHCat, Falcon, Oozie, History Server, Timeline Server, Kafka, and Ambari (Monitoring & Metrics). Worker Nodes 1-32 each run Node Manager, Datanode, and an HBase Region server. Ingest Nodes 1-3 each run Storm and Kafka. Edge Nodes 1-2 each run Clients and Knox.
  • 70. Page 70 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Master Node Specs: 12+ cores; 128-256 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network links. Approximate cost per node: $8,000 - $18,000
  • 71. Page 71 © Hortonworks Inc. 2011 – 2015. All Rights Reserved NiFi Node Specs: 8+ cores; 16 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network links. Approximate cost per node: $5,000 - $8,000
  • 72. Page 72 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Slave (Worker) Node Specs: 12+ cores; 32-64 GB RAM; one of 12 × 1 TB SATA drives (processing/IOPS optimized), 12 × 2 TB SATA drives (balanced), or 12 × 4 TB SATA drives (storage optimized); 1 × 1-10 Gb network link. Approximate cost per node: $5,000 - $12,000
  • 73. Page 73 © Hortonworks Inc. 2011 – 2015. All Rights Reserved IoT on HDP Problem Statement Reference Architecture & Sizing Solution Design & Customer Case Studies Implementation Plan Project Cost & ROI
  • 74. Page 74 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Project Plan: Strategy 10 days, Training 10 days, Design & Build 60 days, Test 30 days, Promote 10 days (activities: Use Case Workshop, Cluster Build-out, Solution Build-out, Prove-out, Promote Solution). Tam puts together a quick project plan and estimates it will take 120 days to deliver Ms. Brady's solution
  • 75. www.hortonworks.com 75 Resource Plan: Data Scientist Consultant Tam; Data Flow Consultant Varun; Architect Consultant Jeff; Developer Consultant Sue; Project Manager Jen; Engagement Manager Consultant Jim; Enterprise Architect Frank; Business Analyst Sue; Developer Jim
  • 76. IoT on HDP Problem Statement Reference Architecture & Sizing Solution Design & Customer Case Studies Implementation Plan Page 76 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Project Cost & ROI
  • 77. Project Cost
    - Hardware: 44 nodes × $10,000 = $440K
    - Software – HDP: 11 SKUs × $18,000/SKU = $198K
    - Software – HDF: 2 SKUs × $36,000/SKU = $72K
    - Dev and Test Consulting: 3,040 hrs* × $300/hr = $912K
    - Engagement Consulting: 360 hrs* × $300/hr = $108K
    - Training: 30 associates** × $2,500 = $75K
    - Travel & Expense: $100K
    - Total: $1.905M
    * 4 resources × 8 hrs × 95 days; engagement manager for 45 days
    ** Admin, Analyst & Data Science training for 30 associates
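Summing the line items (in thousands of dollars) confirms the project total; note the components add up to $1.905M, slightly above the $1.885M printed in the original deck.

```python
# Project cost line items, in $K, as listed on the slide.
costs_k = {
    "Hardware (44 nodes x $10K)": 44 * 10,
    "HDP software (11 SKUs x $18K)": 11 * 18,
    "HDF software (2 SKUs x $36K)": 2 * 36,
    "Dev & test consulting (3,040 hrs x $300/hr)": 3040 * 300 / 1000,
    "Engagement consulting (360 hrs x $300/hr)": 360 * 300 / 1000,
    "Training (30 x $2.5K)": 30 * 2.5,
    "Travel & expense": 100,
}
total_k = sum(costs_k.values())
print(total_k)  # -> 1905.0, i.e. ~$1.9M
```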
  • 78. Page 78 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Project ROI § Insurance cost reduction – $5M § Project cost – ~$1.9M § First-year savings – ~$3.1M
  • 79. Page 79 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tweet: #hadooproadshow Thank You