2. Hortonworks Company Profile
ONLY 100% open source Apache Hadoop data platform
Founded in 2011
1ST Hadoop provider to go public
IPO 4Q14 (NASDAQ: HDP)
800+ employees across 17 countries
1,350 technology partners
Fastest company to reach $100 M in revenue
3. Let’s talk about Big Data
, September 2014 survey of 100 CIOs from the US and Europe
4. What problems and opportunities does Big Data create?
NEW: Data that traditional platforms cannot handle
TRADITIONAL
The Opportunity
Unlock transformational business value
from a full fidelity of data and analytics
for all data.
Traditional Data Sources
ERP, CRM, SCM
New Data Sources
Geolocation
Server logs
Files & emails
Sensors and machines
Clickstream
Social media
5. The Future of Data: Actionable Intelligence
DATA IN MOTION
DATA AT REST
STORAGE
GROUP 1, GROUP 2, GROUP 3, GROUP 4
INTERNET OF ANYTHING
6. Hortonworks Data Platform
HORTONWORKS DATA PLATFORM
Batch Interactive Search Streaming Machine Learning
YARN Resource Management System
CLICKSTREAM SENSOR SOCIAL MOBILE GEOLOCATIONS SERVER LOG EXISTING
7. HDP is a collection of Apache Projects
HORTONWORKS DATA PLATFORM
Ongoing Innovation in Apache
Projects: Hadoop & YARN, Flume, Oozie, Pig, Hive, Tez, Sqoop, Cloudbreak, Ambari, Slider, Kafka, Knox, Solr, Zookeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm, Phoenix
Categories: DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY
Releases and Hadoop & YARN versions:
HDP 2.0 (Oct 2013): 2.2.0
HDP 2.1 (April 2014): 2.4.0
HDP 2.2 (Dec 2014): 2.6.0
HDP 2.3 (July 2015): 2.7.1
[Chart: component version of each project per HDP release]
8. Hortonworks Data Flow
Visual User Interface
Drag and drop for efficient, agile operations
Immediate Feedback
Start, stop, tune, replay dataflows in real-time
Adaptive to Volume and Bandwidth
Any data, big or small
Event Level Data Provenance
Governance, compliance & data evaluation
Secure Data Acquisition & Transport
Fine-grained encryption for controlled data
sharing and selective data democratization
Powered by
Apache NiFi
9. HDF and HDP Deliver a Complete Big Data Solution
• HDF dynamically connects HDP to
data at the edge
• HDF secures and encrypts the
movement of data into HDP
• HDF includes mature IoAT data
protocols that improve device
extensibility
• HDF supports easily adjustable
bi-directional IoAT dataflows
• HDF offers traceability of IoAT data
with lineage and audit trails
• HDF brings a real-time, visual user
interface to manipulate live dataflows
10. Hortonworks Revenue Model
HDP and HDF are 100% free and
Open Source – no license.
Our customers subscribe to
support, consulting experts and
training programs
Annual Subscriptions
align your success with ours
Expert Consulting & Training
help your team get to actionable intelligence
as efficiently as possible
ARCHITECT & DEVELOP, DEPLOY, OPERATE, EXPAND
Project 1, Project 2, Project 3, Project 4, Project 5, Project 6
12. Hadoop Driver: Cost optimization
Archive Data off EDW
Move rarely used data to Hadoop as active
archive, store more data longer
Offload costly ETL process
Free your EDW to perform high-value functions like
analytics & operations, not ETL
Enrich the value of your EDW
Use Hadoop to refine new data sources, such as
web and machine data for new analytical context
HDP helps you reduce costs and optimize the value associated with your EDW
ANALYTICS: Data Marts, Business Analytics, Visualization & Dashboards
DATA SYSTEMS: HDP 2.3 (Cold Data, Deeper Archive & New Sources), Enterprise Data Warehouse (Hot), MPP, In-Memory, ELT
SOURCES: Existing Systems (ERP, CRM, SCM), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
13. Hadoop Driver: Advanced analytic applications
Single View: Improve acquisition and retention
Predictive Analytics: Identify your next best action
Data Discovery: Uncover new findings
Financial Services
New Account Risk Screens; Trading Risk; Insurance Underwriting
Improved Customer Service; Insurance Underwriting; Aggregate Banking Data as a Service
Cross-sell & Upsell of Financial Products; Risk Analysis for Usage-Based Car Insurance; Identify Claims Errors for Reimbursement
Telecom
Unified Household View of the Customer; Searchable Data for NPTB Recommendations; Protect Customer Data from Employee Misuse
Analyze Call Center Contacts Records; Network Infrastructure Capacity Planning; Call Detail Records (CDR) Analysis
Inferred Demographics for Improved Targeting; Proactive Maintenance on Transmission Equipment; Tiered Service for High-Value Customers
Retail
360° View of the Customer; Supply Chain Optimization; Website Optimization for Path to Purchase
Localized, Personalized Promotions; A/B Testing for Online Advertisements; Data-Driven Pricing, improved loyalty programs
Customer Segmentation; Personalized, Real-time Offers; In-Store Shopper Behavior
Manufacturing
Supply Chain and Logistics; Optimize Warehouse Inventory Levels; Product Insight from Electronic Usage Data
Assembly Line Quality Assurance; Proactive Equipment Maintenance; Crowdsource Quality Assurance
Single View of a Product Throughout Lifecycle; Connected Car Data for Ongoing Innovation; Improve Manufacturing Yields
Healthcare
Electronic Medical Records; Monitor Patient Vitals in Real-Time; Use Genomic Data in Medical Trials
Improving Lifelong Care for Epilepsy; Rapid Stroke Detection and Intervention; Monitor Medical Supply Chain to Reduce Waste
Reduce Patient Re-Admittance Rates; Video Analysis for Surgical Decision Support; Healthcare Analytics as a Service
Oil & Gas
Unify Exploration & Production Data; Monitor Rig Safety in Real-Time; Geographic exploration
DCA to Slow Well Decline Curves; Proactive Maintenance for Oil Field Equipment; Define Operational Set Points for Wells
Government
Single View of Entity; CBM & Autonomic Logistic Analysis; Sentiment Analysis on Program Effectiveness
Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting
14. NiFi and HDF Drivers
Optimize Splunk:
Reduce costs by pre-filtering data so that
only relevant content is forwarded into Splunk
Ingest Logs for Cyber Security:
Integrated and secure log collection for
real-time data analytics and threat detection
Feed Data to Streaming Analytics:
Accelerate big data ROI by streaming data
into analytics systems such as Apache Storm
or Apache Spark Streaming
Move Data Internally:
Optimize resource utilization by moving
data between data centers or between
on-premises infrastructure and cloud
infrastructure
Capture IoT Data:
Transport disparate and often remote IoT
data in real time, despite any limitations
in device footprint, power or
connectivity—avoiding data loss
15. Hadoop Driver: Enabling the data lake
SCALE
SCOPE
Data Lake Definition
• Centralized Architecture
Multiple applications on a shared data set
with consistent levels of service
• Any App, Any Data
Multiple applications accessing all data
affording new insights and opportunities.
• Unlocks ‘Systems of Insight’
Advanced algorithms and applications
used to derive new value and optimize
existing value.
Drivers:
1. Cost Optimization
2. Advanced Analytic Apps
Goal:
• Centralized Architecture
• Data-driven Business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight
16. Case Study: 12 month Hadoop evolution at TrueCar
Data Platform Capabilities
12 months execution plan
June 2013: Begin Hadoop Execution
July 2013: Hortonworks Partnership
Aug 2013: Training & Dev Begins
Nov 2013: Production Cluster, 60 Nodes, 2 PB
Dec 2013: Three Production Apps (3 total)
Jan 2014: 40% Dev Staff Perficient
Feb 2014: Three More Production Apps (6 total)
May ‘14: IPO
12 Month Results at TrueCar
• Six Production Hadoop Applications
• Sixty nodes/2 PB data
• Storage Costs/Compute Costs
from $19/GB to $0.12/GB
“We addressed our data platform capabilities
strategically as a pre-cursor to IPO.”
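To put the per-GB figures above in perspective, the following is a minimal sketch of the arithmetic. Only the two prices come from the slide; the function name and output format are illustrative assumptions.

```python
def cost_reduction(before_per_gb: float, after_per_gb: float) -> tuple[float, float]:
    """Return (improvement factor, percent reduction) between two per-GB costs."""
    factor = before_per_gb / after_per_gb
    percent = (1 - after_per_gb / before_per_gb) * 100
    return factor, percent


# Figures quoted on the slide: $19/GB before, $0.12/GB after.
factor, percent = cost_reduction(19.00, 0.12)
print(f"{factor:.0f}x cheaper, {percent:.1f}% reduction")
```

Under these figures the cost per GB drops by roughly two orders of magnitude.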
18. Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for
managing large volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
✓ Manages new data paradigm
✓ Handles data at scale
✓ Cost effective
✓ Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce
19. Hadoop 2 & YARN based Architecture
2006: Hadoop w/ MapReduce
HDFS (Hadoop Distributed File System)
MapReduce, Largely Batch Processing
Silo’d clusters
Largely batch system
Difficult to integrate
2009: MR-279: YARN
Architected & led development of YARN to enable the Modern Data Architecture
October 23, 2013: Hadoop 2 & YARN
YARN: Data Operating System
HDFS (Hadoop Distributed File System)
Batch, Interactive, Real-Time
20. Apache Hadoop – Data Operating System
Shared Compute & Workload Management
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
Common & Shared Scale Out Storage
• Shared data assets
• Flexible schema
• Cross workload access
Enterprise Hadoop
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
Script: Pig (Tez)
SQL: Hive (Tez)
Java, Scala: Cascading (Tez)
NoSQL: HBase, Accumulo (Slider)
Stream: Storm (Slider)
Search: Solr
In-Memory: Spark
Others: ISV Engines
21. Core Capabilities of Enterprise Hadoop
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
GOVERNANCE & INTEGRATION: Load data and manage according to policy
OPERATIONS: Deploy and effectively manage the platform
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
22. Hortonworks Data Platform 2.3
Hortonworks Data Platform 2.3
DATA MANAGEMENT
YARN: Data Operating System
HDFS: Hadoop Distributed File System
DATA ACCESS
Batch: MapReduce
Script: Pig
SQL: Hive
NoSQL: HBase, Accumulo, Phoenix
Stream: Storm
Search: Solr
In-memory: Spark
Others: ISV Engines
Tez, Slider
GOVERNANCE & INTEGRATION
Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
Data Lifecycle & Governance: Falcon, Atlas
SECURITY
Administration, Authentication, Authorization, Auditing, Data Protection
Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
Scheduling: Oozie
Deployment Choice: Linux, Windows, On-Premise, Cloud
25. More than save cost, Enrich With New Data
Batch: Sqoop
Streaming: NiFi
Raw: HDFS
Processed: Hive
Interactive: HiveServer
Reporting: BI Tools
EDW, New Sources, Existing Analytics, New Analytics, External Tables
Load, Transform, Fetch
Steps 1 to 6
26. Streaming Solution Architecture
Real-time data feeds
Apache Kafka
HDP 2.x Data Lake: YARN, HDFS
Real Time Stream Processing: Storm
Online Data Processing: HBase, Accumulo
Search: Solr
Slider
SQL: Hive
Streaming Ingest: HDFS
27. Key Tenets of Lambda Architecture
• Batch Layer
  • Manages master data
  • Immutable, append-only set of raw data
  • Cleanse, Normalize & Pre-Compute Batch Views
  • Advanced Statistical Calculations
• Speed Layer
  • Real Time Event Stream Processing
  • Computes Real-Time Views
• Serving Layer
  • Low-latency, ad-hoc query
  • Reporting, BI & Dashboard
New Data Stream
BATCH LAYER: Store, Pre-Compute Views
SPEED LAYER: Process Streams, Incremental Views
SERVING LAYER: Business View, Business View, Query
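The interplay of the three layers can be illustrated with a minimal in-memory sketch. It assumes a hypothetical stream of (vehicle_id, incidents) events; the class and method names are illustrative only and are not part of HDP, HDF, or any Apache project.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, speed, and serving layers (illustrative only)."""

    def __init__(self):
        self.master = []             # batch layer: append-only raw events
        self.batch_view = Counter()  # view pre-computed from the master data
        self.speed_view = Counter()  # incremental view of events since the last batch run

    def ingest(self, vehicle_id, incidents):
        # New data goes to the master data set and to the speed layer.
        self.master.append((vehicle_id, incidents))
        self.speed_view[vehicle_id] += incidents

    def run_batch(self):
        # Recompute the batch view from all raw data, then reset the speed view,
        # because the batch view now covers every event seen so far.
        view = Counter()
        for vehicle_id, incidents in self.master:
            view[vehicle_id] += incidents
        self.batch_view = view
        self.speed_view = Counter()

    def query(self, vehicle_id):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[vehicle_id] + self.speed_view[vehicle_id]


store = LambdaSketch()
store.ingest("truck-1", 2)
store.run_batch()
store.ingest("truck-1", 3)
print(store.query("truck-1"))  # 5: 2 from the batch view plus 3 from the speed view
```

The design choice worth noting is that the raw data is never updated in place: views are disposable and can always be rebuilt from the master data set.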
HDP and HDF
High Level Big Data IoT Architecture
29. www.hortonworks.com
Mega Corp has a problem
Incidents involving maintenance vehicles have continued to increase under COO Brady’s watch
Insurance Premiums, 2012 to 2015 (chart; labeled value: 17.5M)
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and be able to prevent them. Ms. Brady sets the goal of reducing incidents by 5% within 90 days.
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of and reduce incidents.
Business Analyst
Tam
30. www.hortonworks.com
Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company out of the high-risk insurance category it is currently in and save the company $1,000 on its insurance premium per truck per year, or $5,000,000 annually.
Business Analyst
Tam
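The savings estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the inputs are the figures quoted on the slide, while the function and key names are illustrative assumptions.

```python
def premium_summary(trucks: int, premium_per_truck: int, saving_per_truck: int) -> dict:
    """Summarize total premium and projected savings for a fleet."""
    current_total = trucks * premium_per_truck
    annual_saving = trucks * saving_per_truck
    return {
        "current_total": current_total,
        "annual_saving": annual_saving,
        "new_total": current_total - annual_saving,
        "saving_share": annual_saving / current_total,
    }


# Figures from the slide: 5,000 trucks, $3,500 premium, $1,000 saving per truck.
summary = premium_summary(trucks=5_000, premium_per_truck=3_500, saving_per_truck=1_000)
print(f"Current premium: ${summary['current_total']:,}")
print(f"Annual saving:   ${summary['annual_saving']:,}")
print(f"Share saved:     {summary['saving_share']:.1%}")
```

The current total of $17,500,000 matches the 17.5M label on the Insurance Premiums chart, and the $5,000,000 saving is about 29% of that total.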
31. www.hortonworks.com
Tam considers four questions she
must answer to better understand
and mitigate incidents. They are:
1) Is there a correlation between driver
training and incidents?
2) Is there a correlation between weather
and incidents?
3) Is there a correlation between
certain driving behavior and
incidents?
4) Is it possible to predict incidents
before they occur?
Business Analyst
Tam
Shift from Reactive … to … Proactive & Prescriptive
From break then fix … to Preventative Maintenance
From static resource planning … to Resource Optimization
From reaction to human activity … to Behavioral Insight
32. www.hortonworks.com
Initially, Tam’s team is concerned that they may not be
able to capture all the necessary data to answer the
questions Tam has posed and help her mitigate
incidents. They know that the data is not all
structured and some of it is created in real-time and
transmitted over the Internet. In addition, some data
will have to be captured from external sources.
Vehicle Data
Route Data
Weather Data
Structured Driver
Data
Semi-Structured
Maintenance Data
Sue, Varun, Jeff
35. www.hortonworks.com
Tam (Business Analyst) + HDP Data Analyst Training = HDP Data Analyst
Sue (Developer) + Developer Training = HDP Developer
Varun (System Admin) + HDP System Admin Training = HDP Sys Admin
Jeff (SME) + Data Science Training = HDP Data Scientist
The team then engages their favorite SI and attends Hortonworks University training to get the project under way
39. www.hortonworks.com
Ms. Brady is happy with the results. She is able to determine that a subset of drivers is responsible for the increased cost. But like most managers, she is not happy for long. Now she wants to be able to predict future incidents.
Data Scientist
Machine Learning
Jeff points out that HDP has a tremendous library of statistical algorithms, and that he can use these libraries to predict which drivers are likely to have an event before the event occurs.
Jeff
41. www.hortonworks.com
Ms. Brady is happy now that
she can isolate where problems
exist, identify causal events
and build models that help
predict events before they
occur.
42. www.hortonworks.com
< TODO: Show St. Louis Case
Study >
http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
61. Problem statement recap
Incidents involving maintenance vehicles have continued to increase under COO Brady’s watch. The Department of Transportation has contacted Mega Corporation.
Insurance Premiums, 2012 to 2015 (chart; labeled value: 17.5M)
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and be able to prevent them. Ms. Brady sets the goal of reducing incidents by 5% within 90 days.
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of and reduce incidents.
Business Analyst
Tam
62. Problem statement recap
Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company out of the high-risk insurance category it is currently in and save the company $1,000 on its insurance premium per truck per year, or $5,000,000 annually.
Business Analyst
Tam