Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Real-Time Processing in Hadoop
Phoenix Hadoop User Group
Shane Kumpf & Mac Moore
Solutions Engineers, Hortonworks
July 2015
© Hortonworks Inc. 2012
Professional Services
Agenda
 Introduction & about Hortonworks HDP
 Overview of logistics industry scenario
 Overview of streaming architecture on HDP
 Streaming Demo #1
 Integrating Predictive Analytics in streaming scenarios
 Streaming Demo with Predictive additions
 Q & A
Preface: Enabling Technologies
Enablers: Key technologies from mass consumer-scale deployments.
• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products/applications that would
have been too cost prohibitive – or simply impossible - beforehand.
• Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras
(from smartphones) have enabled Electric cars, quad-copters, VR displays, & more…
• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables
new real-time advanced analytic applications.
Why did Hadoop emerge?
April 2015
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to Scale
New data sources: Clickstream, Geolocation, Web Data, Internet of Things, Docs and emails, Server logs
Data growth: 2.8 Zettabytes (2012) to 40 Zettabytes (2020)
[Figure: business value of traditional data (ERP, CRM, SCM) versus new data, contrasting laggards with industry leaders]
Hadoop for the Enterprise: Implement a
Modern Data Architecture with HDP
Spring 2015
Hortonworks. We do Hadoop.
Hadoop for the Enterprise:
Implement a Modern Data Architecture with HDP
Customer Momentum
• 430+ customers (Q1 2015)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for
resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focus on enterprise needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers,
operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
Customer Partnerships matter
Driving our innovation through
Apache Software Foundation Projects
Apache Project | Committers | PMC Members
Hadoop | 27 | 21
Pig | 5 | 5
Hive | 18 | 6
Tez | 16 | 15
HBase | 6 | 4
Phoenix | 4 | 4
Accumulo | 2 | 2
Storm | 3 | 2
Slider | 11 | 11
Falcon | 5 | 3
Flume | 1 | 1
Sqoop | 1 | 1
Ambari | 34 | 27
Oozie | 3 | 2
Zookeeper | 2 | 1
Knox | 13 | 3
Ranger | 10 | n/a
TOTAL | 161 | 108
Source: Apache Software Foundation. As of 11/7/2014.
Hortonworkers are the architects and
engineers that lead development of open
source Apache Hadoop at the ASF
• Expertise
Uniquely capable to solve the most complex issues &
ensure success with latest features
• Connection
Provide customers & partners direct input into
the community roadmap
• Partnership
We partner with customers with subscription offering.
Our success is predicated on yours.
[Chart: committers by employer. Hortonworks: 27, Cloudera: 11, Yahoo: 10, Facebook: 5, LinkedIn: 2, IBM: 2, Others: 23]
Technology Partnerships matter
Apache Project Hortonworks
Relationship
Named
Partner
Certified
Solution
Resells
Joint
Engr
Microsoft    
HP    
SAS   
SAP    
IBM   
Pivotal   
Redhat   
Teradata    
Informatica   
Oracle  
It is not just about
packaging and certifying
software…
Our joint engineering
with our partners drives
open source standards
for Apache Hadoop
HDP is
Apache Hadoop
HDP delivers a Centralized Architecture
Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach to
governance, security and operations
• Versatile to handle any applications
and datasets no matter the size or type
[Figure: Modern Data Architecture]
SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
Existing Systems: ERP, CRM, SCM
ANALYTICS: Data Marts, Business Analytics, Visualization & Dashboards, Applications
Platform: HDFS (Hadoop Distributed File System) with YARN: Data Operating System
Workloads on YARN: Batch, Interactive, Real-Time, Partner ISV
Adjacent systems: MPP, EDW
HDP delivers a completely open data platform
Hortonworks Data Platform 2.2
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core
enterprise services, for any application and any data.
Completely Open
• HDP incorporates every element
required of an enterprise data
platform: data storage, data access,
governance, security, operations
• All components are developed in
open source and then rigorously
tested, certified, and delivered as an
integrated open source platform that’s
easy to consume and use by the
enterprise and ecosystem.
[Figure: Hortonworks Data Platform 2.2 components]
GOVERNANCE: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm
SECURITY: Apache Ranger, Apache Knox, Apache Falcon
OPERATIONS: Apache Ambari, Apache Zookeeper, Apache Oozie
Foundation: YARN: Data Operating System (Cluster Resource Management) on HDFS (Hadoop Distributed File System)
Real World Use Case:
Trucking Company
Spring 2015
Hortonworks. We do Hadoop.
Scenario Overview
Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a
given route; an event could be:
 'Normal' events: starting / stopping of the
vehicle
 ‘Violation’ events: speeding, excessive
acceleration and braking, unsafe tail distance
Company uses an application that monitors
truck locations and violations from the
truck/driver in real-time
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers
Distributed Storage: HDFS
Many Workloads: YARN
Trucking Company’s YARN-enabled Architecture
Stream Processing
(Storm)
Inbound Messaging
(Kafka)
Real-time Serving
(HBase)
Alerts & Events
(ActiveMQ)
Real-Time
User Interface
One cluster with consistent
security, governance &
operations
SQL
Interactive Query
(Hive on Tez)
Truck Sensors
What is Kafka?
 High throughput distributed messaging
system
 Publish-Subscribe semantics but re-
imagined at the implementation level to
operate at speed with big data volumes
 Kafka @LinkedIn:
 800 billion messages per day
 175 terabytes of data written per day
 650 terabytes of data read per day
 Over 13 million messages/2.75GB of data
per second
[Figure: several producers publish to the Kafka cluster; several consumers read from it]
Kafka: Anatomy of a Topic
[Figure: a topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered sequence of numbered messages running from old to new; writes are appended at the new end.]
 Partitioning allows topics to
scale beyond a single
machine/node
 Topics can also be replicated,
for high availability.
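
The partitioning idea can be sketched in a few lines of plain Python. This is an in-memory illustration only, not the Kafka client API: the `Topic` class, the `truck_events` name and the CRC32-based key hashing are assumptions made for the example.

```python
import zlib


class Topic:
    """In-memory stand-in for a partitioned, append-only topic (illustrative only)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an ordered log; a message's offset is its index in that log.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append(value)        # writes always go to the "new" end
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        # A consumer tracks its own offset and reads forward from it.
        return self.partitions[partition][offset:]


topic = Topic("truck_events", num_partitions=3)
for i in range(6):
    topic.publish("truck-7", f"event-{i}")
```

Because all events for one key land in one partition, their relative order is preserved, while different keys can spread across partitions and therefore across machines.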
Apache Storm
• Distributed, real time, fault tolerant Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
•Tuples
•Streams
•Spouts
•Bolts
•Topology
Tuples and Streams
• What is a Tuple?
–Fundamental data structure in Storm: a named list of values, each of which can be of any data type.
• What is a Stream?
–An unbounded sequence of tuples.
–The core abstraction in Storm; streams are what you “process” in Storm
Spouts
• What is a Spout?
–Generates, or is a source of, streams
–E.g.: JMS, Twitter, Log, Kafka Spout
–Can spin up multiple instances of a Spout and dynamically adjust as needed
Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
–Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
Truck-Event-Processor Topology
[Figure: the topology connects the Kafka Spout to the HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt via four streams]
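
To make the spout, bolt and topology vocabulary concrete, here is a minimal plain-Python sketch. It does not use the Storm API: the class names, the `driver_id,event_type` message format and the threshold of 2 are all assumptions for illustration, and the bolts are chained one after another rather than wired exactly as in the deck's diagram.

```python
from collections import Counter


def kafka_spout(messages):
    """Spout: turns raw messages into tuples (named lists of values)."""
    for raw in messages:
        driver_id, event_type = raw.split(",")
        yield {"driver_id": driver_id, "event_type": event_type}


class CountingBolt:
    """Stands in for the HBase bolt: counts non-normal events per driver."""

    def __init__(self):
        self.violations = Counter()

    def process(self, tup):
        if tup["event_type"] != "Normal":
            self.violations[tup["driver_id"]] += 1
        return tup  # pass the tuple downstream unchanged


class MonitoringBolt:
    """Records an alert when a driver's violation count exceeds the threshold."""

    def __init__(self, counts, threshold):
        self.counts = counts
        self.threshold = threshold
        self.alerts = []

    def process(self, tup):
        over = self.counts[tup["driver_id"]] > self.threshold
        if tup["event_type"] != "Normal" and over:
            self.alerts.append(f"ALERT driver {tup['driver_id']}")
        return tup


def run_topology(messages, threshold=2):
    """Topology: the spout's stream flows through both bolts in turn."""
    counting = CountingBolt()
    monitoring = MonitoringBolt(counting.violations, threshold)
    for tup in kafka_spout(messages):
        monitoring.process(counting.process(tup))
    return counting.violations, monitoring.alerts


counts, alerts = run_topology(
    ["d1,Normal", "d1,Overspeed", "d1,Overspeed", "d1,Overspeed", "d2,Overspeed"]
)
```

In a real topology each bolt would run as several parallel instances and the stream would be unbounded; the generator here simply stops when the sample list is exhausted.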
Key Constructs in Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned Records
– Distributed across a cluster of machines
– Low Latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
Data Assignment
Keys within an HBase table are divided among different RegionServers
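
A rough sketch of how a row key could be mapped to the server holding it, assuming each region covers a contiguous, sorted range of keys. The split points and server names below are hypothetical, and this is not the HBase client's actual lookup code.

```python
import bisect

# Hypothetical split points: region i serves row keys in [start_i, start_{i+1}).
REGION_START_KEYS = ["", "g", "n", "t"]
# Hypothetical assignment; one server may host several regions.
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def region_index(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def server_for(row_key):
    """Return the RegionServer responsible for row_key."""
    return REGION_SERVERS[region_index(row_key)]
```

For example, `server_for("route9")` falls in the range starting at `"n"` and resolves to `rs3` under these assumed split points.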
Data Access
• Get
–Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey
• Put
–Inserts a new version of a cell.
• Scan
–Reads the whole table row by row, or a section of the table from a given start key to a given end key
• Delete
–Implemented as a put: a new version is written with a deletion marker
• SQL via Apache Phoenix
–Unique capability in the NoSQL market
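
The get, put, scan and delete semantics above can be modeled with a small in-memory table. This is a sketch under simplifying assumptions (a counter instead of timestamps, no column families, an exclusive stop row), not the HBase client API; the `Table` class and the `events:count` column name are made up for the example.

```python
import itertools

TOMBSTONE = object()  # deletion marker, stored as just another version


class Table:
    """Tiny in-memory model of versioned cells keyed by (row key, column)."""

    def __init__(self):
        self._cells = {}                  # (row, column) -> list of (version, value)
        self._clock = itertools.count(1)  # stand-in for timestamps

    def put(self, row, column, value):
        # A put never overwrites; it appends a new version of the cell.
        self._cells.setdefault((row, column), []).append((next(self._clock), value))

    def delete(self, row, column):
        # A delete is a put of a marker; older versions stay in storage.
        self.put(row, column, TOMBSTONE)

    def get(self, row, column):
        versions = self._cells.get((row, column))
        if not versions or versions[-1][1] is TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) for start_row <= row < stop_row, in key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value


t = Table()
t.put("driver1", "events:count", 1)
t.put("driver1", "events:count", 2)
t.put("driver2", "events:count", 5)
t.delete("driver2", "events:count")
```

After these calls the deleted cell still holds two stored versions (the value and the marker), but reads and scans no longer return it.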
[Figure: timeline from 2006 through 2009 to October 23, 2013]
Hadoop w/ MapReduce: MapReduce on HDFS (Hadoop Distributed File System)
• Largely batch processing
• Silo’d clusters
• Largely batch system
• Difficult to integrate
MR-279: YARN
Hadoop 2 & YARN based architecture: YARN: Data Operating System on HDFS (Hadoop Distributed File System), running Batch, Interactive and Real-Time workloads
Architected & led development of YARN to enable the Modern Data Architecture
Benefits of YARN as the Data Operating System
• The container based model allows for running nearly any workload.
–Enables the centralized architecture.
–No longer is MapReduce the only data processing engine.
–Docker containers managed by YARN. Yes Please!
• Decouples resource scheduling from application lifecycle.
–Improved scalability and fault tolerance
• Dynamically allocated resources, resulting in HUGE utilization gains
–Versus static allocation of “slots” in Hadoop 1.0
Yahoo has over 30000 nodes running YARN across over 365PB of data.
They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.
Apache HDFS – Hadoop Distributed File System
• Very large scale distributed file system
• 10K nodes, tens of millions of files and PBs of data
• Supports large files
• Designed to run on commodity hardware, assumes hardware failures
• Files are replicated to handle hardware failure
• Detects failures and recovers from them automatically
• Optimized for Large Scale Processing
• Data locations are exposed so that the computations can move to where data resides
• Data Coherency
• Write once and read many times access pattern
• Files are broken up in chunks called ‘blocks’
• Blocks are distributed over nodes
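
A short sketch of the block and replica idea. The 128 MB block size and replication factor of 3 are assumptions (the slide states neither), the node names are hypothetical, and the round-robin placement is a deliberate simplification rather than the NameNode's actual placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                  # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Round-robin placement: each block gets `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

With every block stored on three different nodes, losing any single node still leaves at least two copies of each block, and a scheduler can send computation to whichever node already holds the data.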
Streaming Demo - High Level Architecture
[Figure: trucks T(1), T(2) … T(N) stream data into the Truck Events topic in Kafka (inbound messaging). On YARN, the Storm topology reads the topic through the Kafka Spout; the HBase Bolt writes to the Dangerous Events table in HBase, the HDFS Bolt stores Truck Events on HDFS, and the Monitoring Bolt sends alerts through ActiveMQ to the Web App.]
Demo – Streaming Dashboard
A New Challenge
CDO’s vision: Build a Predictive Business, not a Reactive one
CDO’s Requirements
 Offline predictions
 Identify investments that will increase
safety and reduce company’s liabilities
 Real-time predictions
 Anticipate driver violations before they
happen and take precautionary actions
Data Scientist’s Response
 Need to explore data & form a hypothesis
 Verify trends against TBs of events data via
machine learning
 Generate predictive models with Spark
MLlib on HDP
 Plug models into the Storm topology to predict
driver violations in real-time
♬ I’ve been waiting for
this moment all my life ♬
Demo – Analyzing Events with Tableau
Analyzing Raw Events – dangerous drivers
Analyzing Raw Events – dangerous routes
Analyzing Raw Events – violations by location
Enriching truck events for analysis with Pig
Sources: Raw Truck Events (HDFS), Raw Weather Data (Weather Data Sets), Payroll Data (HR & Payroll DBs), with HCatalog (Metadata)
Pipeline: Load Raw Truck Events → Clean & Filter → Cleaned Events → Transform → Transformed Events → Join with HR & weather data → Enriched Events → Store
Enriched Events are then analyzed in Tableau
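
The deck runs this enrichment in Pig; the same sequence of steps can be sketched in plain Python to show what each stage does. Every field name, the cleaning rule and the join keys below are assumptions for illustration, since the slides do not show the actual script or schema.

```python
def clean_and_filter(raw_events):
    """Drop records that are missing a driver id or an event type."""
    return [e for e in raw_events if e.get("driver_id") and e.get("event_type")]


def transform(events):
    """Normalize the event type and derive a calendar date from the timestamp."""
    return [
        {**e, "event_type": e["event_type"].strip().title(), "date": e["timestamp"][:10]}
        for e in events
    ]


def enrich(events, payroll_by_driver, weather_by_date):
    """Inner join with payroll data (by driver) and weather data (by date)."""
    return [
        {**e, **payroll_by_driver[e["driver_id"]], **weather_by_date[e["date"]]}
        for e in events
        if e["driver_id"] in payroll_by_driver and e["date"] in weather_by_date
    ]


raw = [
    {"driver_id": "d1", "event_type": " overspeed ", "timestamp": "2015-04-01T08:00:00"},
    {"driver_id": "", "event_type": "Normal", "timestamp": "2015-04-01T09:00:00"},
]
payroll = {"d1": {"wage_plan": "Miles", "certified": "No"}}
weather = {"2015-04-01": {"foggy": "Yes"}}

enriched = enrich(transform(clean_and_filter(raw)), payroll, weather)
```

The second sample record is dropped at the cleaning stage because its driver id is empty; the first comes out carrying both payroll and weather attributes.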
Analyzing Enriched Events – noncertified and fatigued
drivers more dangerous
Analyzing Enriched Events – top 3 dangerous routes seem
to be driven by fatigued drivers
Analyzing Enriched Events – foggy weather leads to
violations
Analyzing Enriched Events – but top 3 safest routes are
also foggy
Integrating Predictive Analytics
Building the Predictive Model on HDP
1. Explore a small subset of events (in Tableau) to identify predictive features and make a hypothesis, e.g. “foggy weather causes driver violations”
2. Identify suitable ML algorithms to train a model: we will use classification algorithms, as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib; many ML libs expect training data in a certain format
4. Train a logistic regression model in Spark on YARN, with the above events as training input, and iterate to fine-tune the generated model
5. Integrate the Spark MLlib model in a Storm bolt to predict violations in real time
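
To show what training a logistic regression classifier on labeled events involves, here is a minimal pure-Python sketch using batch gradient descent. It is not Spark MLlib and not the deck's model: the function names, the step size and the four-row toy data set (features `[foggy, certified]`, label equal to `foggy`) are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, num_iterations=100, step=1.0, reg_param=0.0):
    """Batch gradient descent on log loss with an optional L2 penalty.

    rows is a list of (label, features) pairs with label in {0, 1}.
    """
    n_features = len(rows[0][1])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(rows)
    for _ in range(num_iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for label, x in rows:
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - label
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        weights = [w - step * (g / m + reg_param * w) for w, g in zip(weights, grad_w)]
        bias -= step * grad_b / m
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (violation) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0


# Toy data: features are [foggy, certified]; the label simply follows `foggy`.
rows = [(1, [1.0, 0.0]), (1, [1.0, 1.0]), (0, [0.0, 0.0]), (0, [0.0, 1.0])]
weights, bias = train_logistic_regression(rows)
```

The learned weights play the role the deck assigns to the trained model: a larger positive weight on a feature means that feature pushes the prediction toward "violation".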
Truck Sensors
HDFS
YARN
Integrate Predictive Analytics in Stream Processing
Stream Processing
(Storm)
Inbound Messaging
(Kafka)
Interactive Query
(Hive on Tez)
Real-time Serving
(HBase)
Millions of Enriched Truck Events
Prediction Bolt
Plug Spark model
into Storm bolt
Machine Learning
(Spark)
Train Spark ML model with
millions of truck events
Streaming Demo - Updated Architecture
[Figure: updated demo architecture. Trucks T(1), T(2) … T(N) stream data into the Truck Events topic in Kafka (inbound messaging). On YARN, the Storm topology reads the topic through the Kafka Spout and runs the HBase Bolt, the HDFS Bolt (Truck Events stored on HDFS), the Monitoring Bolt and a Prediction Bolt. The Prediction Bolt enriches each event using the PayRoll table in HBase, predicts violations in real time and alerts via ActiveMQ; the Web App renders the real-time predictions on the UI.]
Transforming training data for Spark MLlib
Enriched Events Data
Event Type | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Longitude | Latitude | Weather Foggy | Weather Rainy | Weather Windy
Normal | Yes | Hourly | 45 | 2721 | -91.3 | 38.14 | No | No | No
Overspeed | No | Miles | 72 | 4152 | -94.23 | 37.09 | Yes | Yes | No
… | … | … | … | … | … | … | … | … | …

Spark MLlib Training Data
Label | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Weather Foggy | Weather Rainy | Weather Windy
0 | 1 | 1 | 0.45 | 0.2721 | 0 | 0 | 0
1 | 0 | 0 | 0.72 | 0.4152 | 1 | 1 | 0
… | … | … | … | … | … | … | …

• Normal events are labeled 0 and violation events 1
• Feature scaling is applied to hours and miles to improve algorithm performance
• Features with binary values are denoted as 0 and 1
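
The transformation from an enriched event to a training row can be sketched as below. The dictionary keys are hypothetical names for the columns shown, and the scaling divisors (100 for hours, 10000 for miles) are inferred from the two sample rows rather than stated in the deck.

```python
def yes_no(value):
    """Encode a Yes/No column as 1/0."""
    return 1 if value == "Yes" else 0


def to_labeled_point(event):
    """Convert one enriched event (a dict) into a (label, features) pair."""
    label = 0 if event["event_type"] == "Normal" else 1
    features = [
        yes_no(event["certified"]),
        1 if event["wage_plan"] == "Hourly" else 0,
        event["hours_driven"] / 100.0,    # divisor inferred from the sample rows
        event["miles_driven"] / 10000.0,  # divisor inferred from the sample rows
        yes_no(event["foggy"]),
        yes_no(event["rainy"]),
        yes_no(event["windy"]),
    ]
    return label, features


normal = {
    "event_type": "Normal", "certified": "Yes", "wage_plan": "Hourly",
    "hours_driven": 45, "miles_driven": 2721,
    "foggy": "No", "rainy": "No", "windy": "No",
}
overspeed = {
    "event_type": "Overspeed", "certified": "No", "wage_plan": "Miles",
    "hours_driven": 72, "miles_driven": 4152,
    "foggy": "Yes", "rainy": "Yes", "windy": "No",
}
```

Longitude and latitude are left out, matching the training table, which does not carry them over.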
Running Spark ML on YARN
1. Run the spark-submit script to launch a Spark job on YARN (/user/root/truck_training is the training data location on HDFS):
spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
  --master yarn-cluster --num-executors 3 --driver-memory 512m \
  --executor-memory 512m --executor-cores 1 \
  truckml.jar --algorithm LR --regType L2 --regParam 1.0 \
  /user/root/truck_training --numIterations 100
2. Monitor progress of the Spark job in the YARN Resource Mgr UI
Interpreting Spark Logistic Regression Results
Precision: 87.5% Recall: 88%
Top three predictors of violations
1. Foggy Weather 2. Rainy Weather 3. Driver Certification
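
For readers unfamiliar with the two metrics, precision is the share of predicted violations that were real, and recall is the share of real violations that were predicted. The sketch below shows the arithmetic; the counts are hypothetical, chosen only so that the formulas reproduce the percentages on the slide, and are not the deck's actual confusion matrix.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts, not taken from the presentation.
precision, recall = precision_recall(true_pos=154, false_pos=22, false_neg=21)
```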
Integrating Spark model in Storm
Kafka Spout
Storm Prediction Bolt
 Initialize Spark model
 Parse truck event
 Enrich event with HBase data
 Predict violation with model
 Send Alert if violation predicted
Real-time Serving
(HBase)
Active MQ
Ops Center LOB Dashboards
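
The five steps of the prediction bolt can be sketched in plain Python as follows. This is not Storm or Spark code: the class, the JSON event format, the two features and the coefficients are all made up for illustration, a dictionary stands in for the HBase lookup, and a list's `append` stands in for publishing to ActiveMQ.

```python
import json
import math


class PredictionBolt:
    def __init__(self, weights, bias, driver_lookup, send_alert):
        # Step 1, "initialize model": here just logistic regression coefficients.
        self.weights = weights
        self.bias = bias
        self.driver_lookup = driver_lookup  # stand-in for the HBase read
        self.send_alert = send_alert        # stand-in for the ActiveMQ publish

    def process(self, raw_event):
        event = json.loads(raw_event)                     # step 2: parse truck event
        driver = self.driver_lookup[event["driver_id"]]   # step 3: enrich
        features = [driver["certified"], event["foggy"]]
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        violation = 1.0 / (1.0 + math.exp(-z)) >= 0.5     # step 4: predict
        if violation:                                     # step 5: alert
            self.send_alert(f"predicted violation for driver {event['driver_id']}")
        return violation


alerts = []
bolt = PredictionBolt(
    weights=[-1.0, 3.0],  # made-up coefficients for [certified, foggy]
    bias=-1.0,
    driver_lookup={"d1": {"certified": 1}, "d2": {"certified": 0}},
    send_alert=alerts.append,
)
first = bolt.process('{"driver_id": "d2", "foggy": 1}')
second = bolt.process('{"driver_id": "d1", "foggy": 0}')
```

Loading the model once in the constructor, rather than per event, mirrors the "Initialize Spark model" step and keeps the per-tuple work to a lookup and a dot product.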
Summary: Solution Value
Value of large scale ML on HDP
 Accelerate time to market/value
 Test out multiple ML algorithms against TBs of training data in reasonable
time frames
 Confirm hypothesis against TBs of training data with confidence
 We confirmed that fog does impact safety and wage plans do not,
whereas BI tools indicated otherwise
 Easily integrate predictive models in data driven apps
 Run predictive models in Storm or any other app in your enterprise
 Run all of the above in a multi-tenant YARN cluster
 Large scale ML on YARN respects other tenants in an HDP cluster
Recommendations to CDO
 Investment recommendations, in order of priority
1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
2. Invest in slip resistant tires to fight rainy conditions
3. Invest in certifying drivers to reduce violation probability
 Power of real time predictions
 40% reduction in violation rates by predicting high risk situations in real-time and
sending immediate alerts to drivers
Predictive Demo
Q & A

Realtime Analytics with Storm and Hadoop
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 

Semelhante a Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitDataWorks Summit
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Innovative Management Services
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
 
Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Hortonworks
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramHortonworks
 
Mrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big DataMrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big DataPatrickCrompton
 
Trafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoopTrafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoopKrishna-Kumar
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformEMC
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopSlim Baltagi
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopPOSSCON
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupMats Johansson
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks
 

Semelhante a Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG (20)

Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop Summit
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready Program
 
Mrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big DataMrinal devadas, Hortonworks Making Sense Of Big Data
Mrinal devadas, Hortonworks Making Sense Of Big Data
 
Trafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoopTrafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoop
 
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platformPivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
OOP 2014
OOP 2014OOP 2014
OOP 2014
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2Hortonworks and Red Hat Webinar - Part 2
Hortonworks and Red Hat Webinar - Part 2
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
 
Meetup oslo hortonworks HDP
Meetup oslo hortonworks HDPMeetup oslo hortonworks HDP
Meetup oslo hortonworks HDP
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 

Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Real-Time Processing in Hadoop Phoenix Hadoop User Group Shane Kumpf & Mac Moore Solutions Engineers, Hortonworks July 2015
  • 2. Agenda: Introduction & about Hortonworks HDP; Overview of logistics industry scenario; Overview of streaming architecture on HDP; Streaming Demo #1; Integrating Predictive Analytics in streaming scenarios; Streaming Demo with Predictive additions; Q & A
  • 3. Preface: Enabling Technologies
  • 4. Preface: Enabling Technologies. Enablers: Key technologies from mass consumer-scale deployments.
  • 5. Preface: Enabling Technologies • Problems solved at scale, via fundamentally new approaches… • Make it possible, even simple, to produce new products/applications that would have been too cost prohibitive – or simply impossible – beforehand. • Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras (from smartphones) have enabled electric cars, quad-copters, VR displays, & more… • Hadoop has similarly led to breakthroughs in big data scale & capability, and enables new real-time advanced analytic applications.
  • 6. Why did Hadoop emerge? April 2015
  • 7. Traditional systems under pressure. Challenges • Constrains data to app • Can’t manage new data • Costly to scale. [Chart labels: Business Value; New Data: Clickstream, Geolocation, Web Data, Internet of Things, Docs, emails, Server logs; Traditional: ERP, CRM, SCM; 2012: 2.8 Zettabytes; 2020: 40 Zettabytes; Laggards vs. Industry Leaders]
  • 8. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP. Spring 2015. Hortonworks. We do Hadoop.
  • 9. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP. Customer Momentum • 430+ customers (Q1 2015). Hortonworks Data Platform • Completely open multi-tenant platform for any app & any data. • A centralized architecture of consistent enterprise services for resource management, security, operations, and governance. Partner for Customer Success • Open source community leadership focus on enterprise needs • Unrivaled world class support • Founded in 2011 • Original 24 architects, developers, operators of Hadoop from Yahoo! • 600+ Employees • 1000+ Ecosystem Partners
  • 10. Customer Partnerships matter. Driving our innovation through Apache Software Foundation Projects.
      Apache Project | Committers | PMC Members
      Hadoop | 27 | 21
      Pig | 5 | 5
      Hive | 18 | 6
      Tez | 16 | 15
      HBase | 6 | 4
      Phoenix | 4 | 4
      Accumulo | 2 | 2
      Storm | 3 | 2
      Slider | 11 | 11
      Falcon | 5 | 3
      Flume | 1 | 1
      Sqoop | 1 | 1
      Ambari | 34 | 27
      Oozie | 3 | 2
      Zookeeper | 2 | 1
      Knox | 13 | 3
      Ranger | 10 | n/a
      TOTAL | 161 | 108
      Source: Apache Software Foundation. As of 11/7/2014.
      Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF. • Expertise: Uniquely capable to solve the most complex issues & ensure success with latest features • Connection: Provide customers & partners direct input into the community roadmap • Partnership: We partner with customers with subscription offering. Our success is predicated on yours.
      [Chart labels: 27; Cloudera: 11; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23; Yahoo: 10]
  • 11. Technology Partnerships matter. Hortonworks relationship by partner (columns: Named Partner, Certified Solution, Resells, Joint Engr; one ✓ per relationship shown on the slide): Microsoft ✓✓✓✓; HP ✓✓✓✓; SAS ✓✓✓; SAP ✓✓✓✓; IBM ✓✓✓; Pivotal ✓✓✓; Redhat ✓✓✓; Teradata ✓✓✓✓; Informatica ✓✓✓; Oracle ✓✓. It is not just about packaging and certifying software… Our joint engineering with our partners drives open source standards for Apache Hadoop. HDP is Apache Hadoop.
  • 12. HDP delivers a Centralized Architecture. Modern Data Architecture • Unifies data and processing. • Enables applications to have access to all your enterprise data through an efficient centralized platform • Supported with a centralized approach to governance, security and operations • Versatile to handle any applications and datasets no matter the size or type. [Diagram labels: SOURCES: Existing Systems (ERP, CRM, SCM), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured; HDFS (Hadoop Distributed File System); YARN: Data Operating System; Batch, Interactive, Real-Time, Partner ISV; EDW; MPP; ANALYTICS: Data Marts, Applications, Business Analytics, Visualization & Dashboards]
  • 13. HDP delivers a completely open data platform: Hortonworks Data Platform 2.2. Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data. Completely Open • HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations • All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem. [Diagram labels: HDFS (Hadoop Distributed File System); YARN: Data Operating System (Cluster Resource Management); BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm; GOVERNANCE: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka; SECURITY: Apache Ranger, Apache Knox, Apache Falcon; OPERATIONS: Apache Ambari, Apache Zookeeper, Apache Oozie]
  • 14. Real World Use Case: Trucking Company. Spring 2015. Hortonworks. We do Hadoop.
  • 15. Scenario Overview
  • 16. Trucking company w/ large fleet of trucks in Midwest. A truck generates millions of events for a given route; an event could be: • 'Normal' events: starting / stopping of the vehicle • 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance. Company uses an application that monitors truck locations and violations from the truck/driver in real-time. Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers (Route? Truck? Driver?).
  • 17. Trucking Company’s YARN-enabled Architecture. [Diagram labels: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; SQL Interactive Query (Hive on Tez); Many Workloads: YARN; Distributed Storage: HDFS] One cluster with consistent security, governance & operations.
  • 18. Trucking Company’s YARN-enabled Architecture (diagram repeated from slide 17)
  • 19. What is Kafka? • High throughput distributed messaging system • Publish-subscribe semantics, but re-imagined at the implementation level to operate at speed with big data volumes • Kafka @LinkedIn: 800 billion messages per day; 175 terabytes of data written per day; 650 terabytes of data read per day; over 13 million messages/2.75GB of data per second. [Diagram: producers write to a Kafka cluster; consumers read from it]
  • 20. Kafka: Anatomy of a Topic. [Diagram: Partition 0, Partition 1 and Partition 2, each an ordered sequence of messages numbered from 0 (old) upward (new); writes are appended at the new end] • Partitioning allows topics to scale beyond a single machine/node • Topics can also be replicated, for high availability.
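The partition-and-offset idea on the topic slide can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `ToyTopic` is a hypothetical class, not the Kafka client API, and the CRC32 hash is a stand-in chosen for determinism rather than the partitioner Kafka actually uses.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned topic (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # (partition, offset of the new message)

    def read(self, partition, offset):
        # Consumers read forward from an offset; old messages stay in place.
        return self.partitions[partition][offset:]
```

Because every message for one truck shares a key, its events keep their order inside a single partition, while different trucks can spread across partitions and therefore across machines.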
  • 21. Trucking Company’s YARN-enabled Architecture (diagram repeated from slide 17)
  • 22. Apache Storm • Distributed, real-time, fault-tolerant stream processing platform. • Provides processing guarantees. • Key concepts include: Tuples, Streams, Spouts, Bolts, Topology
  • 23. Tuples and Streams • What is a Tuple? – The fundamental data structure in Storm: a named list of values that can be of any data type. • What is a Stream? – An unbounded sequence of tuples. – Streams are the core abstraction in Storm and are what you “process” in Storm.
  • 24. Spouts • What is a Spout? – Generates, or is a source of, streams – E.g.: JMS, Twitter, Log, Kafka Spout – Can spin up multiple instances of a Spout and dynamically adjust as needed
  • 25. Bolts • What is a Bolt? – Processes any number of input streams and produces output streams – Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic – Can spin up multiple instances of a Bolt and dynamically adjust as needed • Bolts used in the Use Case: 1. HBaseBolt: persisting and counting in HBase 2. HDFSBolt: persisting into HDFS as Avro files using Flume 3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
  • 26. Topology • What is a Topology? – A network of spouts and bolts wired together into a workflow. [Diagram: Truck-Event-Processor Topology, with a Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt connected by streams]
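As a rough illustration of the spout, bolt and topology concepts above, the sketch below wires Python generators into a linear pipeline. It is not the Storm API: the function names, the dictionary-shaped tuples and the field names `driver_id` and `event_type` are all assumptions made for the example, and a real topology such as the one on the slide can fan one stream out to several bolts.

```python
def truck_event_spout(events):
    """Spout: a source of tuples. Here it simply replays a list of dicts."""
    for event in events:
        yield event


def violation_filter_bolt(stream):
    """Bolt: consumes one stream and emits only the violation tuples."""
    for tup in stream:
        if tup["event_type"] != "Normal":
            yield tup


def count_by_driver_bolt(stream):
    """Bolt: keeps a running violation count per driver and emits it."""
    counts = {}
    for tup in stream:
        driver = tup["driver_id"]
        counts[driver] = counts.get(driver, 0) + 1
        yield {"driver_id": driver, "violations": counts[driver]}


def run_topology(events):
    """Topology: the spout and bolts wired together into one workflow."""
    stream = truck_event_spout(events)
    stream = violation_filter_bolt(stream)
    return list(count_by_driver_bolt(stream))
```

Each stage only sees tuples arriving on its input stream, which is what lets a real deployment run many instances of each spout or bolt in parallel.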
  • 27. Trucking Company’s YARN-enabled Architecture (diagram repeated from slide 17)
  • 28. Key Constructs in Apache HBase • HBase = Key / Value store • Designed for petabyte scale • Supports low latency reads, writes and updates • Key features – Updateable records – Versioned records – Distributed across a cluster of machines – Low latency – Caching • Popular use cases: – User profiles and session state – Object store – Sensor apps
  • 29. Data Assignment. [Diagram: keys within an HBase table are divided among different RegionServers]
  • 30. Data Access • Get – Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey • Put – Inserts a new version of a cell. • Scan – Reads the whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key • Delete – Actually a form of put: it adds a new version that carries a deletion marker • SQL via Apache Phoenix – Unique capability in the NoSQL market
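The get, put, scan and delete semantics listed above can be mimicked with a toy versioned store. This is a sketch only: `ToyTable` is hypothetical and unrelated to the HBase client API, column families are omitted, and treating the scan's end key as exclusive is an assumption made here for the example.

```python
import itertools

_TOMBSTONE = object()  # deletion marker stored as just another version


class ToyTable:
    """Toy versioned key/value table (hypothetical, not the HBase client API)."""

    def __init__(self):
        self._cells = {}  # (row, column) -> list of (version, value), newest last
        self._clock = itertools.count(1)

    def put(self, row, column, value):
        # A put never overwrites; it appends a new version of the cell.
        self._cells.setdefault((row, column), []).append((next(self._clock), value))

    def delete(self, row, column):
        # A delete is a put of a deletion marker.
        self.put(row, column, _TOMBSTONE)

    def get(self, row, column):
        versions = self._cells.get((row, column))
        if not versions or versions[-1][1] is _TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_row, stop_row):
        # Rows come back in sorted key order: start inclusive, stop exclusive.
        rows = {}
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                value = self.get(row, column)
                if value is not None:
                    rows.setdefault(row, {})[column] = value
        return list(rows.items())
```

Keeping old versions and writing a marker instead of erasing data is what makes "delete is a version of put" a natural description.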
  • 31. Trucking Company’s YARN-enabled Architecture (diagram repeated from slide 17)
  • 32. [Timeline diagram: 2006, 2009, October 23, 2013] Hadoop w/ MapReduce: MapReduce on HDFS (Hadoop Distributed File System), largely batch processing; silo’d clusters, largely batch system, difficult to integrate. MR-279: YARN. Hadoop 2 & YARN: Hadoop 2 & YARN-based architecture, with YARN: Data Operating System on HDFS (Hadoop Distributed File System) running batch, interactive and real-time workloads. Architected & led development of YARN to enable the Modern Data Architecture.
  • 33. Benefits of YARN as the Data Operating System • The container-based model allows for running nearly any workload. – Enables the centralized architecture. – No longer is MapReduce the only data processing engine. – Docker containers managed by YARN. Yes Please! • Decouples resource scheduling from application lifecycle. – Improved scalability and fault tolerance • Dynamically allocated resources, resulting in HUGE utilization gains – Versus static allocation of “slots” in Hadoop 1.0. Yahoo has over 30,000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time. They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.
  • 34. Trucking Company’s YARN-enabled Architecture (diagram repeated from slide 17)
  • 35. Apache HDFS – Hadoop Distributed File System • Very large scale distributed file system • 10K nodes, tens of millions of files and PBs of data • Supports large files • Designed to run on commodity hardware, assumes hardware failures • Files are replicated to handle hardware failure • Detects failures and recovers from them automatically • Optimized for large scale processing • Data locations are exposed so that the computations can move to where data resides • Data coherency • Write-once, read-many-times access pattern • Files are broken up into chunks called ‘blocks’ • Blocks are distributed over nodes
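To make "files are broken up into blocks, replicated, and distributed over nodes" concrete, here is a minimal sketch. The function names are hypothetical, the block size and replication factor are plain parameters, and the round-robin placement is a simplifying assumption; it is not the placement policy HDFS itself applies.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement
```

With every block held on several nodes, losing one machine leaves each block readable elsewhere, and a scheduler that knows the placement can run computation on a node that already holds the data.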
  • 36. Streaming Demo - High Level Architecture. [Diagram labels: Truck Streaming Data T(1), T(2), … T(N); Inbound Messaging (Kafka): Truck Events Topic; Storm Stream Processing on YARN: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt; HBase: Dangerous Events Table; Distributed Storage: HDFS: Truck Events; ActiveMQ; Web App]
  • 37. Demo – Streaming Dashboard
  • 38. A New Challenge
  • 39. CDO’s vision: Build a Predictive Business, not a Reactive one. CDO’s Requirements • Offline predictions: identify investments that will increase safety and reduce company’s liabilities • Real-time predictions: anticipate driver violations before they happen and take precautionary actions. Data Scientist’s Response • Need to explore data & form a hypothesis • Verify trends against TBs of events data via machine learning • Generate predictive models with Spark MLlib on HDP • Plug models into the Storm topology to predict driver violations in real-time. ♬ I’ve been waiting for this moment all my life ♬
  • 40. Demo – Analyzing Events with Tableau
  • 41. Analyzing Raw Events – dangerous drivers
  • 42. Analyzing Raw Events – dangerous routes
  • 43. Analyzing Raw Events – violations by location
  • 44. Enriching truck events for analysis with Pig. [Data flow diagram: inputs in HDFS are Raw Truck Events, Raw Weather Data (weather data sets) and Payroll Data (HR & payroll DBs), with metadata in HCatalog. Steps: Load raw truck events; Clean & Filter (cleaned events); Transform (transformed events); Join with HR & weather data (enriched events); Store enriched events; analyze in Tableau]
  • 45. Analyzing Enriched Events – noncertified and fatigued drivers more dangerous
  • 46. Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers
  • 47. Analyzing Enriched Events – foggy weather leads to violations
  • 48. Analyzing Enriched Events – but top 3 safest routes are also foggy
  • 49. Integrating Predictive Analytics
  • 50. Building the Predictive Model on HDP. 1. (Tableau) Explore a small subset of events to identify predictive features and make a hypothesis, e.g. “foggy weather causes driver violations”. 2. Identify suitable ML algorithms to train a model; we will use classification algorithms as we have labeled events data. 3. Transform enriched events data to a format that is friendly to Spark MLlib; many ML libs expect training data in a certain format. 4. Train a logistic regression model in Spark on YARN, with the above events as training input, and iterate to fine-tune the generated model. 5. Integrate the Spark MLlib model in a Storm bolt to predict violations in real time.
  • 51. Integrate Predictive Analytics in Stream Processing. [Diagram labels: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm) with Prediction Bolt; Interactive Query (Hive on Tez); Real-time Serving (HBase); Machine Learning (Spark); YARN; HDFS: Millions of Enriched Truck Events] Train the Spark ML model with millions of truck events, then plug the Spark model into a Storm bolt.
  • 52. Streaming Demo - Updated Architecture. [Diagram labels: Truck Streaming Data T(1), T(2), … T(N); Inbound Messaging (Kafka): Truck Events Topic; Storm Stream Processing on YARN: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt, Prediction Bolt; HBase: PayRoll Table; Distributed Storage: HDFS: Truck Events; ActiveMQ; Web App] Prediction Bolt: enrich event, predict violation in real time & alert via MQ. Web App: render real-time predictions on UI.
  • 53. Transforming training data for Spark MLlib
      Enriched Events Data:
      Event Type | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Longitude | Latitude | Weather Foggy | Weather Rainy | Weather Windy
      Normal | Yes | Hourly | 45 | 2721 | -91.3 | 38.14 | No | No | No
      Overspeed | No | Miles | 72 | 4152 | -94.23 | 37.09 | Yes | Yes | No
      … | … | … | … | … | … | … | … | … | …
      Spark MLlib Training Data:
      Label | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Weather Foggy | Weather Rainy | Weather Windy
      0 | 1 | 1 | 0.45 | 0.2721 | 0 | 0 | 0
      1 | 0 | 0 | 0.72 | 0.4152 | 1 | 1 | 0
      … | … | … | … | … | … | … | …
      Notes: Normal events are labeled 0 and violation events 1. Feature scaling is applied to hours and miles to improve algorithm performance. Features with binary values are denoted as 0 and 1.
  • 54. Running Spark ML on YARN
      1. Run the spark-submit script to launch a Spark job on YARN (/user/root/truck_training is the training data location on HDFS): spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 truckml.jar --algorithm LR --regType L2 --regParam 1.0 /user/root/truck_training --numIterations 100
      2. Monitor progress of the Spark job in the YARN Resource Manager UI.
  • 55. Interpreting Spark Logistic Regression Results. Precision: 87.5%. Recall: 88%. Top three predictors of violations: 1. Foggy Weather; 2. Rainy Weather; 3. Driver Certification.
  • 56. Integrating Spark model in Storm. Diagram labels: Kafka Spout; Storm Prediction Bolt; Real-time Serving (HBase); Active MQ; Ops Center; LOB Dashboards. Prediction Bolt steps:
      - Initialize Spark model
      - Parse truck event
      - Enrich event with HBase data
      - Predict violation with model
      - Send alert if violation predicted
  • 57. Summary: Solution Value
  • 58. Value of large scale ML on HDP
      - Accelerate time to market/value
      - Test out multiple ML algorithms against TBs of training data in reasonable time frames
      - Confirm hypothesis against TBs of training data with confidence
      - We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
      - Easily integrate predictive models in data driven apps
      - Run predictive models in Storm or any other app in your enterprise
      - Run all of the above in a multi-tenant YARN cluster
      - Large scale ML on YARN respects other tenants in an HDP cluster
  • 59. Recommendations to CDO
      - Investment recommendations, in order of priority:
          1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
          2. Invest in slip resistant tires to fight rainy conditions
          3. Invest in certifying drivers to reduce violation probability
      - Power of real time predictions: 40% reduction in violation rates by predicting high risk situations in real time and sending immediate alerts to drivers
  • 60. Predictive Demo
  • 61. Q & A
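
The data transformation on slide 53 and the "predict violation with model" step on slide 56 can be sketched together in a few lines of plain Python. This is only an illustration, not the demo's actual code: the field names, the scaling divisors (inferred from 45 → 0.45 and 2721 → 0.2721 on the slide), the Hourly = 1 encoding, and the model weights are all assumptions, and a real deployment would load the coefficients produced by the Spark MLlib training run rather than hard-coding them.

```python
from math import exp

# Assumed scaling divisors, inferred from the slide's example rows.
HOURS_SCALE = 100.0
MILES_SCALE = 10000.0


def yes_no(value):
    """Encode a Yes/No field as 1.0/0.0."""
    return 1.0 if value.strip().lower() == "yes" else 0.0


def to_labeled_point(event):
    """Turn an enriched event (dict with hypothetical keys) into (label, features).

    Normal events get label 0, any violation gets label 1. Longitude and
    latitude are dropped, as in the slide's training table.
    """
    label = 0 if event["event_type"] == "Normal" else 1
    features = [
        yes_no(event["certified"]),
        1.0 if event["wage_plan"] == "Hourly" else 0.0,
        event["hours"] / HOURS_SCALE,
        event["miles"] / MILES_SCALE,
        yes_no(event["foggy"]),
        yes_no(event["rainy"]),
        yes_no(event["windy"]),
    ]
    return label, features


def predict_violation(weights, intercept, features, threshold=0.5):
    """Logistic regression scoring: sigmoid of the weighted feature sum."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + exp(-margin))
    return probability, probability >= threshold


# Hypothetical coefficients, chosen only so that fog and rain raise the risk
# and driver certification lowers it.
WEIGHTS = [-1.0, 0.0, 0.0, 0.0, 2.0, 1.5, 0.0]
INTERCEPT = -1.5

overspeed_event = {
    "event_type": "Overspeed", "certified": "No", "wage_plan": "Miles",
    "hours": 72, "miles": 4152, "foggy": "Yes", "rainy": "Yes", "windy": "No",
}
label, features = to_labeled_point(overspeed_event)
probability, alert = predict_violation(WEIGHTS, INTERCEPT, features)
```

In a Storm bolt, `to_labeled_point` would run on each enriched tuple and an alert would be published to the message queue whenever `alert` is true.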

Editor's Notes

  1. Why do we now have electric cars at production scale, quad copters and drones cheap enough for the home hobbyist, and VR displays being bought by companies like Facebook?
  2. Because the technology is there now, thanks to advances made in other industries, solving problems at scale in a big marketplace.
  3. At scale (in this case): $270 bn smartphone market in 2014; $120 bn internet advertising (projected 2015).
  4. Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important. Existing approaches for data management have become both technically and commercially impractical. Technically: these systems were never designed to store or process vast quantities of data. Commercially: the licensing structures of the traditional approach are no longer feasible. These two challenges, combined with the rate at which data is being produced, created a need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will be from these new data sources.
  5. Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
  6. Single focus: enabling Apache Hadoop as an enterprise data platform for any app and any data type. In the open, partner for success.
  7. Everything in the open.
  8. Joint deep engineering with Microsoft (HDInsight), HP, SAP and Teradata.
  9. In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo! This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to have access to all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop that provides the versatility to handle any application and dataset no matter the size or type. Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies. This work allowed for a new approach to data to emerge, the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
  10. Our product, the Hortonworks Data Platform (or HDP for short), is a completely open source, enterprise-grade data platform that’s comprised of dozens of Apache open source projects, with Apache Hadoop and YARN at its center. We have a comprehensive engineering, testing, and certification process that integrates and packages all of these components into a cohesive platform that the enterprise can consume and deploy at scale. Our model also enables us to proactively bring new innovations and new open source projects into HDP as they emerge. To ensure the highest quality, we have a test suite, unique to Hortonworks, that is comprised of tens of thousands of system and integration tests that we run at scale on a regular basis, including on the world’s largest Hadoop clusters at Yahoo! as part of our co-development relationship. While our pure-play competitors focus on proprietary components for security, operations, and governance, we invest in new open source projects that address these areas. For example, earlier in 2014, we acquired a small company called XA Secure that provided a comprehensive security and administration product. We moved the technology wholesale into open source as Apache Ranger. Since our security, operations and governance technologies are open source projects, our partners are able to work with us on those projects to ensure deep integration within our joint solution architectures.
  11. Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
  12. Elastic Search Flume Sink does exist
  13. Elastic Search Flume Sink does exist
  14. The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics. A Kafka topic is just a sharded write-ahead log. Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week). This allows usage in situations where the consumer of data may need to reload data. It also makes it possible to support space-efficient publish-subscribe, as there is a single shared log no matter how many consumers; in traditional messaging systems there is usually a queue per consumer, so adding a consumer doubles your data size. This makes Kafka a good fit for things outside the bounds of normal messaging systems, such as acting as a pipeline for offline data systems such as Hadoop. These offline systems may load only at intervals as part of a periodic ETL cycle, or may go down for several hours for maintenance, during which time Kafka is able to buffer even TBs of unconsumed data if needed. Replication for HA/fault tolerance is built in. It is a pull-based system for consumers instead of a push-based one. Crude benchmark: single-threaded synchronous messages reach 400k per second when using 6 "datanode-ish" servers; this goes up to 2+ million when using partitions and asynchronous messages. Server specs in the benchmark: Intel Xeon 2.5 GHz processor with six cores; six 7200 RPM SATA drives; 32GB of RAM; 1Gb Ethernet.
  15. A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing. Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. http://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
  16. Elastic Search Flume Sink does exist
  17. Elastic Search Flume Sink does exist
  18. Elastic Search Flume Sink does exist
  19. This all changed with the introduction of Hadoop 2 and YARN. Introduced in October 2013, Hadoop 2 changed everything. YARN was introduced in MR-279 by Arun Murthy in 2009; Arun and the team at Hortonworks architected and led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop. With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time. Today, YARN, at the core of Hadoop, is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology that has started a transition to a data lake within organizations. Simply stated: Hortonworks architected and led development of YARN in order to enable the Modern Data Architecture.
  20. Elastic Search Flume Sink does exist
  21. Data is ingested, it’s on the dashboard, and it’s in HDFS.
  22. Data is ingested, it’s on the dashboard, and it’s in HDFS.
  23. We’re going to explore a SUBSET of the data (fewer than 1 million records).
  24. BinaryClassification example from Spark LogisticRegression model
  25. Elastic Search Flume Sink does exist
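
Notes 14 and 15 describe a Kafka topic as a partitioned log that consumers pull from by offset, with each partition owned by exactly one consumer in a group. The toy in-memory model below illustrates those three ideas. It is a sketch under stated assumptions and does not use the Kafka client API: the `Topic` class, the `assign_partitions` helper, the byte-sum key hash, and the round-robin assignment are all hypothetical stand-ins for what a real broker and its partitioner do.

```python
class Topic:
    """Toy model of a topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Records with the same key always land in the same partition, which
        # is what preserves per-key ordering. The byte-sum hash is a simple
        # stand-in, not Kafka's actual partitioner.
        index = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index

    def fetch(self, partition, offset, max_records=10):
        # Pull model: the consumer asks for records starting at its own
        # offset. Reading does not delete anything, so a consumer can rewind
        # and re-read.
        return self.partitions[partition][offset:offset + max_records]


def assign_partitions(num_partitions, consumers):
    """Round-robin assignment: each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


topic = Topic(num_partitions=4)
first = topic.publish("truck-1", "event-a")
second = topic.publish("truck-1", "event-b")
replayed = topic.fetch(first, offset=0)
group = assign_partitions(4, ["consumer-1", "consumer-2"])
```

Because one consumer owns each partition, records for a given key are processed in order, while different partitions are processed in parallel across the group.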