Solving Big Data Problems
using Hortonworks
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hortonworks Company Profile
• The only 100% open source Apache Hadoop data platform
• Founded in 2011
• 1st Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
• 800+ employees across 17 countries
• 1,350 technology partners
• Fastest company to reach $100M in revenue
Let’s talk about Big Data
(September 2014 survey of 100 CIOs from the US and Europe)
What problems and opportunities does Big Data
create?
New data: data that traditional platforms cannot handle
The Opportunity
Unlock transformational business value from full-fidelity data and analytics across all of your data.
Geolocation
Server logs
Files & emails
ERP, CRM, SCM
Traditional Data Sources
New Data Sources
Sensors
and machines
Clickstream
Social media
The Future of Data: Actionable Intelligence
(Diagram: Data in Motion flowing from the Internet of Anything into storage, alongside Data at Rest)
Hortonworks Data Platform
H O R T O N W O R K S D ATA P L AT F O R M
Batch Interactive Search Streaming Machine Learning
YARN Resource Management System
Sources: Clickstream, Sensor, Social, Mobile, Geolocation, Server Logs, Existing Systems
HDP is a collection of Apache Projects
HORTONWORKS DATA PLATFORM
Hadoop & YARN
Flume
Oozie
Pig
Hive
Tez
Sqoop
Cloudbreak
Ambari
Slider
Kafka
Knox
Solr
Zookeeper
Spark
Falcon
Ranger
HBase
Atlas
Accumulo
Storm
Phoenix
DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY
Ongoing Innovation in Apache
• HDP 2.0: Oct 2013
• HDP 2.1: April 2014
• HDP 2.2: Dec 2014
• HDP 2.3: July 2015
(Release matrix: each HDP release ships updated, certified versions of the component projects above.)
Hortonworks Data Flow
Visual User Interface
Drag and drop for efficient, agile operations
Immediate Feedback
Start, stop, tune, replay dataflows in real-time
Adaptive to Volume and Bandwidth
Any data, big or small
Event Level Data Provenance
Governance, compliance & data evaluation
Secure Data Acquisition & Transport
Fine grained encryption for controlled data
sharing and selective data democratization
Powered by
Apache NiFi
HDF and HDP Deliver a Complete Big Data Solution
• HDF dynamically connects HDP to
data at the edge
• HDF secures and encrypts the
movement of data into HDP
• HDF includes mature IoAT data
protocols that improve device
extensibility
• HDF supports easily adjustable bi-directional IoAT dataflows
• HDF offers traceability of IoAT data
with lineage and audit trails
• HDF brings a real-time, visual user
interface to manipulate live dataflows
Hortonworks Revenue Model
HDP and HDF are 100% free and
Open Source – no license.
Our customers subscribe to
support, consulting experts and
training programs
Annual Subscriptions
align your success with ours
Expert Consulting & Training
help your team get to actionable intelligence
as efficiently as possible
(Lifecycle: Architect & Develop → Deploy → Operate → Expand, repeated across Projects 1–6)
Sales Plays
Hadoop Driver: Cost optimization
Archive Data off EDW
Move rarely used data to Hadoop as active
archive, store more data longer
Offload costly ETL process
Free your EDW to perform high-value functions like
analytics & operations, not ETL
Enrich the value of your EDW
Use Hadoop to refine new data sources, such as
web and machine data for new analytical context
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
HDP helps you reduce costs and optimize the value associated with your EDW
ANALYTICS / DATA SYSTEMS
Data
Marts
Business
Analytics
Visualization
& Dashboards
HDP 2.3
ELT
Cold Data,
Deeper Archive
& New Sources
Enterprise Data
Warehouse
Hot
MPP
In-Memory
Clickstream Web	
&	Social
Geolocation Sensor	
& Machine
Server	
Logs
Unstructured
Existing Systems
ERP CRM SCM
SOURCES
Single View
Improve acquisition and retention
Predictive Analytics
Identify your next best action
Data Discovery
Uncover new findings
Financial Services
New Account Risk Screens Trading Risk Insurance Underwriting
Improved Customer Service Insurance Underwriting Aggregate Banking Data as a Service
Cross-sell & Upsell of Financial Products Risk Analysis for Usage-Based Car Insurance Identify Claims Errors for Reimbursement
Telecom
Unified Household View of the Customer Searchable Data for NPTB Recommendations Protect Customer Data from Employee Misuse
Analyze Call Center Contacts Records Network Infrastructure Capacity Planning Call Detail Records (CDR) Analysis
Inferred Demographics for Improved Targeting Proactive Maintenance on Transmission Equipment Tiered Service for High-Value Customers
Retail
360° View of the Customer Supply Chain Optimization Website Optimization for Path to Purchase
Localized, Personalized Promotions A/B Testing for Online Advertisements Data-Driven Pricing, improved loyalty programs
Customer Segmentation Personalized, Real-time Offers In-Store Shopper Behavior
Manufacturing
Supply Chain and Logistics Optimize Warehouse Inventory Levels Product Insight from Electronic Usage Data
Assembly Line Quality Assurance Proactive Equipment Maintenance Crowdsource Quality Assurance
Single View of a Product Throughout Lifecycle Connected Car Data for Ongoing Innovation Improve Manufacturing Yields
Healthcare
Electronic Medical Records Monitor Patient Vitals in Real-Time Use Genomic Data in Medical Trials
Improving Lifelong Care for Epilepsy Rapid Stroke Detection and Intervention Monitor Medical Supply Chain to Reduce Waste
Reduce Patient Re-Admittance Rates Video Analysis for Surgical Decision Support Healthcare Analytics as a Service
Oil & Gas
Unify Exploration & Production Data Monitor Rig Safety in Real-Time Geographic exploration
DCA to Slow Well Declines Curves Proactive Maintenance for Oil Field Equipment Define Operational Set Points for Wells
Government
Single View of Entity CBM & Autonomic Logistic Analysis Sentiment Analysis on Program Effectiveness
Prevent Fraud, Waste and Abuse Proactive Maintenance for Public Infrastructure Meet Deadlines for Government Reporting
Hadoop Driver: Advanced analytic applications
NiFi and HDF Drivers
Optimize Splunk:
Reduce costs by pre-filtering data so that
only relevant content is forwarded into Splunk
Ingest Logs for Cyber Security:
Integrated and secure log collection for real-
time data analytics and threat detection
Feed Data to Streaming Analytics:
Accelerate big data ROI by streaming data
into analytics systems such as Apache Storm
or Apache Spark Streaming
Move Data Internally:
Optimize resource utilization by moving
data between data centers or between
on-premises infrastructure and cloud
infrastructure
Capture IoT Data:
Transport disparate and often remote IoT
data in real time, despite any limitations
in device footprint, power or
connectivity—avoiding data loss
Hadoop Driver: Enabling the data lake
(Chart axes: SCALE and SCOPE)
Data Lake Definition
• Centralized Architecture
Multiple applications on a shared data set
with consistent levels of service
• Any App, Any Data
Multiple applications accessing all data
affording new insights and opportunities.
• Unlocks ‘Systems of Insight’
Advanced algorithms and applications
used to derive new value and optimize
existing value.
Drivers:
1. Cost Optimization
2. Advanced Analytic Apps
Goal:
• Centralized Architecture
• Data-driven Business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight
Case Study: 12 month Hadoop evolution at TrueCar
Data platform capabilities built over a 12 month execution plan:
• June 2013: Begin Hadoop execution
• July 2013: Hortonworks partnership
• Aug 2013: Training & dev begins
• Nov 2013: Production cluster (60 nodes, 2 PB)
• Dec 2013: Three production apps (3 total)
• Jan 2014: 40% of dev staff proficient
• Feb 2014: Three more production apps (6 total)
• May 2014: IPO
12 Month Results at TrueCar
• Six production Hadoop applications
• Sixty nodes / 2 PB of data
• Storage/compute costs reduced from $19/GB to $0.12/GB
“We addressed our data platform capabilities
strategically as a pre-cursor to IPO.”
Hortonworks Data Platform
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for
managing large volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
ü Manages new data paradigm
ü Handles data at scale
ü Cost effective
ü Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce
(2006 – 2009)
HDFS	
(Hadoop	Distributed	File	System)
MapReduce
Largely	Batch	Processing
Hadoop w/	
MapReduce
YARN: Data Operating System
HDFS
(Hadoop Distributed File System)
Hadoop2 & YARN based Architecture
Silo’d clusters
Largely batch system
Difficult to integrate
MR-279:	YARN
Hadoop 2 & YARN
Batch, Interactive, Real-Time
Architected &
led development
of YARN to enable
the Modern Data
Architecture
October 23, 2013
Apache Hadoop – Data Operating System
Shared Compute & Workload Management
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
Common & Shared Scale Out Storage
• Shared data assets
• Flexible schema
• Cross workload access
YARN: Data Operating System
(Cluster Resource Management)
Script
Pig
SQL
Hive
Tez Tez
Java
Scala
Cascading
Tez
Others
ISV
Engines
HDFS
(Hadoop Distributed File System)
Stream
Storm
Search
Solr
NoSQL
HBase
Accumulo
Slider Slider
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
In-Memory
Spark
Enterprise Hadoop
Core Capabilities of Enterprise Hadoop
Load data and
manage according
to policy
Deploy and
effectively
manage the
platform
Store and process all of your Corporate Data Assets
Access your data simultaneously in multiple ways
(batch, interactive, real-time)
Provide a layered approach to security through
Authentication, Authorization, Accounting, and Data Protection
DATA		MANAGEMENT
SECURITYDATA		ACCESS
GOVERNANCE	&	
INTEGRATION
OPERATIONS
Enable both existing and new applications to
provide value to the organization
PRESENTATION	&	APPLICATION
Empower existing operations and
security tools to manage Hadoop
ENTERPRISE	MGMT	&	SECURITY
Provide deployment choice across physical, virtual, cloud
DEPLOYMENT	OPTIONS
Hortonworks Data Platform 2.3
YARN : Data Operating System
DATA ACCESS SECURITY
GOVERNANCE &
INTEGRATION
OPERATIONS
Administration
Authentication
Authorization
Auditing
Data Protection
Ranger
Knox
Atlas
HDFS Encryption
Data Workflow
Sqoop
Flume
Kafka
NFS
WebHDFS
Provisioning,
Managing, &
Monitoring
Ambari
Cloudbreak
Zookeeper
Scheduling
Oozie
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory
Spark
Others
ISV Engines
Tez Tez Tez Slider Slider
HDFS Hadoop Distributed File System
DATA MANAGEMENT
Hortonworks Data Platform 2.3
Deployment Choice: Linux, Windows, On-Premises, Cloud
Data Lifecycle &
Governance
Falcon
Atlas
Architectures
Basic EDW Cost Optimization Architecture
Batch
Sqoop
Transform
Processed
Hive
Raw
HDFS
Interactive
HiveServer
Reporting
BI Tools
Load
EDW
Existing Analytics
Fetch
1
2
3
4
External
Tables
More than cost savings: enrich with new data
Batch
Sqoop
Transform
Processed
Hive
Raw
HDFS
Interactive
HiveServer
Reporting
BI Tools
Load
EDW
New Sources
Streaming
NiFi
Load
Existing Analytics
Fetch
New Analytics
1
2
3
4
5
6
External
Tables
Streaming Solution Architecture
HDP 2.x Data Lake
YARN
HDFS
APACHE	
KAFKA
Search
Solr
Slider
Online	Data	
Processing
HBase
Accumulo
Real	Time	Stream	
Processing
Storm SQL
HiveStreaming
Ingest
HDFS
HDP 2.x
Real-time
data feeds
Key Tenets of Lambda Architecture
§ Batch Layer
§ Manages master data
§ Immutable, append-only set of raw data
§ Cleanse, Normalize & Pre-Compute
Batch Views
§ Advanced Statistical Calculations
§ Speed layer
§ Real Time Event Stream Processing
§ Computes Real-Time Views
§ Serving Layer
§ Low-latency, ad-hoc query
§ Reporting, BI & Dashboard
New Data
Stream
Store Pre-Compute Views
Process
Streams
Incremental
Views
Business
View
Business
View
Query
SPEED LAYER
BATCH LAYER
SERVING LAYER
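The three layers above can be sketched in a few lines of plain Python. This is an illustrative toy, not HDP code: the batch layer pre-computes a view over the immutable master dataset, the speed layer maintains an incremental view over newly arrived events, and the serving layer merges both at query time. The event names are invented for the example.

```python
from collections import Counter

# Immutable, append-only master dataset (batch layer input)
# and events that arrived after the last batch run (speed layer input).
raw_events = ["login", "click", "click", "purchase"]
new_events = ["click", "purchase"]

# Batch layer: pre-compute a view over all master data.
batch_view = Counter(raw_events)

# Speed layer: incrementally maintained real-time view.
realtime_view = Counter(new_events)

# Serving layer: answer low-latency queries by merging the two views.
def query(event_type):
    return batch_view[event_type] + realtime_view[event_type]

print(query("click"))     # 2 from batch + 1 from speed = 3
print(query("purchase"))  # 1 + 1 = 2
```

When the next batch run absorbs `new_events` into the master dataset, the real-time view is discarded and rebuilt, which is what keeps the speed layer small.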
HDP and HDF
High Level Big Data IoT Architecture
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Page 28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Project Cost & ROI
www.hortonworks.com
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them.
Ms. Brady sets the goal of
reducing incidents by 5%
within 90 days.
Incidents involving maintenance vehicles have continued to increase under COO Brady’s watch
(Chart: insurance premiums rising year over year, 2012 to 2015, reaching $17.5M)
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of the incidents and reduce them.
Business Analyst
Tam
Mega Corp has a problem
Given the current premium cost of $3,500 per
truck on 5,000 trucks, a 10% reduction in
incidents will move the company from the high
risk insurance category they are currently in
and save the company $1,000 on its
insurance premium per truck per year,
or $5,000,000 annually.
Business Analyst
Tam
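Tam's savings estimate can be sanity-checked in a few lines, using only the figures stated above:

```python
# Sanity check of Tam's savings estimate for Mega Corp.
TRUCKS = 5000
SAVING_PER_TRUCK = 1000   # premium reduction per truck per year, USD

def annual_savings(trucks=TRUCKS, saving_per_truck=SAVING_PER_TRUCK):
    """Total premium reduction per year across the fleet."""
    return trucks * saving_per_truck

print(annual_savings())  # 5000 trucks x $1,000 = 5000000
```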
Tam considers four questions she must answer to better understand and mitigate incidents. They are:
1) Is there a correlation of driver
training to incidents?
2) Is there a correlation of weather
to incidents?
3) Is there a correlation between
certain driving behavior and
incidents?
4) Is it possible to predict incidents
before they occur?
Business Analyst
Tam
Shift from Reactive …to… Proactive & Prescriptive
• From reacting to human activity …to Behavioral Insight
• From static resource planning …to Resource Optimization
• From break-then-fix …to Preventative Maintenance
Initially, Tam’s team is concerned that they may not be
able to capture all the necessary data to answer the
questions Tam has posed and help her mitigate
incidents. They know that the data is not all
structured and some of it is created in real-time and
transmitted over the Internet. In addition, some data
will have to be captured from external sources.
Vehicle Data
Route Data
Weather Data
Structured Driver
Data
Semi-Structured
Maintenance Data
Sue, Varun, Jeff
DATA SYSTEMS
Enterprise Data
Warehouse
Hot
MPP
In-Memory
1
2
Clickstream Web	
&	Social
Geolocation Sensor	
& Machine
Server	
Logs
Unstructured
RDBMS, ERP, CRM
Systems of Record
The Team Recognizes That the Current Data Architecture
Limits Predictive Capabilities
1. Data Silos: difficult to find
predictive correlations
2. Data Volumes: cannot
store enough data to find
patterns
3. New Data Sources:
unable to capture and use
new data for real-time
analysis
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
3
DATA SYSTEMS
Enterprise Data
Warehouse
Hot
MPP
In-Memory
RDBMS, ERP, CRM
Systems of Record
The Team Leverages HDF & HDP to Expand The
Capabilities of Their Existing Data Platform
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
• Business Analyst + HDP Data Analyst Training = HDP Data Analyst
• Developer + Developer Training = HDP Developer
• System Admin + HDP System Admin Training = HDP Sys Admin
• SME + Data Science Training = HDP Data Scientist
Developer, System Admin, SME: Sue, Varun, Jeff
Business Analyst
Tam
The team engages their favorite SI and attends Hortonworks University training to get the project under way
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Project Cost & ROI
Stream	Processing	&	Modeling
(Kafka,		Storm	&	Spark)
Solution Architecture
Distributed	Storage:	HDFS
Many	Workloads:	YARN
Real-time Serving & Searching (HBase)
Alerts & Events
Real-Time Web App
Interactive Query (Hive on Tez)
SQL
Single	cluster	with	
consistent	security,	
governance	&	
operations
Collect,	Conduct	&	Curate	
(HDF	– Bidirectional	Data	Flow)
Truck	Sensors
The chosen solution provides Mega Corp with the
foundation to capture all the required data, analyze
correlations, and ultimately create a model that allows them
to predict and mitigate incidents before they happen.
Weather	Data
EDW
Sqoop
Tam and Varun build the
application
HDP Analyst
Tam Varun
DeveloperAnalyst
Ms. Brady is happy with the
results. She is able to
determine that a subset of
drivers are responsible for the
increased cost. But like most
managers she is not happy for
long. Now she wants to be able
to predict future incidents.
Data Scientist
Machine Learning
Jeff points out that HDP has a tremendous statistical algorithm library, and he can use these libraries to predict which drivers are likely to have an event before it occurs.
Jeff
Jeff implements the predicted-violations logic using machine learning on HDP and is able to predict events before they happen
Ms. Brady is happy now that
she can isolate where problems
exist, identify causal events
and build models that help
predict events before they
occur.
< TODO: Show St. Louis Case
Study >
http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Project Cost & ROI
Big Data Functional Architecture
Storm/Spark Streaming
Storm
Detailed Reference Architecture for IoT Applications
HDF
Flume
Sink to
HDFS
Transform
Interactive
UI Framework
Hive
Hive
HDFS
HDFS
SOURCE DATA
Server logs
Application Logs
Firewall Logs
CRM/ERP
Sensor
Kafka
Kafka
Stream to
HDF
Forward to
Storm
Real Time Storage
Spark-ML
Pig
Alerts
Bolt to
HDFS
Dashboard
Silk
JMS
Alerts
Hive Server
HiveServer
Reporting
BI Tools
High Speed
Ingest
Real-Time
Batch Interactive
Machine Learning
Models
Spark
Pig
Alerts SQOOP
Flume
Iterative ML
HBase/Phoenix
HBase
Event Enrichment
Spark-Thrift
Pig
Sample Ingest: NiFi
Apache Storm – Key Attributes
Open source, real-time event stream processing platform that provides continuous, low-latency processing for very high frequency streaming data
• Highly scalable: horizontally scalable like Hadoop; e.g., a 10-node cluster can process 1M tuples per second
• Fault-tolerant: automatically reassigns tasks on failed nodes
• Guaranteed processing: supports at-least-once & exactly-once processing semantics
• Language agnostic: processing logic can be defined in any language
• Apache project: brand, governance & a large, active community
Storm - Basic Concepts
Spouts: Generate streams.
Tuple: The most fundamental data structure; a named
list of values that can be of any datatype
Streams: Groups of tuples
Bolts: Contain data processing, persistence and alerting
logic. Can also emit tuples for downstream bolts
Tuple Tree: First spout tuple and all the tuples that were
emitted by the bolts that processed it
Topology: Group of spouts and bolts wired together into a
workflow
Topology
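The spout/bolt/topology vocabulary can be illustrated with a pure-Python toy. This is not the Storm API, just the dataflow concept: a spout emits tuples, a bolt applies processing logic and emits tuples for downstream bolts, and wiring them together forms a topology. The truck-speed data is invented for the example.

```python
# Toy illustration of Storm's spout -> bolt dataflow (not the real Storm API).

def sensor_spout():
    """Spout: generates a stream of tuples (named lists of values)."""
    readings = [("truck-1", 62), ("truck-2", 87), ("truck-3", 55)]
    for truck_id, speed in readings:
        yield {"truck": truck_id, "speed": speed}

def speeding_bolt(stream, limit=80):
    """Bolt: processing logic; emits new tuples for downstream bolts."""
    for tup in stream:
        if tup["speed"] > limit:
            yield {"truck": tup["truck"], "alert": "speeding"}

def alert_bolt(stream):
    """Terminal bolt: persistence/alerting logic."""
    return [tup["truck"] for tup in stream]

# Topology: spout and bolts wired together into a workflow.
alerts = alert_bolt(speeding_bolt(sensor_spout()))
print(alerts)  # ['truck-2']
```

In real Storm the same wiring is declared with a `TopologyBuilder`, and the framework distributes spout and bolt instances across the cluster.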
Distributed Database With Apache HBase
100%	Open	Source
Store	and	Process	Petabytes	of	Data
Flexible	Schema
Scale	out	on	Commodity	Servers
High	Performance,	High	Availability
Integrated	with	YARN
SQL	and	NoSQL Interfaces
YARN	:	Data	Operating	System
HBase
RegionServer
HDFS
(Permanent	Data	Storage)
HBase
RegionServer
HBase
RegionServer
Dynamic Schema
Scales Horizontally to PB of Data
Directly Integrated with Hadoop
HDP
Apache Phoenix – Relational Database Layer Over HBase
A SQL Skin for HBase
• Provides a SQL interface for managing data in HBase.
• Large subset of the SQL:1999 mandatory feature set.
• Create tables, insert and update data and perform low-latency point lookups through JDBC.
• Phoenix JDBC driver easily embeddable in any app that supports JDBC.
Phoenix Makes HBase Better
• Oriented toward online / transactional apps.
• If HBase is a good fit for your app, Phoenix makes it even better.
• Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
In-Memory With Spark
Spark
SQL
Spark
Streaming
MLlib GraphX
§ A data access engine for fast, large-scale data processing
§ Designed for iterative in-memory computations and interactive data mining
§ Provides expressive multi-language APIs for Scala, Java and Python
Spark ML for machine learning
Democratizes Machine Learning
Unsupervised tasks
• Clustering (K-means)
• Recommendation
• Collaborative Filtering: alternating least squares
• Dimensionality reduction: PCA, SVD
Supervised tasks
• Classification
• Naïve Bayes, Decision Tree, Random Forest, Gradient boosted trees
• Regression
• Linear models (SVM, linear regression, logistic regression)
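To make the clustering idea concrete, here is a minimal 1-D k-means in plain Python. This is a conceptual sketch only; Spark ML's k-means implements the same assign-then-recenter loop, but distributed across the cluster. The data points and starting centroids are invented for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centroids)

# Two well-separated groups of readings converge to their group means.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(points, centroids=[1.0, 12.0]))  # [2.0, 11.0]
```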
Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop
• Quickly analyze data in raw data files
• Proven at petabyte scale
• Compatible with all major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, and others
(Diagram: sensor, mobile and weblog data flowing into Hadoop for SQL queries, alongside operational/MPP systems)
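The style of aggregation Hive runs over raw files can be illustrated with standard SQL. The example below uses Python's built-in SQLite purely so it is runnable; Hive's DDL and dialect differ (e.g. external tables over HDFS files), and the `sensor_events` table is hypothetical.

```python
import sqlite3

# Hypothetical sensor table, standing in for a Hive table over raw files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_events (truck_id TEXT, speed INTEGER)")
conn.executemany(
    "INSERT INTO sensor_events VALUES (?, ?)",
    [("truck-1", 60), ("truck-1", 70), ("truck-2", 90)],
)

# A typical analytical aggregation: average speed per truck.
rows = conn.execute(
    "SELECT truck_id, AVG(speed) FROM sensor_events "
    "GROUP BY truck_id ORDER BY truck_id"
).fetchall()
print(rows)  # [('truck-1', 65.0), ('truck-2', 90.0)]
```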
Comparing SQL Options In HDP
Project Strengths Use	Cases Unique	Capabilities
Apache	Hive Most	comprehensive	SQL
Scale
Maturity
ETL	Offload
Reporting
Large-scale	aggregations
Robust	cost-based	optimizer
Mature	ecosystem	(BI,	
backup,	security	and	
replication)
SparkSQL In-memory
Low	latency
Exploratory	analytics
Dashboards
Language-integrated	Query
Apache	Phoenix Real-time	read	/	write
Transactions
High	concurrency
Dashboards
System-of-engagement
Drill-down	/	Drill-up
Real-time	read	/	write
Comparing Streaming Options In HDP
Apache Storm Spark	Streaming
One	At	A Time Micro	Batch	(minimum	 batch latency	=	500	ms)
Low	Latency Higher	Throughput
Operates	on	Tuple	Stream Operates	on	Streams	of	Tuple Batches
At	Least	Once	
(Trident	For	Exactly	Once)
Exactly	Once
Multiple	Language	Support Multiple	Language Support
Sizing
HDF Sizing & Best Practices Sustained Throughput
For Sustained
Throughput of 50MB/sec
and thousands of events
per second
• 1-2 nodes
• 8+ cores per node
(more is better)
• 6+ disks per node
(SSD or Spinning)
• 2 GB of mem per node
• 1 Gb bonded NICs
ideally
For Sustained
Throughput of
100MB/sec and tens of
thousands of events per
second
• 3-4 nodes
• 8+ cores per node
(more is better)
• 6+ disks per node
(SSD or Spinning)
• 2 GB of mem per node
• 1 Gb bonded NICs
ideally
For Sustained
Throughput of
200MB/sec and
hundreds of thousands
of events per second
• 5-7 nodes
• 24+ cores per node
(effective cpus)
• 12+ disks per node
(SSD or spinning)
• 4GB of mem per node
• 10 Gb bonded NICs
For Sustained
Throughput of 400-
500MB/sec and
hundreds of thousands
of events per second
• 7-10 nodes
• 24+ cores per node
(effective cpus)
• 12+ disks per node
(SSD or spinning)
• 6GB of mem per node
• 10 Gb bonded NICs
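The four sizing tiers above can be captured as a small lookup. The tier boundaries between the stated throughput points are an interpolation assumption, not Hortonworks guidance:

```python
def hdf_nodes(sustained_mb_s):
    """Map sustained throughput (MB/s) to the HDF node-count guidance
    above. Returns (min_nodes, max_nodes). Boundaries between the
    slide's stated tiers are assumed, not official."""
    if sustained_mb_s <= 50:
        return (1, 2)
    if sustained_mb_s <= 100:
        return (3, 4)
    if sustained_mb_s <= 200:
        return (5, 7)
    return (7, 10)   # the 400-500 MB/s tier

print(hdf_nodes(50))   # (1, 2)
print(hdf_nodes(450))  # (7, 10)
```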
Kafka - Sizing & Best Practices
§ Cluster Sizing – Rule of Thumb
– 10 MB/sec/node or 100,000 events/sec/node
• Higher throughput for large batch size
§ Configuration Best Practices
– Num Of Partitions = max (Total Producer Throughput / Throughput per partition, Total Consumer
Throughput / Throughput per partition)
• Over-estimate number of partitions per topic. Cannot increase partition count without breaking
message ordering guarantees
– Co-locate Kafka and Storm processes
• Storm is CPU bound while Kafka is throughput bound
• In high throughput scenarios, separate Kafka and Storm into independent nodes.
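The partition-count rule of thumb above translates directly to code. The throughput figures in the example call are illustrative assumptions, not measurements:

```python
import math

def num_partitions(producer_tp, consumer_tp, partition_tp):
    """Rule of thumb from above: partitions = max(total producer
    throughput, total consumer throughput) / throughput per
    partition, rounded up."""
    return math.ceil(max(producer_tp, consumer_tp) / partition_tp)

# Illustrative: 100 MB/s produced, 150 MB/s consumed, 10 MB/s per partition.
print(num_partitions(producer_tp=100, consumer_tp=150, partition_tp=10))  # 15
```

Over-estimating here is deliberate: as the slide notes, the partition count cannot be raised later without breaking per-key message ordering.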
Storm - Sizing & Best Practices
§ Cluster Sizing – Rule of Thumb
– 100,000 events per second per supervisor node
• Predicated on work being performed by Bolt’s execute method
• Mileage will vary by project
• Testing is critical
§ Configuration Best Practices
– 1 Worker / Machine / Topology
– 1 Executor per CPU Core
– Topology Parallelism = Num of Machines x (Num of Cores Per Machine -1 )
• Distribute total parallelism among spout and bolts to maximize topology throughput
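The parallelism rule of thumb above as code; the node and core counts in the example are illustrative assumptions:

```python
def topology_parallelism(machines, cores_per_machine):
    """Rule of thumb from above: one worker per machine per topology,
    one executor per core, leaving one core free per machine."""
    return machines * (cores_per_machine - 1)

# Illustrative: 3 supervisor nodes with 8 cores each.
print(topology_parallelism(3, 8))  # 21
```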
HBase - Sizing & Best Practices
§ Cluster Sizing – Rule of Thumb
– 10 MB/sec/node of Write Throughput
– 1-3 TB per node of compressed data (non replicated)
• HDFS volume of 6-12 TB
– Sizing = max(required ingestion rate / Write Throughput per node, Total data size/ Data Per Node)
§ Configuration Best Practices
– Region Server Size ~ 10G
– Number of Regions Per Region Server ~ 100-200
– Cluster/Pre-Split tables
– For IoT scenarios
• Consider using Hive to store raw data while using Phoenix to store aggregates
• Batch insert data to Phoenix using MapReduce
– Tailor Batch interval to application SLAs
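The HBase sizing formula above as code; the ingest rate and data volume in the example call are illustrative assumptions, while the 10 MB/s and 1-3 TB/node defaults come from the rule of thumb:

```python
import math

def hbase_nodes(ingest_mb_s, total_tb, write_tp_mb_s=10, tb_per_node=1):
    """Rule of thumb from above: nodes = max(required ingest rate /
    write throughput per node, total data size / data per node)."""
    return math.ceil(max(ingest_mb_s / write_tp_mb_s, total_tb / tb_per_node))

# Illustrative: 50 MB/s sustained writes, 12 TB compressed data, 2 TB/node.
print(hbase_nodes(ingest_mb_s=50, total_tb=12, tb_per_node=2))  # 6
```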
Ms. Brady knows that to get a handle on skyrocketing premiums, she will need to better understand what is causing the incidents and how to prevent them.
Ms. Brady sets the goal of
reducing incidents by 5%
within 90 days.
Incidents of maintenance vehicles have continued to increase under
COO Brady’s watch. The Department of Transportation has contacted
Mega Corporation.
(Chart: insurance premiums rising year over year, 2012 to 2015, reaching $17.5M)
Ms. Brady tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of the incidents and reduce them.
Business Analyst
Tam
Problem statement recap
Given the current premium cost of $3,500 per
truck on 5,000 trucks, a 10% reduction in
incidents will move the company from the high
risk insurance category they are currently in
and save the company $1,000 on its
insurance premium per truck per year,
or $5,000,000 annually.
Business Analyst
Tam
Problem statement recap
Sizing - Cluster Storage Requirement

Cluster Storage Required = Effective Capacity × Intermediate Size × Replication Count × Temp Space ÷ Compression Ratio

Rule of thumb
§ Replication Count: 3
§ Temp Space: ×1.2
Vary greatly
§ Intermediate/Materialized: 30-50%
§ Compression Ratio: 2-4
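The storage formula above as code. The 100 TB input and the +40% intermediate-data factor are illustrative picks from the stated ranges; replication, temp space and compression use the rule-of-thumb values:

```python
def cluster_storage(effective_tb, intermediate=1.4, replication=3,
                    temp=1.2, compression=3):
    """Raw cluster storage needed for a given amount of effective data:
    effective x intermediate (+40%) x 3x replication x 1.2x temp space,
    divided by a 3x compression ratio."""
    return effective_tb * intermediate * replication * temp / compression

# Illustrative: 100 TB of effective data -> 168 TB of raw cluster storage.
print(cluster_storage(100))
```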
Data Volume for Mega Corp
§ Number of Trucks = 5000
§ Events per second per truck = 10
§ Size of each event = 128 Bytes
§ 1 year raw sensor data storage requirements: 5000 x 10 x 128 x 60 x 60 x 24 x 365 = 200 TB
§ 5 year sensor data storage: 200TB X 5 X 1.5 (processing overhead) = 1.5 PB
§ Q: How many nodes are needed for storing 1.5PB? (answered later)
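The Mega Corp volume arithmetic above checks out in code (decimal TB/PB, i.e. 1 TB = 10^12 bytes):

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def raw_tb_per_year(trucks=5000, events_per_s=10, event_bytes=128):
    """One year of raw sensor data, in decimal terabytes."""
    return trucks * events_per_s * event_bytes * SECONDS_PER_YEAR / 1e12

one_year = raw_tb_per_year()
five_year_pb = one_year * 5 * 1.5 / 1000  # 5 years + 1.5x processing overhead

print(round(one_year))         # ~202 TB (the slide rounds to 200 TB)
print(round(five_year_pb, 1))  # ~1.5 PB
```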
HBase, Kafka, Storm and NiFi Requirements
Ingest rate = 128 Bytes × 5000 trucks × 10 events/s = 6.4 MB/s
Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed?
We will store the last 15 days of data in HBase.
HBase storage needed: 5000 × 10 × 60 × 60 × 24 × 15 × 128 = 8.2 TB
Q: How many HBase nodes are needed for 8.2 TB of storage?
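Both figures are easy to verify (note 128 B x 5000 x 10/s is megabytes per second, not kilobytes):

```python
def ingest_mb_per_s(trucks=5000, events_per_s=10, event_bytes=128):
    """Sustained fleet ingest rate, in MB/s."""
    return trucks * events_per_s * event_bytes / 1e6

def hbase_tb(days=15, trucks=5000, events_per_s=10, event_bytes=128):
    """Hot data kept in HBase, in decimal TB."""
    return trucks * events_per_s * event_bytes * 60 * 60 * 24 * days / 1e12

print(ingest_mb_per_s())     # 6.4 MB/s
print(round(hbase_tb(), 1))  # ~8.3 TB (the slide rounds down to 8.2)
```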
Sizing - Number Of Worker Nodes for Sensor Data
§ # of Worker Nodes = Total Cluster Storage ÷ Storage Per Server = 1.5 PB ÷ 48 TB = 32
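The worker-node calculation above as code, rounding up to whole servers:

```python
import math

def worker_nodes(total_storage_tb, storage_per_server_tb=48):
    """Worker node count = total cluster storage / usable storage per server."""
    return math.ceil(total_storage_tb / storage_per_server_tb)

print(worker_nodes(1500))  # 1.5 PB / 48 TB per server -> 32 nodes
```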
Sizing – NiFi, Kafka, Hbase and Storm Nodes
DataNodes & HBase: 32
NiFi: 2
Kafka & Storm Ingest Nodes: 3
Client Nodes: 2
Master Nodes: 5
Total: 44
§ Recall that:
§ NiFi can collect @ 50 MB/s/node
§ Kafka can ingest @10MB/s/node or 100,000 events/s/node
§ Storm can process @ 100,000 events/s/node
§ Each HBase Region Server can store 1TB
§ So for a 6.4 MB/s ingest rate: 1 NiFi, 1 Kafka and 1 Storm node are sufficient.
§ We will use 2 NiFi & 3 Kafka nodes for HA.
§ HBase nodes needed = 8.2 TB ÷ 1 TB per Region Server ≈ 8 nodes
§ Co-locate Kafka and Storm.
§ Co-locate DataNode and HBase.
NiFi 1
NiFi 2
Storm 1
Kafka 1
Storm 2
Kafka 2
Storm 3
Kafka 3
DataNode 1
HBase 1
Truck 1
Truck 2
Truck 3
Truck
5000
NiFi Nodes
Edge Nodes
Master NodesClients 1
Clients 2
DataNode 2
Hbase 2
DataNode 3
Hbase 3
DataNode 4
Hbase 4
DataNode 5
Hbase 5
DataNode 6
Hbase 6
DataNode 7
Hbase 7
DataNode 8
Hbase 8
DataNode 9 DataNode 10
DataNode 31 DataNode 32
Master 1
Master 2
Master 3
Master 4
Master 5
Worker Nodes
HDF
HDP
World
Megacorp
Datacenter
Ingest Node 1Master Node 4
StormHiveserver
WebHCat
Falcon
Worker Node 1
Node
Manager
Datanode
hBase
Region
Worker Node 2
Node
Manager
Datanode
hBase
Region
Worker Node 3
Node
Manager
Datanode
hBase
Region
Worker Node 4
Node
Manager
Datanode
hBase
Region
Worker Node 5
Node
Manager
Datanode
hBase
Region
hBase
Master 1
Master Node 3Master Node 2Master Node 1
Namenode
1
Zookeeper
Oozie
Zookeeper
Namenode
2
Resource
Manager 1
Zookeeper
History
Server
Timeline
Server
Hiveserver
2
JournalNode
JournalNode
JournalNode
Resource
Manager 2
hBase
Master 2
Kafka
Master Node 5
Zookeeper
History
Server
Ambari
Monitoring
& Metrics
Worker Node 32
Node
Manager
Datanode
hBase
Region
Ingest Node 2
Storm
Kafka
Ingest Node 3
Storm
Kafka
Edge Node 1
Clients
Knox
Edge Node 2
Clients
Knox
HDP Service Layout
Master Node Specs
12 + Cores
128 - 256 GB RAM
(1 X 256GB SSD Drive for OS)
(2 X 1TB Drives)
2 X 1 – 10 Gb Switch
Approximate Cost Per Node $8,000 - $18,000
NiFi Nodes Specs
8+ Cores
16 GB RAM
(1 X 256GB SSD Drive for OS)
(2 X 1TB Drives)
2 X 1 – 10 Gb Switch
Approximate Cost Per Node $5,000 - $8,000
Slave (Worker) Node Specs
12+ cores
32–64 GB RAM
Drives, depending on workload profile:
12 × 1 TB SATA drives (processing/IOPS optimized), or
12 × 2 TB SATA drives (balanced), or
12 × 4 TB SATA drives (storage optimized)
1 × 1–10 Gb switch connection
Approximate cost per node: $5,000–$12,000
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Project Cost & ROI
Project Plan
Strategy
10 days
Training
10 days
Design & Build
60 days
Test
30 days
Promote
10 days
Use Case Workshop
Cluster Build-out
Solution Build-out
Prove-out
Promote Solution
Tam puts together a quick project plan and estimates it will take 120 days to deliver Ms. Brady her solution.
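The five phase durations above do sum to the quoted 120 days, which a one-line check confirms:

```python
# Phase durations (days) from the project plan above.
phases = {"Strategy": 10, "Training": 10, "Design & Build": 60,
          "Test": 30, "Promote": 10}
total_days = sum(phases.values())
print(total_days)  # 120
```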
Resource Plan
§ Data Scientist (consultant): Tam
§ Data Flow (consultant): Varun
§ Architect (consultant): Jeff
§ Developer (consultant): Sue
§ Project Manager: Jen
§ Engagement Manager (consultant): Jim
§ Enterprise Architect: Frank
§ Business Analyst: Sue
§ Developer: Jim
IoT on HDP
Problem Statement
Reference Architecture
& Sizing
Solution Design
& Customer Case Studies
Implementation Plan
Page 76 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Project Cost & ROI
Project Cost

Component | Quantity | Unit Cost | Total Cost
Hardware | 44 | $10,000 | $440K
Software – HDP | 11 SKUs | $18,000/SKU | $198K
Software – HDF | 2 SKUs | $36,000/SKU | $72K
Dev & Test Consulting | 3,040 hrs* | $300/hr | $912K
Engagement Consulting | 360 hrs* | $300/hr | $108K
Training | 30** | $2,500 | $75K
Travel & Expense | – | – | $100K
Total | | | $1.905M

* 4 resources × 8 hrs/day × 95 days; engagement manager for 45 days
** Admin, Analyst & Data Science training for 30 associates
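Rolling up the line items programmatically (figures taken from the cost table above) gives a project cost of $1,905,000, and subtracting it from the $5M insurance saving yields roughly $3.1M in first-year savings:

```python
# Line items from the project cost table.
line_items = {
    "Hardware (44 @ $10,000)":                      44 * 10_000,
    "Software - HDP (11 SKUs @ $18,000)":           11 * 18_000,
    "Software - HDF (2 SKUs @ $36,000)":             2 * 36_000,
    "Dev & test consulting (3,040 hrs @ $300/hr)": 3_040 * 300,
    "Engagement consulting (360 hrs @ $300/hr)":     360 * 300,
    "Training (30 seats @ $2,500)":                 30 * 2_500,
    "Travel & expense":                            100_000,
}
project_cost = sum(line_items.values())          # 1,905,000
first_year_savings = 5_000_000 - project_cost    # 3,095,000 (~$3.1M)
print(f"${project_cost:,}", f"${first_year_savings:,}")
```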
Project ROI
§ Insurance cost reduction: $5M
§ Project cost: $1.905M
§ First-year savings: ≈ $3.1M
Thank You
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Solving Big Data Problems using Hortonworks

  • 1. Solving Big Data Problems using Hortonworks © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  • 2. Hortonworks Company Profile: the ONLY 100% open source Apache Hadoop data platform. Founded in 2011. 1st Hadoop provider to go public (IPO 4Q14, NASDAQ: HDP). 800+ employees across 17 countries. 1,350 technology partners. Fastest company to reach $100M in revenue.
  • 3. Let’s talk about Big Data (September 2014 survey of 100 CIOs from the US and Europe)
  • 4. What problems and opportunities does Big Data create? NEW: data that traditional platforms cannot handle. TRADITIONAL: existing enterprise data. The Opportunity: unlock transformational business value from a full fidelity of data and analytics for all data. Traditional Data Sources: ERP, CRM, SCM; files & emails; server logs. New Data Sources: geolocation, sensors and machines, clickstream, social media.
  • 5. The Future of Data: Actionable Intelligence. Data in motion from the Internet of Anything flows into storage, where it becomes data at rest shared across groups.
  • 6. Hortonworks Data Platform H O R T O N W O R K S D ATA P L AT F O R M Batch Interactive Search Streaming Machine Learning YARN Resource Management System CLICKSTREAM SENSOR SOCIAL MOBILE GEOLOCATIONS SERVER LOG EXISTING
  • 7. HDP is a collection of Apache Projects. Hortonworks Data Platform components: Hadoop & YARN, Flume, Oozie, Pig, Hive, Tez, Sqoop, Cloudbreak, Ambari, Slider, Kafka, Knox, Solr, ZooKeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm and Phoenix, spanning Data Mgmt, Data Access, Governance & Integration, Operations and Security. Component versions advance with each release: HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), HDP 2.3 (July 2015). Ongoing innovation in Apache.
  • 8. Hortonworks Data Flow Visual User Interface Drag and drop for efficient, agile operations Immediate Feedback Start, stop, tune, replay dataflows in real-time Adaptive to Volume and Bandwidth Any data, big or small Event Level Data Provenance Governance, compliance & data evaluation Secure Data Acquisition & Transport Fine grained encryption for controlled data sharing and selective data democratization Powered by Apache NiFi
  • 9. HDF and HDP Deliver a Complete Big Data Solution • HDF dynamically connects HDP to data at the edge • HDF secures and encrypts the movement of data into HDP • HDF includes mature IoAT data protocols that improve device extensibility • HDF supports easily adjustable bi-directional IoAT dataflows • HDF offers traceability of IoAT data with lineage and audit trails • HDF brings a real-time, visual user interface to manipulate live dataflows
  • 10. Hortonworks Revenue Model. HDP and HDF are 100% free and Open Source, with no license. Our customers subscribe to support, consulting experts and training programs. Annual Subscriptions align your success with ours. Expert Consulting & Training help your team get to actionable intelligence as efficiently as possible. ARCHITECT & DEVELOP, DEPLOY, OPERATE, then EXPAND across Projects 1 through 6.
  • 12. Hadoop Driver: Cost optimization. Archive data off EDW: move rarely used data to Hadoop as active archive, store more data longer. Offload costly ETL process: free your EDW to perform high-value functions like analytics & operations, not ETL. Enrich the value of your EDW: use Hadoop to refine new data sources, such as web and machine data, for new analytical context. HDP helps you reduce costs and optimize the value associated with your EDW. ANALYTICS: data marts, business analytics, visualization & dashboards. DATA SYSTEMS: Enterprise Data Warehouse (hot), MPP, in-memory, plus HDP 2.3 for cold data, deeper archive & new sources. SOURCES: existing systems (ERP, CRM, SCM), clickstream, web & social, geolocation, sensor & machine, server logs, unstructured.
  • 13. Single View Improve acquisition and retention Predictive Analytics Identify your next best action Data Discovery Uncover new findings Financial Services New Account Risk Screens Trading Risk Insurance Underwriting Improved Customer Service Insurance Underwriting Aggregate Banking Data as a Service Cross-sell & Upsell of Financial Products Risk Analysis for Usage-Based Car Insurance Identify Claims Errors for Reimbursement Telecom Unified Household View of the Customer Searchable Data for NPTB Recommendations Protect Customer Data from Employee Misuse Analyze Call Center Contacts Records Network Infrastructure Capacity Planning Call Detail Records (CDR) Analysis Inferred Demographics for Improved Targeting Proactive Maintenance on Transmission Equipment Tiered Service for High-Value Customers Retail 360° View of the Customer Supply Chain Optimization Website Optimization for Path to Purchase Localized, Personalized Promotions A/B Testing for Online Advertisements Data-Driven Pricing, improved loyalty programs Customer Segmentation Personalized, Real-time Offers In-Store Shopper Behavior Manufacturing Supply Chain and Logistics Optimize Warehouse Inventory Levels Product Insight from Electronic Usage Data Assembly Line Quality Assurance Proactive Equipment Maintenance Crowdsource Quality Assurance Single View of a Product Throughout Lifecycle Connected Car Data for Ongoing Innovation Improve Manufacturing Yields Healthcare Electronic Medical Records Monitor Patient Vitals in Real-Time Use Genomic Data in Medical Trials Improving Lifelong Care for Epilepsy Rapid Stroke Detection and Intervention Monitor Medical Supply Chain to Reduce Waste Reduce Patient Re-Admittance Rates Video Analysis for Surgical Decision Support Healthcare Analytics as a Service Oil & Gas Unify Exploration & Production Data Monitor Rig Safety in Real-Time Geographic exploration DCA to Slow Well Declines Curves Proactive Maintenance for Oil Field Equipment Define Operational Set Points 
for Wells. Government: Single View of Entity; CBM & Autonomic Logistics Analysis; Sentiment Analysis on Program Effectiveness; Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting. Hadoop Driver: Advanced analytic applications
  • 14. NiFi and HDF Drivers Optimize Splunk: Reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk Ingest Logs for Cyber Security: Integrated and secure log collection for real-time data analytics and threat detection Feed Data to Streaming Analytics: Accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming Move Data Internally: Optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure Capture IoT Data: Transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power or connectivity, avoiding data loss
  • 15. Hadoop Driver: Enabling the data lake (growing in both scale and scope). Data Lake Definition: • Centralized Architecture: multiple applications on a shared data set with consistent levels of service • Any App, Any Data: multiple applications accessing all data, affording new insights and opportunities • Unlocks ‘Systems of Insight’: advanced algorithms and applications used to derive new value and optimize existing value. Drivers: 1. Cost Optimization 2. Advanced Analytic Apps. Goal: • Centralized Architecture • Data-driven Business. Journey to the Data Lake with Hadoop: Systems of Insight.
  • 16. Case Study: 12-month Hadoop evolution at TrueCar. Data platform capabilities, 12-month execution plan: June 2013 Begin Hadoop Execution; July 2013 Hortonworks Partnership; Aug 2013 Training & Dev Begins; Nov 2013 Production Cluster, 60 Nodes / 2 PB; Dec 2013 Three Production Apps (3 total); Jan 2014 40% Dev Staff Perficient; Feb 2014 Three More Production Apps (6 total); May 2014 IPO. 12-Month Results at TrueCar: • Six Production Hadoop Applications • Sixty nodes / 2 PB data • Storage/Compute Costs from $19/GB to $0.12/GB. “We addressed our data platform capabilities strategically as a precursor to IPO.”
  • 18. Hadoop emerged as foundation of new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data • Built by Yahoo! to be the heartbeat of its ad & search business • Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises • Incredibly disruptive to current platform economics. Traditional Hadoop Advantages: ✓ Manages new data paradigm ✓ Handles data at scale ✓ Cost effective ✓ Open source. Traditional Hadoop Had Limitations: batch-only architecture; single-purpose clusters tied to specific data sets; difficult to integrate with existing investments; not enterprise-grade. Application storage: HDFS; batch processing: MapReduce.
  • 19. 2006: Hadoop with MapReduce: HDFS (Hadoop Distributed File System) plus MapReduce, largely batch processing; siloed clusters, largely a batch system, difficult to integrate. 2009: MR-279: YARN. Hadoop 2 & YARN-based architecture: YARN (Data Operating System) over HDFS, supporting batch, interactive and real-time workloads. Architected & led development of YARN to enable the Modern Data Architecture (October 23, 2013).
  • 20. Apache Hadoop – Data Operating System Shared Compute & Workload Management • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases Common & Shared Scale Out Storage • Shared data assets • Flexible schema • Cross workload access YARN: Data Operating System (Cluster Resource Management) 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° Script Pig SQL Hive TezTez Java Scala Cascading Tez ° ° ° ° ° ° ° ° ° ° ° ° ° ° Others ISV Engines HDFS (Hadoop Distributed File System) Stream Storm Search Solr NoSQL HBase Accumulo Slider Slider BATCH, INTERACTIVE & REAL-TIME DATA ACCESS In-Memory Spark Enterprise Hadoop
  • 21. Core Capabilities of Enterprise Hadoop. GOVERNANCE & INTEGRATION: load data and manage according to policy. OPERATIONS: deploy and effectively manage the platform. DATA MANAGEMENT: store and process all of your corporate data assets. DATA ACCESS: access your data simultaneously in multiple ways (batch, interactive, real-time). SECURITY: provide a layered approach to security through authentication, authorization, accounting, and data protection. PRESENTATION & APPLICATION: enable both existing and new applications to provide value to the organization. ENTERPRISE MGMT & SECURITY: empower existing operations and security tools to manage Hadoop. DEPLOYMENT OPTIONS: provide deployment choice across physical, virtual, and cloud.
  • 22. Hortonworks Data Platform 2.3. YARN: Data Operating System. Security: administration, authentication, authorization, auditing, data protection (Ranger, Knox, Atlas, HDFS encryption). Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS. Operations: provisioning, managing & monitoring (Ambari, Cloudbreak, ZooKeeper); scheduling (Oozie). Data Access: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV engines), running on Tez and Slider. Data Management: HDFS (Hadoop Distributed File System). Data Lifecycle & Governance: Falcon, Atlas. Deployment Choice: Linux, Windows, On-Premise, Cloud.
  • 24. Basic EDW Cost Optimization Architecture Batch Sqoop Transform Processed Hive Raw HDFS Interactive HiveServer Reporting BI Tools Load EDW Existing Analytics Fetch 1 2 3 4 External Tables
  • 25. More than saving cost: Enrich With New Data Batch Sqoop Transform Processed Hive Raw HDFS Interactive HiveServer Reporting BI Tools Load EDW New Sources Streaming NiFi Load Existing Analytics Fetch New Analytics 1 2 3 4 5 6 External Tables
  • 26. Streaming Solution Architecture: real-time data feeds enter via Apache Kafka with streaming ingest into the HDP 2.x data lake (YARN, HDFS). Real-Time Stream Processing: Storm. Online Data Processing: HBase, Accumulo. Search: Solr on Slider. SQL: Hive.
  • 27. Key Tenets of Lambda Architecture § Batch Layer: manages master data; immutable, append-only set of raw data; cleanse, normalize & pre-compute batch views; advanced statistical calculations § Speed Layer: real-time event stream processing; computes real-time views § Serving Layer: low-latency, ad-hoc query; reporting, BI & dashboards. New data flows to both the batch layer (store, pre-compute views) and the speed layer (process streams, incremental views); queries hit business views in the serving layer. HDP and HDF High Level Big Data IoT Architecture
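The three layers above can be sketched as a toy in Python: the batch layer pre-computes views over the immutable master dataset, the speed layer keeps incremental views over events not yet absorbed by a batch run, and the serving layer merges both at query time. The class and event shape are purely illustrative, not part of any Hortonworks API.

```python
from collections import Counter

class LambdaToy:
    """Minimal sketch of the Lambda pattern: batch + speed + serving layers."""

    def __init__(self):
        self.master = []              # batch layer: immutable, append-only raw data
        self.batch_view = Counter()   # pre-computed batch view (counts per key)
        self.speed_view = Counter()   # real-time view of events since the last batch run

    def ingest(self, key):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from all raw data; reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch view and the real-time view."""
        return self.batch_view[key] + self.speed_view[key]

lake = LambdaToy()
for k in ["truck-7", "truck-7", "truck-9"]:
    lake.ingest(k)
print(lake.query("truck-7"))  # 2, served from the speed layer alone
lake.run_batch()
lake.ingest("truck-7")
print(lake.query("truck-7"))  # 3 = 2 (batch view) + 1 (speed view)
```

The key property the toy preserves is that raw data is never mutated: a bug in a view can always be fixed by recomputing from the master dataset.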
  • 28. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  • 29. www.hortonworks.com Mega Corp has a problem: incidents of maintenance vehicles have continued to increase under COO Brady’s watch, and insurance premiums have climbed year over year from 2012 to 2015 (reaching $17.5M). Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to better understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days and tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of incidents and reduce them.
  • 30. www.hortonworks.com Given the current premium cost of $3,500 per truck on 5,000 trucks, a 10% reduction in incidents will move the company from the high risk insurance category they are currently in and save the company $1000 on their insurance premium per truck per year or $5,000,000 annually. Business Analyst Tam
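Tam's estimate is straightforward arithmetic; the sketch below makes the slide's figures explicit, including that total premiums line up with the $17.5M shown on the premiums chart:

```python
trucks = 5000
premium_per_truck = 3500   # current annual premium per truck, USD
saving_per_truck = 1000    # expected premium reduction after leaving the high-risk category

total_premiums = trucks * premium_per_truck   # $17,500,000: the "17.5M" on the premiums chart
annual_saving = trucks * saving_per_truck
print(f"${annual_saving:,} saved per year")   # $5,000,000 saved per year
```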
  • 31. www.hortonworks.com Tam considers four questions she must answer to better understand and mitigate incidents. They are: 1) Is there a correlation of driver training to incidents? 2) Is there a correlation of weather to incidents? 3) Is there a correlation between certain driving behavior and incidents? 4) Is it possible to predict incidents before they occur? Shift from Reactive to Proactive & Prescriptive: from reaction to human activity, to behavioral insight; from static resource planning, to resource optimization; from break-then-fix, to preventative maintenance.
  • 32. www.hortonworks.com Initially, Tam’s team is concerned that they may not be able to capture all the necessary data to answer the questions Tam has posed and help her mitigate incidents. They know that the data is not all structured and some of it is created in real-time and transmitted over the Internet. In addition, some data will have to be captured from external sources. Vehicle Data Route Data Weather Data Structured Driver Data Semi-Structured Maintenance Data SueVarun Jeff
  • 33. The Team Recognizes the Current Data Architecture Limits Predictive Capabilities: 1. Data Silos: difficult to find predictive correlations 2. Data Volumes: cannot store enough data to find patterns 3. New Data Sources: unable to capture and use new data for real-time analysis. DATA SYSTEMS: Enterprise Data Warehouse (hot), MPP, in-memory; Systems of Record: RDBMS, ERP, CRM. New sources: clickstream, web & social, geolocation, sensor & machine, server logs, unstructured. ANALYTICS: data marts, business analytics, visualization & dashboards.
  • 34. Page 34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved DATASYSTEMS Enterprise Data Warehouse Hot MPP In-Memory RDBMS ERPCRM Systems of Record The Team Leverages HDF & HDP to Expand The Capabilities of Their Existing Data Platform ANALYTICS Data Marts Business Analytics Visualization & Dashboards
  • 35. www.hortonworks.com The team then engages their favorite SI and attends Hortonworks University training to get the project under way: Business Analyst (Tam) + HDP Data Analyst Training = HDP Data Analyst; Developer (Varun) + Developer Training = HDP Developer; System Admin (Sue) + HDP System Admin Training = HDP Sys Admin; SME (Jeff) + Data Science Training = HDP Data Scientist.
  • 36. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  • 37. Solution Architecture: Stream Processing & Modeling (Kafka, Storm & Spark); Distributed Storage: HDFS; Many Workloads: YARN; Real-time Serving & Searching (HBase); Alerts & Events; Real-Time Web App; Interactive Query: SQL (Hive on Tez); a single cluster with consistent security, governance & operations; Collect, Conduct & Curate (HDF: bidirectional data flow) from truck sensors, weather data, and the EDW via Sqoop. The chosen solution provides Mega Corp with the foundation to capture all the required data, analyze correlations, and ultimately create a model that allows them to predict and mitigate incidents before they happen.
  • 38. www.hortonworks.com Tam (Analyst) and Varun (Developer) build the application on HDP.
  • 39. www.hortonworks.com Ms. Brady is happy with the results. She is able to determine that a subset of drivers is responsible for the increased cost. But like most managers she is not happy for long. Now she wants to be able to predict future incidents. Data Scientist Jeff points out that HDP has a tremendous machine learning and statistical algorithm library, and he can use these libraries to predict which drivers are likely to have an event before it occurs.
  • 40. www.hortonworks.com Jeff implements predicted violations logic using HDP Machine Learning and is able to predict events before they happen
  • 41. www.hortonworks.com Ms. Brady is happy now that she can isolate where problems exist, identify causal events and build models that help predict events before they occur.
  • 42. www.hortonworks.com < TODO: Show St. Louis Case Study > http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
  • 43. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  • 44. Big Data Functional Architecture: Key Tenets of Lambda Architecture § Batch Layer: manages master data; immutable, append-only set of raw data; cleanse, normalize & pre-compute batch views; advanced statistical calculations § Speed Layer: real-time event stream processing; computes real-time views § Serving Layer: low-latency, ad-hoc query; reporting, BI & dashboards. New data flows to both the batch layer (store, pre-compute views) and the speed layer (process streams, incremental views); queries hit business views in the serving layer. HDP and HDF High Level Big Data IoT Architecture
  • 45. Detailed Reference Architecture for IoT Applications. SOURCE DATA: server logs, application logs, firewall logs, CRM/ERP, sensors. High-speed ingest: stream to HDF, forward to Storm via Kafka; Sqoop and Flume for batch sources. Real-Time: Storm/Spark Streaming with event enrichment, alerts (JMS), bolts sinking to HDFS, and real-time storage in HBase/Phoenix. Batch: Flume sink to HDFS, transform with Pig, iterative ML with Spark-ML, machine learning models. Interactive: Hive on HDFS, HiveServer and Spark-Thrift serving reporting, BI tools, dashboards (Silk) and an interactive UI framework.
  • 46. Page 46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sample Ingest: NiFi
  • 47. Apache Storm – Key Attributes. Open source, real-time event stream processing platform that provides fast, continuous, low-latency processing for very high frequency streaming data. Highly scalable: horizontally scalable like Hadoop, e.g. a 10-node cluster can process 1M tuples per second. Fault-tolerant: automatically reassigns tasks on failed nodes. Guarantees processing: supports at-least-once & exactly-once processing semantics. Language agnostic: processing logic can be defined in any language. Apache project: brand, governance & a large active community.
  • 48. Page 48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Storm - Basic Concepts Spouts: Generate streams. Tuple: Most fundamental data structure and is a named list of values that can be of any datatype Streams: Groups of tuples Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts Tuple Tree: First spout tuple and all the tuples that were emitted by the bolts that processed it Topology: Group of spouts and bolts wired together into a workflow Topology
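The spout/bolt/topology vocabulary above can be illustrated with a small Python simulation. Real Storm topologies are written against the Storm APIs (typically Java, with multi-language adapters); the generator-based spout and bolt names here are stand-ins chosen only to show how tuples flow from a spout through chained bolts.

```python
def sensor_spout(readings):
    """Spout: generates a stream of tuples (named lists of values)."""
    for truck_id, speed in readings:
        yield {"truck": truck_id, "speed": speed}

def speeding_bolt(stream, limit=65):
    """Bolt: processing logic; emits tuples for downstream bolts."""
    for t in stream:
        if t["speed"] > limit:
            yield {"truck": t["truck"], "violation": "speeding"}

def alert_bolt(stream):
    """Terminal bolt: persistence/alerting logic (here, just collect)."""
    return [f"ALERT {t['truck']}: {t['violation']}" for t in stream]

# Topology: spout and bolts wired together into a workflow
readings = [("truck-7", 72), ("truck-9", 55), ("truck-3", 80)]
alerts = alert_bolt(speeding_bolt(sensor_spout(readings)))
print(alerts)  # ['ALERT truck-7: speeding', 'ALERT truck-3: speeding']
```

In Storm proper, each emitted tuple is tracked in a tuple tree anchored at the spout tuple, which is how the at-least-once guarantee is enforced; the simulation above skips acking entirely.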
  • 49. Distributed Database With Apache HBase: 100% open source; store and process petabytes of data; flexible schema; scale out on commodity servers; high performance, high availability; integrated with YARN; SQL and NoSQL interfaces. HBase RegionServers run under YARN (the data operating system) over HDFS for permanent data storage: dynamic schema, scales horizontally to PBs of data, directly integrated with Hadoop (HDP).
  • 50. Apache Phoenix – Relational Database Layer Over HBase. A SQL Skin for HBase: • Provides a SQL interface for managing data in HBase. • Supports a large subset of the SQL:1999 mandatory feature set. • Create tables, insert and update data and perform low-latency point lookups through JDBC. • The Phoenix JDBC driver is easily embeddable in any app that supports JDBC. Phoenix Makes HBase Better: • Oriented toward online / transactional apps. • If HBase is a good fit for your app, Phoenix makes it even better. • Phoenix gets you out of the “one table per query” model many other NoSQL stores force you into.
  • 51. In-Memory With Spark: Spark SQL, Spark Streaming, MLlib, GraphX. § A data access engine for fast, large-scale data processing § Designed for iterative in-memory computations and interactive data mining § Provides expressive multi-language APIs for Scala, Java and Python
  • 52. Page 52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Spark ML for machine learning Democratizes Machine Learning Unsupervised tasks • Clustering (K-means) • Recommendation • Collaborative Filtering: alternating least squares • Dimensionality reduction: PCA, SVD Supervised tasks • Classification • Naïve Bayes, Decision Tree, Random Forest, Gradient boosted trees • Regression • Linear models (SVM, linear regression, logistic regression)
  • 53. Page 53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache Hive: SQL in Hadoop • Created by a team at Facebook • Provides a standard SQL interface to data stored in Hadoop • Quickly analyze data in raw data files • Proven at petabyte scale • Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc… SensorMobile Weblog Operational / MPP SQL Queries
  • 54. Page 54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Comparing SQL Options In HDP Project Strengths Use Cases Unique Capabilities Apache Hive Most comprehensive SQL Scale Maturity ETL Offload Reporting Large-scale aggregations Robust cost-based optimizer Mature ecosystem (BI, backup, security and replication) SparkSQL In-memory Low latency Exploratory analytics Dashboards Language-integrated Query Apache Phoenix Real-time read / write Transactions High concurrency Dashboards System-of-engagement Drill-down / Drill-up Real-time read / write
  • 55. Page 55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Comparing Streaming Options In HDP Apache Storm Spark Streaming One At A Time Micro Batch (minimum batch latency = 500 ms) Low Latency Higher Throughput Operates on Tuple Stream Operates on Streams of Tuple Batches At Least Once (Trident For Exactly Once) Exactly Once Multiple Language Support Multiple Language Support
  • 56. Page 56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing
  • 57. HDF Sizing & Best Practices (sustained throughput). For 50 MB/sec and thousands of events per second: 1-2 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; 1 Gb bonded NICs ideally. For 100 MB/sec and tens of thousands of events per second: 3-4 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of memory per node; 1 Gb bonded NICs ideally. For 200 MB/sec and hundreds of thousands of events per second: 5-7 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 4 GB of memory per node; 10 Gb bonded NICs. For 400-500 MB/sec and hundreds of thousands of events per second: 7-10 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 6 GB of memory per node; 10 Gb bonded NICs.
  • 58. Page 58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Kafka - Sizing & Best Practices § Cluster Sizing – Rule of Thumb – 10 MB/sec/Node or 100,000/sec/Node • Higher throughput for large batch size § Configuration Best Practices – Num Of Partitions = max (Total Producer Throughput / Throughput per partition, Total Consumer Throughput / Throughput per partition) • Over-estimate number of partitions per topic. Cannot increase partition count without breaking message ordering guarantees – Collocate Kafka and Storm process • Storm is CPU bound while Kafka is throughput bound • In high throughput scenarios, separate Kafka and Storm into independent nodes.
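The partition-count rule of thumb above can be written directly. The per-partition throughput figures are measurements you would take on your own hardware; the numbers in the example are invented for illustration.

```python
import math

def num_partitions(total_producer_mbps, total_consumer_mbps,
                   producer_mbps_per_partition, consumer_mbps_per_partition):
    """Num partitions = max(producer need, consumer need), each rounded up.
    Over-estimating is cheap; growing partitions later breaks ordering guarantees."""
    by_producer = math.ceil(total_producer_mbps / producer_mbps_per_partition)
    by_consumer = math.ceil(total_consumer_mbps / consumer_mbps_per_partition)
    return max(by_producer, by_consumer)

# e.g. 120 MB/s produced, 240 MB/s consumed (multiple consumer groups),
# with one partition sustaining 10 MB/s write and 20 MB/s read:
print(num_partitions(120, 240, 10, 20))  # 12
```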
  • 59. Page 59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Storm - Sizing & Best Practices § Cluster Sizing – Rule of Thumb – 100,000 events per second per supervisor node • Predicated on work being performed by Bolt’s execute method • Mileage will vary by project • Testing is critical § Configuration Best Practices – 1 Worker / Machine / Topology – 1 Executor per CPU Core – Topology Parallelism = Num of Machines x (Num of Cores Per Machine -1 ) • Distribute total parallelism among spout and bolts to maximize topology throughput
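The topology-parallelism rule of thumb from this slide, as a one-line helper; the cluster shape in the example is made up, and as the slide notes, real numbers must come from testing.

```python
def topology_parallelism(machines, cores_per_machine):
    """Rule of thumb: one executor per CPU core, reserving one core
    per machine, so parallelism = machines x (cores per machine - 1)."""
    return machines * (cores_per_machine - 1)

# e.g. a 5-node cluster of 16-core supervisor machines:
print(topology_parallelism(5, 16))  # 75 executors to distribute among spouts and bolts
```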
  • 60. Page 60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HBase - Sizing & Best Practices § Cluster Sizing – Rule of Thumb – 10 MB/s/node of write throughput – 1-3 TB per node of compressed, non-replicated data • HDFS volume of 6-12 TB – Sizing = max(required ingest rate / write throughput per node, total data size / data per node) § Configuration Best Practices – Region size ~ 10 GB – Number of regions per region server ~ 100-200 – Create pre-split tables – For IoT scenarios • Consider using Hive to store raw data while using Phoenix to store aggregates • Batch-insert data into Phoenix using MapReduce – Tailor the batch interval to application SLAs
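The HBase sizing formula is a simple max over two constraints: ingest-bound and storage-bound node counts. A minimal sketch, with defaults taken from the deck's rules of thumb (the function itself is illustrative, not a Hortonworks tool):

```python
from math import ceil

def hbase_nodes(ingest_mb_per_sec, total_data_tb,
                write_tput_mb_per_node=10, data_tb_per_node=1):
    """Slide formula: max(required ingest rate / write throughput per node,
    total data size / data per node), rounded up to whole nodes. Defaults
    follow the deck: 10 MB/s writes per node, 1 TB per region server."""
    return max(ceil(ingest_mb_per_sec / write_tput_mb_per_node),
               ceil(total_data_tb / data_tb_per_node))

# Mega Corp's numbers from later slides: 6.4 MB/s ingest, ~8.2 TB retained.
print(hbase_nodes(6.4, 8.2))  # -> 9 (the deck rounds this to 8 nodes)
```

Storage, not ingest rate, is the binding constraint in this case study.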
  • 61. www.hortonworks.com Incidents involving maintenance vehicles have continued to increase under COO Brady's watch, and the Department of Transportation has contacted Mega Corporation. Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days. [Chart: insurance premiums rising year over year, 2012-2015; $17.5M shown.] Ms. Brady tasks her Business Analyst, Tam, with gathering the data needed to understand the cause of the incidents and reduce them. Business Analyst Tam Problem statement recap
  • 62. www.hortonworks.com Given the current premium cost of $3,500 per truck across 5,000 trucks, a 10% reduction in incidents will move the company out of its current high-risk insurance category, saving $1,000 per truck per year in premiums, or $5,000,000 annually. Business Analyst Tam Problem statement recap
  • 63. Page 63 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing - Cluster Storage Requirement = (Effective Capacity × Intermediate Size × Replication Count × Temp Space) / Compression Ratio § Rule of thumb: Replication Count = 3; Temp Space = ×1.2 § Vary greatly: Intermediate/Materialized Size = 30-50%; Compression Ratio = 2-4×
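The storage formula can be checked numerically. In the sketch below the defaults are mid-range assumptions within the slide's stated ranges (40% intermediate growth, 3× compression), not values the deck mandates.

```python
def cluster_storage_tb(raw_data_tb, intermediate=0.4, replication=3,
                       temp=1.2, compression=3):
    """Slide formula: (raw x intermediate growth x replication x temp space)
    / compression ratio. Intermediate share (30-50%) and compression ratio
    (2-4x) vary greatly; the defaults here are mid-range assumptions."""
    return raw_data_tb * (1 + intermediate) * replication * temp / compression

# 1 PB of raw data under these assumptions:
print(round(cluster_storage_tb(1000), 1))  # -> 1680.0 TB
```

Note that with a good compression ratio, replicated cluster capacity can end up not much larger than the raw uncompressed data.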
  • 64. Page 64 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Data Volume for Mega Corp § Number of trucks = 5000 § Events per second per truck = 10 § Size of each event = 128 bytes § 1-year raw sensor data storage requirement: 5000 × 10 × 128 × 60 × 60 × 24 × 365 ≈ 200 TB § 5-year sensor data storage: 200 TB × 5 × 1.5 (processing overhead) = 1.5 PB § Q: How many nodes are needed to store 1.5 PB? (answered later)
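The arithmetic above can be verified in a few lines; this simply replays the slide's calculation.

```python
TRUCKS = 5000
EVENTS_PER_SEC = 10
EVENT_BYTES = 128
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# One year of raw sensor data, in TB (decimal units).
raw_year_tb = TRUCKS * EVENTS_PER_SEC * EVENT_BYTES * SECONDS_PER_YEAR / 1e12
# Five years, with the slide's 1.5x processing overhead, in PB.
five_year_pb = raw_year_tb * 5 * 1.5 / 1000

print(round(raw_year_tb))      # -> 202, the slide's "~200 TB"
print(round(five_year_pb, 2))  # -> 1.51, the slide's "1.5 PB"
```

The slide's rounded figures check out against the exact calculation.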
  • 65. Page 65 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HBase, Kafka, Storm and NiFi Requirements Ingest rate = 128 bytes × 5000 trucks × 10 events/s = 6.4 MB/s Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka and Storm nodes are needed? We will store the last 15 days of data in HBase. HBase storage needed: 5000 × 10 × 60 × 60 × 24 × 15 × 128 bytes ≈ 8.2 TB Q: How many HBase nodes are needed for 8.2 TB of storage?
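A quick check of both figures (note the ingest rate works out to megabytes per second, not kilobytes):

```python
# Ingest rate: bytes/s -> MB/s.
ingest_mb_s = 128 * 5000 * 10 / 1e6
# 15 days of events retained in HBase, in TB.
hbase_tb = 5000 * 10 * 60 * 60 * 24 * 15 * 128 / 1e12

print(ingest_mb_s)         # -> 6.4 (MB/s)
print(round(hbase_tb, 1))  # -> 8.3 (the slide rounds to 8.2 TB)
```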
  • 66. Page 66 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing - Number of Worker Nodes for Sensor Data § # of Worker Nodes = Total Cluster Storage / Storage per Server = 1.5 PB / 48 TB ≈ 32
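In code, using the storage-optimized worker spec from a later slide (12 × 4 TB drives per node, i.e. 48 TB raw per server):

```python
from math import ceil

total_cluster_tb = 1500       # 1.5 PB of sensor data over 5 years
storage_per_server_tb = 48    # 12 x 4 TB SATA drives per worker node

workers = ceil(total_cluster_tb / storage_per_server_tb)
print(workers)  # -> 32
```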
  • 67. Page 67 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Sizing – NiFi, Kafka, HBase and Storm Nodes § Recall that: NiFi can collect @ 50 MB/s/node; Kafka can ingest @ 10 MB/s/node or 100,000 events/s/node; Storm can process @ 100,000 events/s/node; each HBase region server can store 1 TB § So for the 6.4 MB/s ingest rate, 1 NiFi, 1 Kafka and 1 Storm node are sufficient § We will use 2 NiFi and 3 Kafka nodes for HA § HBase nodes needed = 8.2 TB / 1 TB ≈ 8 nodes § Co-locate Kafka and Storm; co-locate DataNode and HBase § Totals: DataNodes & HBase 32, NiFi 2, Kafka & Storm ingest nodes 3, client nodes 2, master nodes 5 = 44 nodes
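Putting the node counts together (this simply tabulates the slide's totals):

```python
# Node counts from the sizing exercise above.
nodes = {
    "DataNode + HBase workers": 32,   # sized by 1.5 PB / 48 TB per server
    "NiFi": 2,                        # 1 would suffice; 2 for HA
    "Kafka + Storm ingest": 3,        # 1 would suffice; 3 for HA
    "Client/edge": 2,
    "Master": 5,
}
total = sum(nodes.values())
print(total)  # -> 44
```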
  • 68. www.hortonworks.com Megacorp Datacenter [Diagram] Trucks 1-5000 → NiFi Nodes (NiFi 1-2, running HDF) → Edge/Ingest Nodes (Storm 1-3 co-located with Kafka 1-3) → HDP Worker Nodes (DataNode 1-8 co-located with HBase 1-8; DataNode 9-32), Master Nodes (Master 1-5), Clients 1-2
  • 69. Page 69 © Hortonworks Inc. 2011 – 2015. All Rights Reserved HDP Service Layout [Diagram] Master Nodes 1-5 host the control services: Namenode 1 & 2, Resource Manager 1 & 2, Zookeeper (×4), Journal Keeper (×3), HBase Master 1 & 2, Hiveserver, Hiveserver 2, WebHCat, Falcon, Oozie, History Server, Timeline Server, Kafka, and Ambari (Monitoring & Metrics). Worker Nodes 1-32 each run Node Manager, Datanode, and an HBase Region server. Ingest Nodes 1-3 each run Storm and Kafka. Edge Nodes 1-2 each run Clients and Knox.
  • 70. Page 70 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Master Node Specs: 12+ cores; 128-256 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network links. Approximate cost per node: $8,000 - $18,000
  • 71. Page 71 © Hortonworks Inc. 2011 – 2015. All Rights Reserved NiFi Node Specs: 8+ cores; 16 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network links. Approximate cost per node: $5,000 - $8,000
  • 72. Page 72 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Slave (Worker) Node Specs: 12+ cores; 32-64 GB RAM; one of 12 × 1 TB SATA drives (processing/IOPS optimized), 12 × 2 TB SATA drives (balanced), or 12 × 4 TB SATA drives (storage optimized); 1 × 1-10 Gb network link. Approximate cost per node: $5,000 - $12,000
  • 73. Page 73 © Hortonworks Inc. 2011 – 2015. All Rights Reserved IoT on HDP Problem Statement Reference Architecture & Sizing Solution Design & Customer Case Studies Implementation Plan Project Cost & ROI
  • 74. Page 74 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Project Plan: Strategy 10 days, Training 10 days, Design & Build 60 days, Test 30 days, Promote 10 days (activities: Use Case Workshop, Cluster Build-out, Solution Build-out, Prove-out, Promote Solution). Tam puts together a quick project plan and estimates it will take 120 days to deliver Ms. Brady's solution
  • 75. www.hortonworks.com 75 Resource Plan: Data Scientist Consultant Tam; Data Flow Consultant Varun; Architect Consultant Jeff; Developer Consultant Sue; Project Manager Jen; Engagement Manager Consultant Jim; Enterprise Architect Frank; Business Analyst Sue; Developer Jim
  • 76. IoT on HDP Problem Statement Reference Architecture & Sizing Solution Design & Customer Case Studies Implementation Plan Page 76 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Project Cost & ROI
  • 77. Project Cost
    - Hardware: 44 nodes × $10,000 = $440K
    - Software – HDP: 11 SKUs × $18,000/SKU = $198K
    - Software – HDF: 2 SKUs × $36,000/SKU = $72K
    - Dev and Test Consulting: 3,040 hrs* × $300/hr = $912K
    - Engagement Consulting: 360 hrs* × $300/hr = $108K
    - Training: 30 associates** × $2,500 = $75K
    - Travel & Expense: $100K
    - Total: $1.905M
    * 4 resources × 8 hrs × 95 days; engagement manager for 45 days
    ** Admin, Analyst & Data Science training for 30 associates
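Summing the line items (in thousands of dollars) confirms the project total; note the components add up to $1.905M, slightly above the $1.885M printed in the original deck.

```python
# Project cost line items, in $K, as listed on the slide.
costs_k = {
    "Hardware (44 nodes x $10K)": 44 * 10,
    "HDP software (11 SKUs x $18K)": 11 * 18,
    "HDF software (2 SKUs x $36K)": 2 * 36,
    "Dev & test consulting (3,040 hrs x $300/hr)": 3040 * 300 / 1000,
    "Engagement consulting (360 hrs x $300/hr)": 360 * 300 / 1000,
    "Training (30 x $2.5K)": 30 * 2.5,
    "Travel & expense": 100,
}
total_k = sum(costs_k.values())
print(total_k)  # -> 1905.0, i.e. ~$1.9M
```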
  • 78. Page 78 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Project ROI § Insurance cost reduction – $5M § Project cost – ~$1.9M § First-year savings – ~$3.1M
  • 79. Page 79 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tweet: #hadooproadshow Thank You