Huhadoop - v1.1

4/18/2014
Prepared for:
Big Data Expedition Roadshow
Presented by:
“Big Data Joe” Rossi
Huhadoop?

Hadoop 1.0 – HDFS + MapReduce
[Diagram: Client, NameNode, Secondary NameNode / JobTracker, and DataNode / TaskTracker nodes storing blocks 1-1, 1-2, and 1-3]

Hadoop 1.0 – HDFS + MapReduce
[Diagram: Client, NameNode, Secondary NameNode / JobTracker, and worker nodes holding blocks 1-1 through 4-3, each node running Map and Reduce tasks]
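As a rough illustration of the Map and Reduce steps shown on this slide, here is a minimal single-process sketch of a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample blocks are assumptions made for the example, and a real job would run each map on the node that stores the block.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(blocks):
    # In a cluster each block is mapped where it is stored; here we simply loop.
    emitted = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Hypothetical input: three small "blocks" of text.
blocks = ["big data big cluster", "data node", "big node"]
counts = word_count(blocks)
print(counts)
```

The point of the sketch is only the shape of the computation: independent maps per block, a grouping step, then one reduce per key.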

MapReduce v1 Limitations
Scalability
Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability
JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce
Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services
Only MapReduce batch jobs, nothing else
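The utilization point above can be made concrete with a little arithmetic. The sketch below compares a hard split into map and reduce slots with a single shared pool. The slot counts and task counts are hypothetical, and the functions are illustrative helpers, not part of any Hadoop API.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1 style: map tasks may only use map slots, reduce tasks only reduce slots."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running / (map_slots + reduce_slots)

def pooled_utilization(total_slots, map_tasks, reduce_tasks):
    """Shared pool: any free unit of capacity can run either kind of task."""
    running = min(total_slots, map_tasks + reduce_tasks)
    return running / total_slots

# Hypothetical cluster: 60 map slots, 40 reduce slots, and a map-heavy workload.
hard = slot_utilization(60, 40, map_tasks=100, reduce_tasks=5)
pooled = pooled_utilization(100, map_tasks=100, reduce_tasks=5)
print(f"hard partition: {hard:.0%}, shared pool: {pooled:.0%}")
```

With these made-up numbers, 35 reduce slots sit idle under the hard split while 40 map tasks wait, which is the kind of waste the slide describes.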

Apache Hadoop 1.0: Single Use System
Batch Apps: Pig, Hive
MapReduce (cluster resource management and data processing)
HDFS (redundant, reliable storage)

YARN Replaces MapReduce
YARN: Yet Another Resource Negotiator
YARN will be the de-facto distributed operating system for Big Data

YARN: Taking Hadoop Beyond Batch
Store DATA in one place
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
BATCH (MapReduce)
INTERACTIVE (Tez, Spark)
ONLINE (HBase)
STREAMING (DataTorrent)
GRAPH (Giraph)
YARN (cluster resource management)
HDFS2

YARN: Moving Quickly
Timeline: 2010 through 2014 (today)
Conceived at Yahoo!
Alpha Releases – 2.0
Beta Releases – 2.1
GA Released – 2.2
100,000+ nodes, 400,000+ jobs daily
10 million+ hours of compute daily
Version 2.3

YARN: Applications
Graph Processing
MapReduce v2
Real-Time Streaming Analytics
Master-Worker
Online
Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: What Has Changed?
YARN                                               MRv1
RM (ResourceManager), AM (ApplicationMaster)       JT (JobTracker)
Scheduler                                          Scheduler
NM (NodeManager)                                   TT (TaskTracker)
Container                                          Map / Reduce
[Diagram: YARN ResourceManager (Scheduler) with NodeManagers hosting an ApplicationMaster and Containers, beside MRv1 JobTracker (Scheduler) with TaskTrackers running Map and Reduce]

7 Benefits of YARN
Scale
New programming models and services
Improved cluster utilization
Agility
Backwards compatible with MapReduce v1
Mixed workloads on the same source of data
Enables running apps in memory within the cluster

The Future of Hadoop
Projects and Roadmap

Stinger: Interactive Query for Hive
Speed
Deliver interactive query through a 100x performance increase compared to Hive 0.10.
SQL
Support the broadest array of SQL semantics for analytic applications running against Hadoop.
Scale
The only SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.

Stinger: Speed – Apache Tez
HDFS2
YARN
Tez
(execution layer)
MR Pig Hive

HOYA: HBase on YARN
Dynamic Scaling
On-demand cluster size. Increase and decrease the size with load.
Easier Deployment
APIs to create, start, stop and delete HBase clusters.
Availability
Recover from Region Server loss with a new container.
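To make "increase and decrease the size with load" concrete, here is a small sketch of a scaling decision. It does not use the HOYA API: the function names, the per-server capacity, and the action labels are all hypothetical and only show the arithmetic a controller might perform before asking YARN for more or fewer containers.

```python
import math

def desired_region_servers(requests_per_sec, capacity_per_server,
                           min_servers=1, max_servers=10):
    """Return how many region servers the load calls for, clamped to [min, max]."""
    needed = math.ceil(requests_per_sec / capacity_per_server)
    return max(min_servers, min(max_servers, needed))

def scaling_action(current, desired):
    """Translate the gap into a request for more containers or a release of some."""
    if desired > current:
        return ("request_containers", desired - current)
    if desired < current:
        return ("release_containers", current - desired)
    return ("no_change", 0)

# Hypothetical load: 4,500 requests/s, each region server handles about 1,000.
desired = desired_region_servers(4500, 1000)
print(desired, scaling_action(current=3, desired=desired))
```

The same decision covers the availability case: a lost Region Server lowers `current`, so the next pass requests a replacement container.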

Microsoft REEF (Retainable Evaluator Execution Framework)
Machine Learning
Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant
Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State
Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.

Heterogeneous Storages in HDFS
[Diagram: NameNode and storage types: SATA, SSD, Fusion IO]

Hadoop Roadmap
Apache Hadoop 2.4 (Early Q2 2014)
ResourceManager HA / Auto Failover
HDFS Rolling Upgrades
Apache Hadoop 2.5 (Mid Q2 2014)
NodeManager Restart w/o disruption
Dynamic Resource Configuration

Questions?
No such thing as a stupid question.

Thank You!
Big Data Joe Rossi:
http://about.me/bigdatajoe
jrossi@trace3.com
c. 858.761.2918

Supporting Slides
Slides with information that may be asked about

YARN: How It Works
[Diagram: Client, ResourceManager (Scheduler), and four NodeManagers hosting one ApplicationMaster and three Containers]
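The flow in this diagram can be sketched as a toy simulation. This is not the YARN API: the `NodeManager` and `ResourceManager` classes below are simplified stand-ins, memory is the only resource tracked, and the first-fit placement rule is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

@dataclass
class ResourceManager:
    nodes: list

    def allocate(self, app, memory_mb):
        """Scheduler: place a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app, memory_mb))
                return node.name
        return None  # no capacity left; a real scheduler would queue the request

rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
# The client first asks the ResourceManager to launch the ApplicationMaster...
am_node = rm.allocate("app-1/AM", 1024)
# ...then the ApplicationMaster requests containers for its own tasks.
task_nodes = [rm.allocate("app-1/task", 2048) for _ in range(4)]
print(am_node, task_nodes)
```

Because containers are sized per request rather than fixed as map or reduce slots, the same loop could place an HBase Region Server or a streaming operator just as easily as a MapReduce task.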

YARN: Example App Deployment
[Diagram: HOYA Client, ResourceManager (Scheduler), and four NodeManagers hosting the HOYA / HBase Master and three Region Servers]

Storm Vs. DataTorrent
Solution Matrix DataTorrent Apache Storm
Atomic Micro-batch 1 3
Events per Second Billions Thousands
Automated Parallelism 3
Dynamic Runtime Changes 3
Linear Scalability 3
State Checkpointing 3

Apache Spark + Shark
HDFS2
YARN
Apache Spark
Shark
Hive
(sql)

Hadoop 2.x – YARN + HDFS
[Diagram: NameNode, Standby NameNode / ResourceManager, and four DataNode / NodeManager nodes, each running two Containers]

YARN: Key Take-Aways
Backwards Compatible
YARN is backwards compatible with your existing MapReduce applications, so you can get value from it right away.
Resource Management
YARN enables fine-grained resource management for better cluster utilization.
One Source of Data
YARN allows you to interact with one source of data in multiple ways while maintaining predictable performance and quality of service.
Enabling Smart People
YARN is a flexible framework that enables smart people and companies to do amazing things with data.
YARN will be the de-facto distributed operating system for Big Data

Storm Vs. DataTorrent - Detailed
Solution Matrix                     DataTorrent          Apache Storm
Proprietary / Open Source           O                    O
Support for Hadoop 1.x              1                    1
Support for Hadoop 2.x              1                    1
Native YARN                         1                    3
Dashboard                           1                    3
Extensible via Modules              1                    1
Technical Support                   1                    1
Atomic Micro-batch                  1                    3
Events per Second                   Billions             Thousands
Automated Parallelism               1                    3
Dynamic Runtime Changes             1                    3
High Availability                   1                    2
Prog. Languages Supported           Java, Python, etc.   Java, Python, etc.
Log Analysis                        1                    3
Site Operations                     1                    3
MapReduce Diagnostics               1                    3
Open Source Operators Library       1                    2
Open Source Application Templates   1                    3
Complex Computations (DAG)          1                    3
Linear Scalability                  1                    3
Security                            1                    3
CLI and Macros                      1                    3
Configuration Based Specification   1                    3
State Checkpointing                 1                    3

The 1st Generation Of Hadoop
Users forced to create data system silos for managing mixed workloads
Developers forced to bend the very specific MapReduce model to fit their use cases
Hadoop
HBase

Stinger: HiveQL – SQL Support
Hive SQL Datatypes Hive SQL Semantics

Apache Spark
HDFS2
YARN
Apache Spark
Shark
Hive
(sql)
Spark Streaming
MLlib (machine learning)

Project Management Committee Members
[Bar chart by company. Categories: Hortonworks, Others, Cloudera, Yahoo!, Facebook. Values as extracted: 7, 6, 3, 15, 11.]

Project Committers
[Bar chart by company. Categories: Hortonworks, Others, Cloudera, Yahoo!, Facebook. Values as extracted: 24, 24, 11, 11, 5.]

YARN: Why The De-Facto Distributed OS
Technology Adoption
100,000+ nodes, 400,000+ jobs, and 10 million+ compute hours daily
Enables Innovation
Enables smart people and companies to do amazing things with data
Financial Backing
568m+ invested in Hadoop-contributing companies, nearly 400m in 2013 alone

Apache Storm Topology
[Diagram: two Spouts (Data Source) emit Streams into Bolts (Filter, Calculation, RDBMS Writes, HDFS Writes); results land in an RDBMS and in HDFS]
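The spout-and-bolt wiring in this topology can be imitated with plain Python generators. This is only an analogy, not the Storm API: the function names, the threshold, and the list standing in for HDFS are hypothetical, and everything runs in one process rather than across a cluster.

```python
def spout(events):
    """Spout: a data source that emits a stream of tuples."""
    yield from events

def filter_bolt(stream, min_value):
    """Bolt (Filter): pass along only tuples at or above min_value."""
    for key, value in stream:
        if value >= min_value:
            yield key, value

def calculation_bolt(stream):
    """Bolt (Calculation): keep a running total per key and emit each update."""
    totals = {}
    for key, value in stream:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]

def sink_bolt(stream, store):
    """Bolt (Writes): stand-in for the RDBMS / HDFS writers; appends to a list."""
    for record in stream:
        store.append(record)

hdfs = []  # hypothetical stand-in for an HDFS file
events = [("a", 5), ("b", 1), ("a", 7), ("b", 9)]
sink_bolt(calculation_bolt(filter_bolt(spout(events), min_value=5)), hdfs)
print(hdfs)
```

Each tuple flows through the whole chain as soon as it is emitted, which mirrors the record-at-a-time processing that distinguishes a streaming topology from a batch job.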

Hadoop 1.0 – MR + HDFS
[Diagram: NameNode and Secondary NameNode / JobTracker above four worker nodes, each with Map and Reduce]

Hadoop 1.0 – MapReduce
[Diagram: one JobTracker coordinating four TaskTrackers, each with Map and Reduce]

YARN: Uncharted Territory
You
Are Here
Technology
Value
