Cloud-Friendly Hadoop and Hive - StampedeCon 2013

CLOUD FRIENDLY HADOOP/HIVE
Shrikanth Shankar | Qubole
VP of Engineering
Thursday, July 25, 2013

INTRODUCTION
• Hadoop has revolutionized big data processing
• Becoming the de-facto platform for new data projects
• Started as file system (HDFS) + Programming framework (Map-Reduce). An ecosystem of
projects has sprung up on top of Hadoop
• Hive, Pig, Cascading etc. - Simple ways of processing data
• Sqoop, Flume etc. - Data movement into and out of HDFS
• Oozie, Azkaban etc. - Workflow scheduling
• However, these systems were all designed with an on-premise architecture in mind.
• The cloud is different enough - Some things can/should change.

ON-PREMISE HADOOP ARCHITECTURE
[Diagram labels: End Users; IT control; Hadoop Cluster with Namenode, JobTracker, and DN/TT (DataNode/TaskTracker) nodes; Relational systems (Hive metastore etc.)]

HADOOP ON-PREMISE
• Usually deployed on bare-metal nodes*
• HDFS is store of choice (3-way replication for safety). Locality of data
access is a big design point
• Clusters are mostly static - new machines are added on IT schedule*
• Static clusters mean users can focus on their tasks (MR jobs, Hive
queries) and not on cluster management
• IT bears the burden of managing clusters

HADOOP ON-PREMISE
• Partitioning of resources
• Static partitioning with different clusters for Batch and
Interactive workloads
• Within a cluster load balancing is done by the JT scheduler
• Capex costs are significant
• IT controlled - requires an Ops team (Hadoop ops, Sysadmin
etc.)

CLOUD ARCHITECTURE
HIGHLY AWS CENTRIC - BUT EVERYONE IS
FOLLOWING FAST

CLOUD COMPONENTS
• Object Stores
• Ephemeral compute nodes
• Block Stores
• PaaS Offerings (RDS, etc.)

INFRASTRUCTURE
CHARACTERISTICS
• Running in a VM
• Not that big a deal usually - except plan for performance variability
• No locality information
• Nodes are ephemeral - if you lose a node you will lose data on the node
• AZ-wide correlated failures are to be expected. Region-wide failures are possible (but rare)
• High capacity Object stores with high cross sectional bandwidth
• High latency, Variability in perf, REMOTE*. Not POSIX compliant
• Persistent block stores
• REMOTE, Variable perf,

INFRASTRUCTURE
CHARACTERISTICS
• ELASTIC
• Add 100 nodes on demand in a few minutes
• Costs are Op-ex (largely).
• Nodes are billed per hour (CPU + Disk), Storage is billed per GB
• Cost management is a key challenge
• Some interesting payment choices (On-demand, Spot, Reserved)

LET'S PUT THESE WORLDS
TOGETHER

STORAGE
• From a cost perspective using HDFS for long term storage
means you pay for both CPU and disk.
• It's also more expensive to make HDFS reliable (cross AZ,
maybe even cross Region?)
• Using an object store allows you to pay only for storage
• With object stores you see latency issues since data is remote
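The cost argument above can be made concrete with a little arithmetic. This is a minimal sketch, not a pricing model: the `Prices` fields, the 730-hour month, and every number a caller passes in are hypothetical placeholders rather than real cloud rates. The only fact taken from the slides is the 3-way HDFS replication.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average hours in a month


@dataclass
class Prices:
    """Hypothetical prices; substitute your provider's real rates."""
    node_per_hour: float              # one compute node (CPU + local disk)
    object_store_per_gb_month: float  # object storage, per GB per month


def hdfs_monthly_cost(data_gb, disk_gb_per_node, prices, replication=3):
    """Cost of keeping data in HDFS: nodes must stay up just to hold the disks."""
    raw_gb = data_gb * replication  # every block is stored `replication` times
    nodes = math.ceil(raw_gb / disk_gb_per_node)
    return nodes * prices.node_per_hour * HOURS_PER_MONTH


def object_store_monthly_cost(data_gb, prices):
    """Cost of keeping the same data in an object store: storage only."""
    return data_gb * prices.object_store_per_gb_month
```

Calling both functions with the same `data_gb` shows how the HDFS bill grows with node-hours while the object store bill grows only with GB stored. Neither function accounts for the latency cost of remote reads.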

STORAGE
• But node storage is still needed when jobs and queries are
active
• For intermediate job results (not all results should go back
to S3 - e.g. stage outputs in Hive)
• For intermediate data (mapper output)
• Makes scaling nodes challenging
• Also since performance is better - may want to move remote
data to HDFS before accessing

COMPUTE AND CLUSTERS
• If you don't need Hadoop for persistent storage - when do
you need a cluster?
• Bring them up on demand - maybe for every job?
• But that can be expensive - no multiplexing
• Ideally you want to share Hadoop clusters as much as
possible. Shut down cluster when not being used
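One way to read "share the cluster, shut it down when not being used" is as an idle-timeout rule. The sketch below assumes a controller that knows how many jobs are running and when the cluster was last active. The function name and the 30-minute default are illustrative assumptions, not something the talk specifies.

```python
from datetime import datetime, timedelta


def should_shut_down(running_jobs, last_activity, now,
                     idle_timeout=timedelta(minutes=30)):
    """Return True only when nothing is running and the cluster has been
    idle for at least `idle_timeout` (a hypothetical default)."""
    if running_jobs > 0:
        return False  # a shared cluster stays up while any user has work on it
    return now - last_activity >= idle_timeout
```

A longer timeout favors multiplexing (the next job finds a warm cluster); a shorter one favors cost.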

COMPUTE AND CLUSTERS
• If the cluster is dynamic and you need sharing - how do you
‘discover’ it?
• How about cluster sizing?
• Static is a leftover from on-premise
• Be dynamic on the cloud. Hard for end users to do manually

COMPUTE AND CLUSTERS
• Adding nodes needs to be done based on load
• E.g. Most of the time jobs need < 5 nodes. A batch job
comes in that needs 100 nodes. We should expand the cluster
(for as long as needed)
• Removing nodes is trickier
• If we lose intermediate results lots of work will be lost.
• Job1 uses 100 nodes, produces data spread over all of them.
Job 2 consumes results but only needs 10 nodes. How do you
give up 90 nodes?
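The asymmetry between adding and removing nodes can be sketched as two small functions. This is an illustration under stated assumptions, not the speaker's implementation: the per-node status fields (`running_tasks`, `holds_intermediate_data`) and the task-slot sizing rule are hypothetical.

```python
import math


def target_cluster_size(pending_tasks, slots_per_node, min_nodes, max_nodes):
    """Size the cluster from load: enough nodes to run all pending tasks,
    clamped to the configured bounds."""
    needed = math.ceil(pending_tasks / slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


def nodes_safe_to_release(nodes, target_size):
    """Pick surplus nodes that can be given up without losing work.

    `nodes` maps node id -> {"running_tasks": int,
                             "holds_intermediate_data": bool}
    (a hypothetical status record). A node is released only if it is idle
    and holds no intermediate results a later job still needs.
    """
    surplus = len(nodes) - target_size
    if surplus <= 0:
        return []
    idle = [node_id for node_id, status in sorted(nodes.items())
            if status["running_tasks"] == 0
            and not status["holds_intermediate_data"]]
    return idle[:surplus]
```

In the Job 1 / Job 2 example above, all 100 nodes hold intermediate data, so `nodes_safe_to_release` returns nothing until that data has been consumed or moved. That is exactly the difficulty the slide points at.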

COMPUTE AND CLUSTERS
• Pricing choices are interesting
• E.g. spot nodes average half the price of an on-demand
node
• But if price spikes you lose all the spot nodes at once
• Hadoop fault tolerance can retry failed jobs (but expensive) -
what about data loss when you lose all the spot nodes?
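The spot trade-off can be put in numbers. The sketch below assumes a cluster that mixes on-demand and spot nodes; the function name and the returned fields are hypothetical, and the 0.5 default simply encodes the "half the price on average" figure from the slide.

```python
def plan_mixed_cluster(total_nodes, on_demand_nodes, on_demand_price,
                       spot_price_ratio=0.5):
    """Hourly cost of a mixed cluster, plus how many nodes survive if every
    spot node is reclaimed at once."""
    if not 0 <= on_demand_nodes <= total_nodes:
        raise ValueError("on_demand_nodes must be between 0 and total_nodes")
    spot_nodes = total_nodes - on_demand_nodes
    hourly_cost = (on_demand_nodes * on_demand_price
                   + spot_nodes * on_demand_price * spot_price_ratio)
    return {
        "on_demand_nodes": on_demand_nodes,
        "spot_nodes": spot_nodes,
        "hourly_cost": hourly_cost,
        # Only these nodes remain after a price spike, so any data that
        # must survive has to fit on them (or live in the object store).
        "nodes_left_if_spot_lost": on_demand_nodes,
    }
```

A cluster that is all spot is cheapest but leaves zero nodes after a spike, which is the data-loss question the slide raises.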

END USER EXPERIENCE
• The cloud isn't just about cost - it's also about agility. To allow
this we need to focus on the end user experience
• End users would prefer to focus on higher level APIs
• e.g. Run a Hadoop job or a Hive query - specifics of
clusters should be hidden from them
• Some things should be persistent (log files, results, ...)
• They get this for free on premise

BETTER END STATE
• IT/dev ops/users should set high level controls
• Usage governance (max cluster size, max bill, cpu hours used
per month etc.)
• End users should focus at the level they understand
• Smart software should bridge the gap
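The usage-governance controls listed above could be expressed as a small policy object that the "smart software" checks before growing a cluster. This is a sketch only: `UsagePolicy`, its field names, and the `violations` helper are hypothetical and mirror the three examples on the slide (max cluster size, max bill, CPU hours used per month).

```python
from dataclasses import dataclass


@dataclass
class UsagePolicy:
    """High-level limits set by IT/dev ops (hypothetical field names)."""
    max_cluster_size: int
    max_monthly_bill: float
    max_cpu_hours_per_month: float


def violations(policy, requested_nodes, projected_bill, cpu_hours_used):
    """Return a list of human-readable policy violations (empty if none)."""
    problems = []
    if requested_nodes > policy.max_cluster_size:
        problems.append("cluster size exceeds limit")
    if projected_bill > policy.max_monthly_bill:
        problems.append("projected bill exceeds limit")
    if cpu_hours_used > policy.max_cpu_hours_per_month:
        problems.append("CPU hours exceed monthly limit")
    return problems
```

End users never see the policy; they submit a job or query, and the scaling logic refuses or caps a request when `violations` is non-empty.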

QUESTIONS?
