1. Next Generation of Apache Hadoop MapReduce
Owen O’Malley
oom@yahoo-inc.com
@owen_omalley
2. What is Hadoop?
A framework for storing and processing big data on lots of commodity machines.
- Up to 4,000 machines in a cluster
- Up to 20 PB in a cluster
Open Source Apache project
High reliability done in software
- Automated failover for data and computation
Implemented in Java
Primary data analysis platform at Yahoo!
- 40,000+ machines running Hadoop
3. What is Hadoop?
HDFS – Distributed File System
- Combines cluster’s local storage into a single namespace.
- All data is replicated to multiple machines.
- Provides locality information to clients
MapReduce
- Batch computation framework
- Tasks re-executed on failure
- User code wrapped around a distributed sort
- Optimizes for data locality of input
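The bullets above describe MapReduce as user code wrapped around a distributed sort. A minimal single-process sketch of that shape (plain Java, no Hadoop dependencies; the class and method names are illustrative, and the in-memory `TreeMap` stands in for the framework's distributed sort/shuffle):

```java
import java.util.*;

// Toy illustration of the MapReduce model: user-supplied map and reduce
// functions wrapped around a framework-provided sort/group step.
public class MiniMapReduce {

    // Map phase: user code turns each input record into (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle/sort phase: the framework groups values by sorted key
    // (in the real system this is a distributed sort across machines).
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: user code folds the grouped values for each key.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Word count: the classic example of user map/reduce code around the sort.
    public static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }
}
```

Only `map` and `reduce` are user code; the sort/group step in the middle is what the framework distributes and re-executes on failure.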
4. Case Study: Yahoo! Front Page
Content personalized for each visitor
Result: twice the engagement
- Recommended links: +79% clicks vs. randomly selected
- News Interests: +160% clicks vs. one size fits all
- Top Searches: +43% clicks vs. editor selected
6. Current Limitations
Scalability
- Maximum Cluster size – 4,000 nodes
- Maximum concurrent tasks – 40,000
- Coarse synchronization in JobTracker
Single point of failure
- Failure kills all queued and running jobs
- Jobs need to be re-submitted by users
Restart is very tricky due to complex state
Hard partition of resources into map and reduce slots
7. Current Limitations
Lacks support for alternate paradigms
- Iterative applications implemented using MapReduce are 10x slower.
- Users use MapReduce to run arbitrary code
- Example: K-Means, PageRank
Lack of wire-compatible protocols
- Client and cluster must be of same version
- Applications and workflows cannot migrate to different clusters
8. MapReduce Requirements for 2011
Reliability
Availability
Scalability
- Clusters of 6,000 machines
- Each machine with 16 cores, 48 GB RAM, 24 TB of disk
- 100,000 concurrent tasks
- 10,000 concurrent jobs
Wire Compatibility
Agility & Evolution – Ability for customers to control upgrades to the grid software stack.
9. MapReduce – Design Focus
Split up the two major functions of JobTracker
- Cluster resource management
- Application life-cycle management
MapReduce becomes user-land library
11. Architecture
Resource Manager
- Global resource scheduler
- Hierarchical queues
Node Manager
- Per-machine agent
- Manages container life-cycle
- Container resource monitoring
Application Master
- Per-application
- Manages application scheduling and task execution
- E.g. MapReduce Application Master
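The three roles above can be sketched as a toy, single-process model (plain Java; the class names and the memory-only scheduling rule are illustrative assumptions, not the real YARN API): the ResourceManager only tracks capacity and grants containers, each NodeManager launches containers on its machine, and the per-application ApplicationMaster drives its own tasks.

```java
import java.util.*;

// Toy model of the JobTracker split: a global scheduler, per-machine
// agents, and a per-application master. All names are illustrative.
public class MiniCluster {

    static class Node {
        final String host;
        int freeMemMb;
        Node(String host, int freeMemMb) { this.host = host; this.freeMemMb = freeMemMb; }
    }

    // Per-machine agent: launches containers on behalf of the scheduler.
    static class NodeManager {
        final Node node;
        NodeManager(Node node) { this.node = node; }
        String launch(String appId) { return "container on " + node.host + " for " + appId; }
    }

    // Global resource scheduler: knows capacity, nothing about app logic.
    static class ResourceManager {
        final List<NodeManager> nodes = new ArrayList<>();
        void register(NodeManager nm) { nodes.add(nm); }

        // Grant a container on the first node with enough free memory, or null.
        String allocate(String appId, int memMb) {
            for (NodeManager nm : nodes) {
                if (nm.node.freeMemMb >= memMb) {
                    nm.node.freeMemMb -= memMb;
                    return nm.launch(appId);
                }
            }
            return null;
        }
    }

    // Per-application master: requests containers and runs tasks in them.
    static class AppMaster {
        final String appId;
        AppMaster(String appId) { this.appId = appId; }
        List<String> run(ResourceManager rm, int tasks, int memPerTaskMb) {
            List<String> containers = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                String c = rm.allocate(appId, memPerTaskMb);
                if (c != null) containers.add(c);
            }
            return containers;
        }
    }
}
```

The point of the split: application-specific logic (task ordering, retries, checkpoints) lives entirely in the AppMaster, so the central scheduler stays small and generic.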
12. Improvements vis-à-vis current MapReduce
Scalability
- Application life-cycle management is very expensive
- Partition resource management and application life-cycle management
- Application management is distributed
- Hardware trends
• Machines are getting bigger and faster
• Moving toward 12 × 2 TB disks instead of 4 × 1 TB disks
• Enables more tasks per machine
13. Improvements vis-à-vis current MapReduce
Availability
- Application Master
• Optional failover via application-specific checkpoint
• MapReduce applications pick up where they left off
- Resource Manager
• No single point of failure - failover via ZooKeeper
• Application Masters are restarted automatically
14. Improvements vis-à-vis current MapReduce
Wire Compatibility
- Protocols are wire-compatible
- Old clients can talk to new servers
- Evolution toward rolling upgrades
15. Improvements vis-à-vis current MapReduce
Innovation and Agility
- MapReduce now becomes a user-land library
- Multiple versions of MapReduce can run in the same cluster (à la Apache Pig)
• Faster deployment cycles for improvements
- Customers upgrade MapReduce versions on their own schedule
- Users can use customized MapReduce versions without affecting everyone!
16. Improvements vis-à-vis current MapReduce
Utilization
- Generic resource model
• Memory
• CPU
• Disk bandwidth
• Network bandwidth
- Remove fixed partition of map and reduce slots
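The generic resource model above can be sketched as a small resource vector with a fits-and-subtract rule (illustrative; only memory and cores are modeled here, though the slide also lists disk and network bandwidth as dimensions). Any task, map or reduce, asks for a vector instead of a typed slot:

```java
// Sketch of a generic resource vector replacing fixed map/reduce slots.
// The dimensions and the fits() rule are illustrative assumptions.
public class ResourceModel {
    static class Resource {
        final int memMb, vcores;
        Resource(int memMb, int vcores) { this.memMb = memMb; this.vcores = vcores; }

        // A request fits if every dimension fits the node's free capacity.
        boolean fits(Resource free) {
            return memMb <= free.memMb && vcores <= free.vcores;
        }

        // Remaining free capacity after granting this request.
        Resource minus(Resource r) {
            return new Resource(memMb - r.memMb, vcores - r.vcores);
        }
    }
}
```

With typed slots, a node full of idle reduce slots cannot run a waiting map task; with a shared vector, any mix of tasks can pack onto the node until its capacity is exhausted, which is the utilization win the slide claims.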
17. Improvements vis-à-vis current MapReduce
Support for programming paradigms other than MapReduce
- MPI
- Master-Worker
- Machine Learning and Iterative processing
- Enabled by paradigm-specific Application Master
- All can run on the same Hadoop cluster
18. Summary
Takes Hadoop to the next level
- Scale-out even further
- High availability
- Cluster Utilization
- Support for paradigms other than MapReduce