Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal

•Transferir como PPTX, PDF•

2 gostaram•547 visualizações

Yahoo Developer Network

The Next Generation of Hadoop Map-Reduce Sharad Agarwal sharadag@yahoo-inc.com sharad@apache.org

About Me Hadoop Committer and PMC member Architect at Yahoo!

Hadoop Map-Reduce Today JobTracker Manages cluster resources and job scheduling TaskTracker Per-node agent Manage tasks

Current Limitations Scalability Maximum Cluster size – 4,000 nodes Maximum concurrent tasks – 40,000 Coarse synchronization in JobTracker Single point of failure Failure kills all queued and running jobs Jobs need to be re-submitted by users Restart is very tricky due to complex state Hard partition of resources into map and reduce slots

Current Limitations Lacks support for alternate paradigms Iterative applications implemented using Map-Reduce are 10x slower. Example: K-Means, PageRank Lack of wire-compatible protocols Client and cluster must be of same version Applications and workflows cannot migrate to different clusters

Next Generation Map-Reduce Requirements Reliability Availability Scalability - Clusters of 6,000 machines Each machine with 16 cores, 48G RAM, 24TB disks 100,000 concurrent tasks 10,000 concurrent jobs Wire Compatibility Agility & Evolution – Ability for customers to control upgrades to the grid software stack.

Next Generation Map-Reduce – Design Centre Split up the two major functions of JobTracker Cluster resource management Application life-cycle management Map-Reduce becomes user-land library

Architecture Resource Manager Global resource scheduler Hierarchical queues Node Manager Per-machine agent Manages the life-cycle of container Container resource monitoring Application Master Per-application Manages application scheduling and task execution E.g. Map-Reduce Application Master

Improvements vis-à-vis current Map-Reduce Scalability Application life-cycle management is very expensive Partition resource management and application life-cycle management Application management is distributed Hardware trends - Currently run clusters of 4,000 machines 6,000 2012 machines > 12,000 2009 machines <8 cores, 16G, 4TB> v/s <16+ cores, 48/96G, 24TB>

Improvements vis-à-vis current Map-Reduce Availability Application Master Optional failover via application-specific checkpoint Map-Reduce applications pick up where they left off Resource Manager No single point of failure - failover via ZooKeeper Application Masters are restarted automatically

Improvements vis-à-vis current Map-Reduce Wire Compatibility Protocols are wire-compatible Old clients can talk to new servers Rolling upgrades

Improvements vis-à-vis current Map-Reduce Agility / Evolution Map-Reduce now becomes a user-land library Multiple versions of Map-Reduce can run in the same cluster (ala Apache Pig) Faster deployment cycles for improvements Customers upgrade Map-Reduce versions on their schedule

Improvements vis-à-vis current Map-Reduce Utilization Generic resource model Memory CPU Disk b/w Network b/w Remove fixed partition of map and reduce slots

Improvements vis-à-vis current Map-Reduce Support for programming paradigms other than Map-Reduce MPI Master-Worker Machine Learning Iterative processing Enabled by allowing use of paradigm-specific Application Master Run all on the same Hadoop cluster

Summary The next generation of Map-Reduce takes Hadoop to the next level Scale-out even further High availability Cluster Utilization Support for paradigms other than Map-Reduce

Mais conteúdo relacionado

Destaque

The 3 to 5 RuleBill Jensen

07 2a_b_g

Market Basket Analysis Algorithm with Map/Reduce of Cloud ComputingJongwook Woo

Map ReduceSri Prasanna

Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop Jongwook Woo

Super Barcode Training Camp - Motorola AirDefense Wireless Security PresentationSystem ID Warehouse

Hadoop World VerticaOmer Trajman

DFA Minimization in Map-ReduceIraj Hedayati

Big Data Analysis With RHadoopDavid Chiu

Wireless hacking and securityAdel Zalok

Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network

2nd year pre clinical RPD Terminology, Components and Classification of parti...Sajjad Hussain

Presentation on Wireless border security systemStudent

Wireless Security null seminarNilesh Sapariya

types of dental surveyorNimrah Razzaque Memon

Application of MapReduce in Cloud ComputingMohammad Mustaqeem

Dental surveying of Removal partial dentureAli Alarasy

Mouth preparation for removable partial denture/ dental education in indiaIndian dental academy

Diagnosis and treatment planning of Removable Partial Denture dwijk

Mouth preparation for removable partial dentures /certified fixed orthodontic...Indian dental academy

Destaque (20)

The 3 to 5 Rule

07 2

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

Map Reduce

Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop

Super Barcode Training Camp - Motorola AirDefense Wireless Security Presentation

Hadoop World Vertica

DFA Minimization in Map-Reduce

Big Data Analysis With RHadoop

Wireless hacking and security

Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application

2nd year pre clinical RPD Terminology, Components and Classification of parti...

Presentation on Wireless border security system

Wireless Security null seminar

types of dental surveyor

Application of MapReduce in Cloud Computing

Dental surveying of Removal partial denture

Mouth preparation for removable partial denture/ dental education in india

Diagnosis and treatment planning of Removable Partial Denture

Mouth preparation for removable partial dentures /certified fixed orthodontic...

Semelhante a Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal

YARN Hadoop Summit Bangalore 2011Sharad Agarwal

Next Generation of Hadoop MapReducehuguk

Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks

Hadoop bangalore-meetup-dec-2011-hadoop nextgenInMobi

Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Cloudera, Inc.

YARN: Future of Data Processing with Apache HadoopHortonworks

NextGen Apache Hadoop MapReduceHortonworks

A sdn based application aware and network provisioningStanley Wang

Hadoop World 2011, Apache Hadoop MapReduce Next GenHortonworks

February 2014 HUG : Tez Details and InsidesYahoo Developer Network

Bikas saha：the next generation of hadoop– hadoop 2 and yarnhdhappy001

YARN - Hadoop Next Generation Compute PlatformBikas Saha

Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks

Mantle for DevelopersElectronic Arts / DICE

Get Started Building YARN ApplicationsHortonworks

YARN - Next Generation Compute Platform fo HadoopHortonworks

Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkShay Hassidim

Hadoop 2 @ Twitter, Elephant ScaleDataWorks Summit

Hadoop 2 @Twitter, Elephant Scale. Presented at lohitvijayarenu

Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892

Semelhante a Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal (20)

YARN Hadoop Summit Bangalore 2011

Next Generation of Hadoop MapReduce

Apache Hadoop YARN - The Future of Data Processing with Hadoop

Hadoop bangalore-meetup-dec-2011-hadoop nextgen

Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...

YARN: Future of Data Processing with Apache Hadoop

NextGen Apache Hadoop MapReduce

A sdn based application aware and network provisioning

Hadoop World 2011, Apache Hadoop MapReduce Next Gen

February 2014 HUG : Tez Details and Insides

Bikas saha：the next generation of hadoop– hadoop 2 and yarn

YARN - Hadoop Next Generation Compute Platform

Apache Hadoop YARN: Understanding the Data Operating System of Hadoop

Mantle for Developers

Get Started Building YARN Applications

YARN - Next Generation Compute Platform fo Hadoop

Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark

Hadoop 2 @ Twitter, Elephant Scale

Hadoop 2 @Twitter, Elephant Scale. Presented at

Running Non-MapReduce Big Data Applications on Apache Hadoop

Mais de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network

Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network

Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network

Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network

CICD at Oath using ScrewdriverYahoo Developer Network

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network

Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network

Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network

HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network

Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network

Architecting Petabyte Scale AI ApplicationsYahoo Developer Network

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network

Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network

February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network

February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network

Mais de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media

Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...

Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan

Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...

CICD at Oath using Screwdriver

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool

Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...

Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...

HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...

Moving the Oath Grid to Docker, Eric Badger, Oath

Architecting Petabyte Scale AI Applications

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...

Jun 2017 HUG: YARN Scheduling – A Step Beyond

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...

February 2017 HUG: Exactly-once end-to-end processing with Apache Apex

February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics

Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal

1. The Next Generation of Hadoop Map-Reduce Sharad Agarwal sharadag@yahoo-inc.com sharad@apache.org

2. About Me Hadoop Committer and PMC member Architect at Yahoo!

3. Hadoop Map-Reduce Today JobTracker Manages cluster resources and job scheduling TaskTracker Per-node agent Manage tasks

4. Current Limitations Scalability Maximum Cluster size – 4,000 nodes Maximum concurrent tasks – 40,000 Coarse synchronization in JobTracker Single point of failure Failure kills all queued and running jobs Jobs need to be re-submitted by users Restart is very tricky due to complex state Hard partition of resources into map and reduce slots

5. Current Limitations Lacks support for alternate paradigms Iterative applications implemented using Map-Reduce are 10x slower. Example: K-Means, PageRank Lack of wire-compatible protocols Client and cluster must be of same version Applications and workflows cannot migrate to different clusters

6. Next Generation Map-Reduce Requirements Reliability Availability Scalability - Clusters of 6,000 machines Each machine with 16 cores, 48G RAM, 24TB disks 100,000 concurrent tasks 10,000 concurrent jobs Wire Compatibility Agility & Evolution – Ability for customers to control upgrades to the grid software stack.

7. Next Generation Map-Reduce – Design Centre Split up the two major functions of JobTracker Cluster resource management Application life-cycle management Map-Reduce becomes user-land library

8. Architecture

9. Architecture Resource Manager Global resource scheduler Hierarchical queues Node Manager Per-machine agent Manages the life-cycle of container Container resource monitoring Application Master Per-application Manages application scheduling and task execution E.g. Map-Reduce Application Master

10. Improvements vis-à-vis current Map-Reduce Scalability Application life-cycle management is very expensive Partition resource management and application life-cycle management Application management is distributed Hardware trends - Currently run clusters of 4,000 machines 6,000 2012 machines > 12,000 2009 machines <8 cores, 16G, 4TB> v/s <16+ cores, 48/96G, 24TB>

11. Improvements vis-à-vis current Map-Reduce Availability Application Master Optional failover via application-specific checkpoint Map-Reduce applications pick up where they left off Resource Manager No single point of failure - failover via ZooKeeper Application Masters are restarted automatically

12. Improvements vis-à-vis current Map-Reduce Wire Compatibility Protocols are wire-compatible Old clients can talk to new servers Rolling upgrades

13. Improvements vis-à-vis current Map-Reduce Agility / Evolution Map-Reduce now becomes a user-land library Multiple versions of Map-Reduce can run in the same cluster (ala Apache Pig) Faster deployment cycles for improvements Customers upgrade Map-Reduce versions on their schedule

14. Improvements vis-à-vis current Map-Reduce Utilization Generic resource model Memory CPU Disk b/w Network b/w Remove fixed partition of map and reduce slots

15. Improvements vis-à-vis current Map-Reduce Support for programming paradigms other than Map-Reduce MPI Master-Worker Machine Learning Iterative processing Enabled by allowing use of paradigm-specific Application Master Run all on the same Hadoop cluster

16. Summary The next generation of Map-Reduce takes Hadoop to the next level Scale-out even further High availability Cluster Utilization Support for paradigms other than Map-Reduce

17. Questions?

Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal

Recomendados

Recomendados

Mais conteúdo relacionado

Destaque

Destaque (20)

Semelhante a Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal

Semelhante a Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal (20)

Mais de Yahoo Developer Network

Mais de Yahoo Developer Network (20)

Apache Hadoop India Summit 2011 talk "The Next Generation of Hadoop MapReduce" by Sharad Agrawal