As compute workloads grow and more users bring diverse business use cases, each with its own resource availability requirements, cluster admins need an operationally flexible and scalable way to keep cluster utilization high while allocating resources fairly across business organizations. To this end, we added improvements to Hadoop YARN that allow for:
Dynamically configuring cluster and queue configurations via API/CLI,
Finer control over queue capacities, for example specifying absolute resources instead of percentages for queue capacity, and
Better control of queue hierarchy by supporting queue add/remove/rename/move without restarting ResourceManager.
This talk will first go over our motivations for improving queue management. Next, we will go through each enhancement with examples of how to use it. Finally, we will show how LinkedIn uses these enhancements on multi-thousand-node clusters, not only to facilitate queue management but also to build tools that improve compute utilization and resource usage monitoring.
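As a rough illustration of what "configuring queues via API" can look like, the sketch below builds an XML request body that updates properties on one existing queue. The element names (sched-conf, update-queue, queue-name, params, entry) are an assumption based on the scheduler configuration format documented for later Hadoop releases of the YARN-5734 work, so verify them against the release you run. The capacity values are hypothetical, and the queue path root.business is borrowed from the production example later in the deck.

```python
import xml.etree.ElementTree as ET


def build_sched_conf_update(queue_path, params):
    """Build an XML body that updates properties of one existing queue.

    Element names are assumed, not taken from the slides; check them
    against the scheduler configuration API of your Hadoop release.
    """
    root = ET.Element("sched-conf")
    update = ET.SubElement(root, "update-queue")
    ET.SubElement(update, "queue-name").text = queue_path
    params_el = ET.SubElement(update, "params")
    for key, value in params.items():
        entry = ET.SubElement(params_el, "entry")
        ET.SubElement(entry, "key").text = key
        ET.SubElement(entry, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical change: resize root.business without touching any file on the RM.
body = build_sched_conf_update(
    "root.business", {"capacity": 40, "maximum-capacity": 60}
)
print(body)
```

A body like this would be submitted to the ResourceManager over HTTP rather than written into capacity-scheduler.xml by hand. In the releases I am aware of, the endpoint is /ws/v1/cluster/scheduler-conf, but treat that path as an assumption too.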
Speakers:
Jonathan Hung (LinkedIn), Xuan Gong (Hortonworks)
Jun 2017 HUG: Flexible and Scalable Compute Resource Management with Apache Hadoop YARN for Large Organizations
1. Scalable YARN Capacity Management for Large Organizations
Jonathan Hung (LinkedIn), Xuan Gong (Hortonworks)
2. Who we are
● Jonathan
○ Software engineer on Grid development team at LinkedIn
○ Primarily focusing on YARN
● Xuan
○ Software engineer at Hortonworks
○ Hadoop committer and PMC
3. Agenda
● Problem description
○ Issues with current YARN scheduler
● The Feature - dynamic scheduler reconfiguration
○ New API for easily changing scheduler configurations on the fly
○ YARN-5734
● Production use cases
● Demo
4. Scheduler background
● Cluster resources divided into queues with configurable capacity
● Each queue shared by many users
● Single file (e.g. capacity-scheduler.xml) on the local ResourceManager containing queue capacities, plus other configurations:
○ user limits
○ maximum number of applications allowed in each queue
○ ...
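To make the single-file model concrete, here is a minimal sketch that parses a small capacity-scheduler.xml fragment and sums the capacities of the queues under root. The property names follow the Capacity Scheduler's yarn.scheduler.capacity.<queue-path>.<setting> convention. The queue names and values are hypothetical, and the fragment is embedded as a string so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

PREFIX = "yarn.scheduler.capacity."

# Hypothetical two-queue layout; a real file lives on the ResourceManager host.
SAMPLE = """<configuration>
  <property><name>yarn.scheduler.capacity.root.queues</name><value>business,data</value></property>
  <property><name>yarn.scheduler.capacity.root.business.capacity</name><value>60</value></property>
  <property><name>yarn.scheduler.capacity.root.data.capacity</name><value>40</value></property>
  <property><name>yarn.scheduler.capacity.root.business.minimum-user-limit-percent</name><value>25</value></property>
  <property><name>yarn.scheduler.capacity.root.business.maximum-applications</name><value>500</value></property>
</configuration>"""


def load_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style <property> elements."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def child_capacities(props, parent):
    """Map each child queue of `parent` to its configured capacity (percent)."""
    children = [c.strip() for c in props[f"{PREFIX}{parent}.queues"].split(",")]
    return {c: float(props[f"{PREFIX}{parent}.{c}.capacity"]) for c in children}


props = load_properties(SAMPLE)
caps = child_capacities(props, "root")
total = sum(caps.values())
# Sibling capacities expressed as percentages are expected to add up to 100.
print(caps, total)
```

A check like this is the kind of validation an admin would otherwise do by eye before refreshing queues.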
6. Centralized Queue Management Issues
● All queue management is done through configuration files
○ Large cluster
○ Complex queue hierarchy
● Only the administrator can make changes
○ For every queue management operation
● Changes take effect only after invoking the queue refresh command (yarn rmadmin -refreshQueues)
● Inefficient management model
○ Only admin users can manage queues
○ Users cannot create/delete/modify their own queues
7. Queue Configuration Synchronization
● In an RM HA scenario
○ Multiple RMs
○ Queue config files must be distributed to all RMs
[Diagram: capacity-scheduler.xml copied to both the Active RM and the Standby RM]
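The synchronization problem above can be sketched as a simple drift check: hash each RM's copy of the queue config and flag any mismatch. This is only an illustration of the failure mode, not something shown in the talk. The file names are hypothetical, and temporary files stand in for the copies on the active and standby RM hosts.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def configs_in_sync(paths):
    """True when every listed config file has identical contents."""
    return len({file_digest(p) for p in paths}) == 1


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the per-RM copies of capacity-scheduler.xml.
    active = Path(tmp) / "active-capacity-scheduler.xml"
    standby = Path(tmp) / "standby-capacity-scheduler.xml"
    active.write_text("<configuration/>")
    standby.write_text("<configuration/>")
    in_sync_before = configs_in_sync([active, standby])

    # Simulate an admin editing only the active RM's copy.
    active.write_text("<configuration><!-- changed --></configuration>")
    in_sync_after = configs_in_sync([active, standby])

print(in_sync_before, in_sync_after)
```

If the copies drift like this, a failover would bring up a ResourceManager with a stale queue layout, which is why file-based configuration needs an explicit distribution step.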
17. Production Usage
● Automate user-to-queue routing for more than 4,000 users
○ yarn.scheduler.capacity.queue-mappings => business_bob:root.business
[Diagram: users business_bob and data_dave routed to queues]
● Queue owner self-management of queues
● Tune queue sizes based on demand, to improve resource utilization
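The user-to-queue routing above lends itself to generation from a lookup table. The sketch below builds a value for yarn.scheduler.capacity.queue-mappings from a user-to-queue dict, using the u:<user>:<queue> entry syntax from the Capacity Scheduler documentation (the slide's business_bob:root.business appears to be shorthand for this). The data queue for data_dave is a hypothetical pairing, and whether a full queue path or only a leaf name is accepted depends on the Hadoop version.

```python
def build_queue_mappings(user_to_queue):
    """Join user-to-queue pairs into a comma-separated queue-mappings value.

    Entries keep the dict's insertion order. That matters if mappings are
    evaluated left to right with the first match winning.
    """
    return ",".join(f"u:{user}:{queue}" for user, queue in user_to_queue.items())


# Hypothetical routing table; in practice this could come from an org directory.
mappings = build_queue_mappings({"business_bob": "business", "data_dave": "data"})
print(mappings)  # u:business_bob:business,u:data_dave:data
```

With thousands of users, generating this value programmatically and pushing it through the reconfiguration API avoids hand-editing a very long property.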