YARN Resource Management
Using Machine Learning

TrendMicro
劉一正 Tony Liu
About Me
•  劉一正 Tony Liu
•  TrendMicro Staff Engineer
•  Big Data platform Administrator
•  TSMC Big Data Consultant Project
•  Keep improving Big Data platform
•  tony_liu@trend.com.tw; ojavajava@gmail.com
Agenda
•  Questions About YARN
•  The ways to find the answers
•  YARN resource consumption prediction
•  Conclusion
Questions about YARN
YARN Fair Scheduler
•  What is the proper setting for containers?
•  What are the characteristics of the jobs run in the cluster?
•  How should resources be allocated to queues?
•  Why does the cluster have resources but still have pending jobs?
The ways to nd the answers
•  Appropriate configurations for
Container
•  CPU bound / IO bound
•  Queue resource consumption in
the cluster
•  Predict and allocate resources 
Container Setting
Job Characteristics
Proper Allocate Resource to Queue
Resource Prediction
My Thinking
Container Setting
•  Correct container setting
•  What's the primary constraint
•  Number of containers in the cluster
•  Memory calculation

Job: CPU / IO bound
•  Classify job type: CPU bound or IO bound

Queue Status
•  Queue status in the cluster
•  Allocate resources by job SLA
•  Pending jobs and unused resources in queue
•  Bottleneck resource

Prediction
•  Predict resource consumption
•  Allocate unused resources to queues according to job type
Appropriate congurations for
Container
•  Appropriate configurations for
Container
•  CPU bound / IO bound
•  Queue resource consumption in
the cluster
•  Predict and allocate resource 
Container SeAing	
Job Characteristics	
Proper Allocate
Resource to Queue 	
Resource
Prediction
Appropriate congurations for
Container
Container
•  Total available resource	
- Available vmems:
total memory – reserved memory
- Available vcores:	
total cpu – reserved cpu	
	
•  Number of YARN containers	
- concurrent processing	
min(vcores, 2 * Disks)	
	
•  RAM per container	
max(2G,
total available mem / number of containers)	
	
* reserved: 	
for system and HBase	
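The sizing rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on this slide: the function name, argument names and the example host (64 GB RAM, 16 cores, 6 disks, with 8 GB and 2 cores reserved) are hypothetical.

```python
def size_containers(total_mem_mb, reserved_mem_mb, total_cores, reserved_cores,
                    disks, min_container_mb=2048):
    """Apply the slide's sizing rules to one NodeManager host."""
    available_mem_mb = total_mem_mb - reserved_mem_mb
    available_vcores = total_cores - reserved_cores
    # Number of containers: min(available vcores, 2 * disks)
    containers = min(available_vcores, 2 * disks)
    if containers <= 0:
        raise ValueError("no vcores or disks left for containers")
    # RAM per container: max(2 GB, available memory / containers)
    ram_per_container_mb = max(min_container_mb, available_mem_mb // containers)
    return {
        "available_mem_mb": available_mem_mb,
        "available_vcores": available_vcores,
        "containers": containers,
        "ram_per_container_mb": ram_per_container_mb,
    }


# Hypothetical host: 64 GB RAM, 16 cores, 6 disks; 8 GB and 2 cores reserved.
example = size_containers(65536, 8192, 16, 2, 6)
```

For this example host the disk count is the limit (12 containers), not the vcores.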

[Diagram: YARN components configured in this section: Container, NodeManager, Scheduler, Map, Reduce, AM]
Appropriate congurations for
Container
•  yarn.nodemanager.resource.memory-mb	
= containers * RAM per container	
= total available vmems	
	
	
•  yarn.nodemanager.resource.cpu-vcores	
= total cores – reserved cores	
= total available vcores	
	
YARN NodeManager Resource
YARN
Container
Node
Manager
Scheduler
Map
Reduce	
AM
Appropriate congurations for
Container
•  yarn.scheduler.minimum-allocation-mb	
= RAM per container	
	
•  yarn.scheduler.maximum-allocation-mb	
= containers * RAM per container	
	
•  yarn.scheduler.minimum-allocation-vcores	
= 1	
	
•  yarn.scheduler.maximum-allocation-vcores	
= total available cores	
	
	
YARN Scheduler
YARN
Container
Node
Manager
Scheduler
Map
Reduce	
AM
Appropriate congurations for
Container
•  mapreduce.map.memory.mb	
= RAM per container	
	
•  mapreduce.map.java.opts	
= 0.8 * RAM per container	
	
•  mapreduce.map.cpu.vcores	
= 1	
	
•  mapreduce.map.disk	
= 0.5	
	
Map
YARN
Container
Node
Manager
Scheduler
Map
Reduce	
AM
Appropriate congurations for
Container
•  mapreduce.reduce.memory.mb	
= 2 * RAM per container	
	
•  mapreduce.reduce.java.opts	
= 0.8 * ( 2 * RAM per container)	
	
•  mapreduce.reduce.cpu.vcores	
= 1	
	
•  mapreduce.reduce.disk	
= 1.33	
	
Reduce
YARN
Container
Node
Manager
Scheduler
Map
Reduce	
AM
Appropriate congurations for
Container
•  yarn.app.mapreduce.am.resource. mb	
= 2 * RAM per container	
	
•  yarn.app.mapreduce.am.command-opts	
= 0.8 * ( 2 * RAM per container)	
	
•  yarn.app.mapreduce.am.resource.cpu-vcores	
= 1	
	
	
AM
YARN
Container
Node
Manager
Scheduler
Map
Reduce	
AM
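As a rough sketch, the NodeManager, Scheduler, Map, Reduce and AM rules from the preceding slides can be derived from three inputs. The function name and inputs are hypothetical, and rendering the 0.8 heap ratio as a JVM `-Xmx` flag is an assumption about how the value would be written in practice.

```python
def derive_settings(containers, ram_mb, available_vcores):
    """Derive the property values listed on the configuration slides."""
    def heap(container_mb):
        # 80% of the container memory, as an integer number of MB
        # (assumption: expressed as a JVM -Xmx option).
        return "-Xmx%dm" % (container_mb * 8 // 10)

    return {
        "yarn.nodemanager.resource.memory-mb": containers * ram_mb,
        "yarn.nodemanager.resource.cpu-vcores": available_vcores,
        "yarn.scheduler.minimum-allocation-mb": ram_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_mb,
        "yarn.scheduler.minimum-allocation-vcores": 1,
        "yarn.scheduler.maximum-allocation-vcores": available_vcores,
        "mapreduce.map.memory.mb": ram_mb,
        "mapreduce.map.java.opts": heap(ram_mb),
        "mapreduce.map.cpu.vcores": 1,
        "mapreduce.reduce.memory.mb": 2 * ram_mb,
        "mapreduce.reduce.java.opts": heap(2 * ram_mb),
        "mapreduce.reduce.cpu.vcores": 1,
        "yarn.app.mapreduce.am.resource.mb": 2 * ram_mb,
        "yarn.app.mapreduce.am.command-opts": heap(2 * ram_mb),
        "yarn.app.mapreduce.am.resource.cpu-vcores": 1,
    }


# Hypothetical node: 12 containers of 4096 MB each, 14 usable vcores.
settings = derive_settings(12, 4096, 14)
```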
Container Size – Memory Calculation
r = requested memory

The logic works as follows (requested = 768 MB, minimum = 512 MB, step factor = 512 MB, maximum = 1024 MB):

a. Take the max of the requested resource and the minimum resource:
   max(768, 512) = 768

b. Round up to a multiple of the step factor, using integer division:
   roundUp(768, 512) = ((768 + (512 - 1)) / 512) * 512
                     = (1279 / 512) * 512
                     = 2 * 512 = 1024

c. Take the min of the rounded value and the maximum resource:
   min(roundUp(768, 512), 1024) = min(1024, 1024) = 1024

So finally, the allotted memory is 1024 MB, which is what you are getting.
Container Size – Memory Calculation
•  A map task asks for 1500 MB of memory per map container:
   mapreduce.map.memory.mb = 1500
•  yarn.scheduler.minimum-allocation-mb = 1024
•  The RM will allocate a 2048 MB container
   (2 * yarn.scheduler.minimum-allocation-mb)
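Both memory examples follow the same three steps, which can be written as a small function. This is a sketch of the arithmetic only, not the scheduler's actual code; the 8192 MB maximum used in the second call is an assumed value, since the slide does not give one.

```python
def normalize_memory(requested_mb, minimum_mb, maximum_mb, step_mb):
    """Round a memory request the way the slides describe."""
    # a. never allocate less than the minimum
    floor = max(requested_mb, minimum_mb)
    # b. round up to the next multiple of the step factor (integer division)
    rounded = ((floor + step_mb - 1) // step_mb) * step_mb
    # c. never allocate more than the maximum
    return min(rounded, maximum_mb)


first = normalize_memory(768, 512, 1024, 512)      # 768 MB request
second = normalize_memory(1500, 1024, 8192, 1024)  # 1500 MB map container
```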
How Many Containers Launch
[Diagram: an input file is divided into splits; each split is processed by a map task in its own map container, next to the Application Master container and the reducer container]
Map tasks
•  One map task per map split (HDFS block size)
•  Data locality (node where the data is located, same rack, then any other NM)
•  The Application Master re-attempts failed tasks
•  After 4 failed attempts, the task fails
Application Master
•  Requests resources from the ResourceManager
•  If the AM stops sending heartbeats, the RM re-attempts it
•  After 2 failed attempts, the whole application fails
Reducer tasks
•  Number set by the mapred.job.reduces parameter (mapreduce.job.reduces in newer releases)
•  Reducers can be given resources before all the map tasks complete
   (mapreduce.job.reduce.slowstart.completedmaps)
•  Starting them too early wastes resources on processes that are waiting for work
•  It can also create a deadlock when resources are constrained in a shared environment
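Since each map split gets its own container, the number of map containers for one input file can be estimated from the file size and the split size. This is a simplified sketch: it assumes a split equals one HDFS block of 128 MB (an assumed default) and ignores the per-file handling and split tolerance that real input formats apply.

```python
def estimate_map_tasks(input_bytes, split_bytes=128 * 1024 * 1024):
    """Estimate map tasks for one file as ceil(input size / split size)."""
    if input_bytes <= 0:
        return 0
    # Integer ceiling division, avoids floating point rounding.
    return -(-input_bytes // split_bytes)


ten_gib = estimate_map_tasks(10 * 1024 ** 3)
```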
Observe the conguration
•  Observe which configuration is best for you through TeraGen and TeraSort
•  hadoop jar $HADOOP_PATH/hadoop-examples.jar teragen
-Dmapreduce.job.maps=$i
-Dmapreduce.map.memory.mb=$k
-Dmapreduce.map.java.opts.max.heap=$MAP_MB
•  hadoop jar $HADOOP_PATH/hadoop-examples.jar terasort
-Dmapreduce.job.maps=$i
-Dmapreduce.job.reduces=$j
-Dmapreduce.map.memory.mb=$k
-Dmapreduce.map.java.opts.max.heap=$MAP_MB
-Dmapreduce.reduce.memory.mb=$k
-Dmapreduce.reduce.java.opts.max.heap=$RED_MB
Container Resource
Requirement Testing
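One way to run such a test is to generate the TeraGen command line for every combination of map count and container memory. The sketch below only builds the argument lists. The `-D` flags are the ones shown on the slide; the row count, output directory, jar path and the 80% heap ratio are hypothetical values chosen for illustration.

```python
import itertools


def teragen_sweep(jar, map_counts, memory_mb_options, rows, out_prefix):
    """Build one TeraGen argument list per (maps, memory) combination."""
    commands = []
    for maps, mem in itertools.product(map_counts, memory_mb_options):
        heap = mem * 8 // 10  # assumption: heap is 80% of container memory
        commands.append([
            "hadoop", "jar", jar, "teragen",
            f"-Dmapreduce.job.maps={maps}",
            f"-Dmapreduce.map.memory.mb={mem}",
            f"-Dmapreduce.map.java.opts.max.heap={heap}",
            str(rows),
            f"{out_prefix}/maps{maps}_mem{mem}",
        ])
    return commands


commands = teragen_sweep("hadoop-examples.jar", [4, 8], [1024, 2048],
                         1000000, "/tmp/teragen")
```

Each list can be passed to `subprocess.run` and timed; the same pattern extends to TeraSort by adding the reduce flags.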
Job Characteristics
•  The container is the basic unit of processing capacity in YARN; it encapsulates resource elements (memory, CPU, etc.).
•  Different jobs put different workloads on the cluster, some CPU-bound and some I/O-bound.
•  So, what are the characteristics of the jobs running in the cluster?
Job Characteristics
•  Reference: Tian et al. (2009) investigated the characteristics of MapReduce jobs in a real data center
•  They define a classification model that labels MapReduce jobs as CPU-bound or I/O-bound
Job Characteristics
•  The Map-Shuffle phase performs five actions:
   1) initialize the input data
   2) compute the map task
   3) store the output result to local disk
   4) shuffle out the map task result data
   5) shuffle in the reduce input data
Job Characteristics
•  Workloads in the Map-Shuffle phase of MapReduce are classified by their I/O and CPU utilization, using:
•  MID: map input data
•  MOD: map output data
•  SOD: shuffle out data (= MOD)
•  SID: shuffle in data
•  MTCT: map task completed time
•  DIOR: disk I/O rate (DFSIO I/O rate)
•  n: number of YARN containers (concurrent processing)
Job Characteristics
•  CPU-Bound
•  I/O-Bound
•  DIOR: measured with DFSIO
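As a sketch of how such a label could be computed: the code below assumes the rule from Tian et al. (2009) compares the aggregate I/O demand n * (MID + MOD + SOD + SID) / MTCT against DIOR. That exact form is an assumption here, not taken from the slide text, and the units of each quantity must be made consistent before comparing.

```python
def classify_map_shuffle(mid, mod, sid, mtct, n, dior):
    """Label a job CPU-bound or I/O-bound (assumed form of the rule)."""
    sod = mod  # per the slide, shuffle out data equals map output data
    # I/O rate the job would need if n containers ran concurrently.
    demanded_rate = n * (mid + mod + sod + sid) / mtct
    # Below the disk I/O rate, the disks keep up and the CPU is the limit.
    return "CPU-bound" if demanded_rate < dior else "I/O-bound"


light = classify_map_shuffle(mid=100, mod=50, sid=50, mtct=10, n=4, dior=1000)
heavy = classify_map_shuffle(mid=100, mod=50, sid=50, mtct=10, n=4, dior=50)
```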
Job Characteristics
Program                    MID          MOD           MTCT
myspn_top_cve              1395184      620928        15185
myspn_top_url              54481169     52528135      9867
aggregate_url              286007534    1155960828    420225
USandbox Data Statistic    37612436     4921787       45423
file-solr-daily            75167686     4660452644    224488
aggregate_url_dedupe       639896245    561632270     73926
myspn_top_url_by_origin    499348380    506962079     53927
•  Data source: Job history log
Job Characteristics
•  Data source: Job history log
•  Test data set: 5,942
•  Test mode: split 66% train, remainder test
•  Classifier model: RandomForest
•  Attributes: MID, MOD, MTCT, n, DIOR, label
=== Summary ===
Correlation coefficient 0.9934
Mean absolute error 0.0099
Root mean squared error 0.0513
Relative absolute error 2.4872 %
Root relative squared error 11.4997 %
Total Number of Instances 2020
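The instance count in the summary is the held-out 34% of the data set. A minimal sketch of that split arithmetic, assuming the training share is rounded to the nearest instance (an assumption, chosen because it reproduces the counts reported in these slides):

```python
def percentage_split(n_instances, train_percent=66):
    """Return (train, test) sizes for a percentage split."""
    train = round(n_instances * train_percent / 100)
    return train, n_instances - train


job_split = percentage_split(5942)
```

The same arithmetic gives 37,310 test instances for 109,736 rows and 41,521 for 122,120 rows, matching the later training summaries.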
Job Characteristics
[Chart: number of jobs per queue (0 to 1200), split into IO Bound and CPU Bound]
Queue Type    Queue Name
I/O Bound     domain_census, myrep, pathcensus
CPU Bound     alps, census, census-oozie, data_importer, domain_census-oozie, domain_census_ews, hdfs, magicQ, myspn, platinum, platinum-oozie, retroscan, retrosplunk, rnu, spnungle, threatconnect, threathub, threathub-oozie, user
Thinking
•  Besides the job's SLA, what other factors should I consider when allocating resources?
   - Job characteristics?
   - Queue type?
Queue Resource Consumption
Cluster Resource Allocation
•  YARN Fair Scheduler
   - yarn.scheduler.fair.allocation.file: fair-scheduler.xml
•  The allocation file is reloaded every 10 seconds,
allowing changes to be made on the fly.
Cluster Resource Allocation
•  Fair Scheduler
- default queue: root
- Hierarchical queues
- placement policy
- preemption
- resource reserved
•  Cluster resource
- FairShare
memory: x, vcores: y
Cluster Resource Allocation
•  Queue Properties
   - minResources (soft limit)
   - maxResources (hard limit)
   - weight, e.g. <weight>1.0</weight>
   - maxRunningApps
   - schedulingPolicy: fifo, fair, or drf
[Diagram: YARN queues: Research, Production Service, Marketing Report, adhoc]
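To make the queue properties concrete, here is a small sketch that generates an allocation file from a list of queue definitions. The queue names and every resource value are hypothetical; only the element names come from the properties listed above.

```python
import xml.etree.ElementTree as ET

QUEUE_PROPERTIES = ("minResources", "maxResources", "weight",
                    "maxRunningApps", "schedulingPolicy")


def build_allocations(queues):
    """Render queue definitions as fair-scheduler.xml content."""
    root = ET.Element("allocations")
    for queue in queues:
        elem = ET.SubElement(root, "queue", name=queue["name"])
        for key in QUEUE_PROPERTIES:
            if key in queue:
                ET.SubElement(elem, key).text = str(queue[key])
    return ET.tostring(root, encoding="unicode")


# Hypothetical queues and limits, for illustration only.
allocation_xml = build_allocations([
    {"name": "production", "minResources": "8192 mb, 8 vcores",
     "maxResources": "32768 mb, 32 vcores", "weight": 2.0,
     "schedulingPolicy": "drf"},
    {"name": "adhoc", "weight": 1.0, "maxRunningApps": 10,
     "schedulingPolicy": "fair"},
])
```

Generating the file from data makes it easier to change queue limits in a controlled way instead of editing the XML by hand.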
Analyze Cluster Status
•  Retrieve YARN metrics from YARN REST APIs
•  FileSystemCounter
•  JobCounters
•  Task Counters
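A minimal sketch of pulling cluster metrics from the ResourceManager REST API (`/ws/v1/cluster/metrics`) and reducing them to the values charted in this section. The ResourceManager address passed to `fetch_cluster_metrics` and the sample payload are hypothetical; the sample numbers are chosen only to exercise the code.

```python
import json
import urllib.request


def fetch_cluster_metrics(rm_url):
    """GET the cluster metrics document from a ResourceManager."""
    url = rm_url.rstrip("/") + "/ws/v1/cluster/metrics"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def summarize(payload):
    """Extract pending apps, free vcores and usage ratios."""
    metrics = payload["clusterMetrics"]
    return {
        "appsPending": metrics["appsPending"],
        "availableVCores": metrics["availableVirtualCores"],
        "vcoreUsage": metrics["allocatedVirtualCores"] / metrics["totalVirtualCores"],
        "memoryUsage": metrics["allocatedMB"] / metrics["totalMB"],
    }


# Hypothetical payload; a live one would come from
# fetch_cluster_metrics("http://resourcemanager.example:8088").
sample = {"clusterMetrics": {
    "appsPending": 3, "availableVirtualCores": 1,
    "allocatedVirtualCores": 199, "totalVirtualCores": 200,
    "allocatedMB": 415, "totalMB": 1000}}
summary = summarize(sample)
```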
Pending apps and Available Vcore
[Chart: appsPending and availableVCores over time, 03:20 to 10:50 in 15-minute steps; y-axis: Vcore, 0% to 100%]
Vcores Utilization
[Chart: used_vCores against total_vCores over time, 03:20 to 10:50 in 15-minute steps; y-axis: Vcore, 0% to 100%]
Vmemory Utilization
[Chart: used_memory against total_memory over time, 03:20 to 10:50 in 15-minute steps; y-axis: Vmemory, 0% to 100%]
Cluster Resource Utilization
Queue
Bottleneck Resource
•  Vcores become the bottleneck resource
   Memory usage: 41.5%
   VCores usage: 99.5%
Over Fair Share
•  Cluster still has resources
Over Fair Share
Thinking
•  Why can't the cluster's resources be fully utilized?
•  Is there a resource limitation (a bottleneck)?
•  How can pending jobs be reduced when the cluster still has resources?
Thinking
•  Is it possible to predict when the cluster will have pending jobs?
•  Can I predict resource consumption at a specific time and allocate resources dynamically to fully utilize the cluster?
Predict Resource Consumption
And Allocate Resource 
YARN resource consumption prediction
[Diagram: prediction workflow. Collect metrics, data processing (pre-processing), train the model, evaluate RMSE, then use the model to predict queue consumption]
Training Data
Field                     Description                           Process
date                      date                                  ignore
time                      hour: 0 ~ 23                          feature
working day               0: working day, 1: non-working day    feature
weekday                   week day                              feature
cluster_appsPending       Pending apps in the cluster           feature
cluster_appsRunning       Running apps in the cluster           feature
cluster_availableMB       Available vmem in the cluster         feature
cluster_allocatedMB       Allocated vmem in the cluster         feature
cluster_availableVcore    Available vcore in the cluster        feature
cluster_allocatedVcore    Allocated vcore in the cluster        feature
•  Data source: Job history log
Training Data
Field                   Description               Process
queue_name              Queue name                feature
minResources_memory     Min vmem for queue        feature
minResources_vcores     Min vcore for queue       feature
maxResources_memory     Max vmem for queue        feature
maxResources_vcores     Max vcore for queue       feature
numPendingApps          Pending apps in queue     feature
numActiveApps           Running apps in queue     feature
usedResources.memory    Used vmem in queue        feature
usedResources.vcore     Used vcore in queue       feature
label                   label (predict target)    label
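A sketch of how one training row could be assembled from a timestamp plus cluster and queue metrics. The helper names are hypothetical, and two encodings are assumptions: weekends (and any dates passed in `holidays`) count as non-working days, and `weekday` uses Python's Monday = 0 numbering.

```python
from datetime import datetime


def time_features(ts, holidays=frozenset()):
    """Derive the time-based fields from a sample timestamp."""
    non_working = ts.weekday() >= 5 or ts.date() in holidays
    return {
        "time": ts.hour,                  # hour: 0 ~ 23
        "working_day": int(non_working),  # 0: working day, 1: non-working day
        "weekday": ts.weekday(),          # assumption: Monday = 0
    }


def build_row(ts, cluster, queue, label):
    """Flatten one sample into a feature row; the date itself is ignored."""
    row = time_features(ts)
    row.update({"cluster_" + key: value for key, value in cluster.items()})
    row.update(queue)
    row["label"] = label
    return row


# Hypothetical sample taken on a Saturday at 03:20.
row = build_row(datetime(2016, 9, 10, 3, 20),
                {"appsPending": 2, "availableMB": 1024},
                {"queue_name": "myspn", "numPendingApps": 1},
                label=12)
```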
Training Model
•  Training Model: Linear Regression
•  Predict: vcore
Training Model
•  Training model: RandomForest
•  Predict: vcore
•  Data source: Job history log
•  Test data set: 109,736
•  Test mode: split 66% train, remainder test
•  Attributes: 19
=== Summary ===
Correlation coefficient 0.999
Mean absolute error 0.1262
Root mean squared error 0.8494
Relative absolute error 1.5905 %
Root relative squared error 4.5017 %
Total Number of Instances 37,310
Training Model
•  Training Model: Linear Regression
•  Predict: vmemory
Training Model
•  Training model: RandomForest
•  Predict: vmemory
•  Data source: Job history log
•  Test data set: 109,736
•  Test mode: split 66% train, remainder test
•  Attributes: 19
=== Summary ===
Correlation coefficient 0.9995
Mean absolute error 0.0003
Root mean squared error 0.0019
Relative absolute error 1.4174 %
Root relative squared error 3.2014 %
Total Number of Instances 37,310
Training Model
•  Training model: RandomForest
•  Predict: Pending job
•  Data source: Job history log
•  Test data set: 122,120
•  Test mode: split 66% train, remainder test
•  Attributes: 19
=== Summary ===
Correlation coefficient 0.9917
Mean absolute error 0.0002
Root mean squared error 0.0054
Relative absolute error 7.9308 %
Root relative squared error 14.4934 %
Total Number of Instances 41,521
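For readers who want to reproduce the error figures in these summaries on their own predictions, the metrics can be computed as below. This sketch uses the textbook definitions, with the mean of the actual test values as the baseline for the two relative errors; the tool that produced the summaries may use a slightly different baseline.

```python
import math


def regression_summary(actual, predicted):
    """Compute MAE, RMSE and the two relative errors (in percent)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Baseline: always predicting the mean of the actual values.
    baseline_abs = sum(abs(a - mean_actual) for a in actual)
    baseline_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_absolute_error": sum(abs_errors) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_errors) / n),
        "relative_absolute_error": 100 * sum(abs_errors) / baseline_abs,
        "root_relative_squared_error": 100 * math.sqrt(sum(sq_errors) / baseline_sq),
    }


result = regression_summary([1, 2, 3, 4], [1, 2, 3, 5])
```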
Attribute Evaluation
•  Predict: Pending jobs
•  Attribute Evaluator: Information Gain
•  Ranked attributes:

Attribute              Score
maxResource_memory     1.14465
maxResource_vcore      1.04186
usedResource_memory    0.53004
usedResource_vcore     0.51167
minResource_memory     0.47563
numActiveApps          0.34418
minResource_vcore      0.3179
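Information gain measures how much knowing an attribute reduces the entropy of the target. A minimal sketch for discrete attribute values follows; numeric attributes such as the ones ranked above would first need to be discretized, which is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(values, labels):
    """H(label) - H(label | attribute) for one discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional


# An attribute that separates the two classes perfectly gains 1 bit.
perfect = information_gain(["a", "a", "b", "b"], [0, 0, 1, 1])
# An attribute with a single value tells us nothing.
useless = information_gain(["a", "a", "a", "a"], [0, 0, 1, 1])
```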
Experiment Result
•  Based on the prediction results, we reallocated resources to the queues that were likely to have pending jobs on specific weekdays.
•  Experiment result: pending jobs reduced by 82%

Pending jobs ratio
Before    0.005
After     0.0009
Experiment Result
•  Some things you should know:
   - The sum of the queues' minResources should be less than the cluster fair share
   - A queue may not get its minResources immediately
   - Preemption kills containers in other queues to satisfy minResources, which also wastes resources
Experiment Result
•  Some things you should know:
   - Modifying fair-scheduler.xml too frequently may make the ResourceManager behave erratically
   - A ResourceManager failover causes jobs submitted by Oozie to be retried
   - Does a cluster with tight resources need resource prediction?
Conclusion
•  A deep understanding of the architecture is the key to tuning and management.
•  Ask whether other tools, even from a different domain, could help with your daily job.
•  Machine learning is used for prediction in many domains; it can give you a different perspective.
Q & A
Thank You

Mais conteĂşdo relacionado

Mais procurados

Mais procurados (20)

Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache Twill
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Apache Spark Overview part2 (20161117)
Apache Spark Overview part2 (20161117)Apache Spark Overview part2 (20161117)
Apache Spark Overview part2 (20161117)
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 

Destaque

Destaque (13)

Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environment
 
2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security
 
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
 
2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈
 
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
BI in Xuenn
BI in XuennBI in Xuenn
BI in Xuenn
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System OverviewApache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic StackHadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
 
Log Event Stream Processing In Flink Way
Log Event Stream Processing In Flink WayLog Event Stream Processing In Flink Way
Log Event Stream Processing In Flink Way
 

Semelhante a Yarn Resource Management Using Machine Learning

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
Subhas Kumar Ghosh
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014
Tsuyoshi OZAWA
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Rose Toomey
 

Semelhante a Yarn Resource Management Using Machine Learning (20)

Yarn
YarnYarn
Yarn
 
Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-ting
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
isca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxisca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptx
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
 
Geek Sync | Performance Tune Like an MVP
Geek Sync | Performance Tune Like an MVPGeek Sync | Performance Tune Like an MVP
Geek Sync | Performance Tune Like an MVP
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AXA XL - Insurer Innovation Award Americas 2024

Yarn Resource Management Using Machine Learning

  • 1. YARN Resource Management Using Machine Learning TrendMicro 劉一正 Tony Liu
  • 2. About Me •  劉一正 Tony Liu •  TrendMicro Staff Engineer •  Big Data platform Administrator •  TSMC Big Data Consultant Project •  Keep improving Big Data platform •  tony_liu@trend.com.tw; ojavajava@gmail.com
  • 3. Agenda •  Questions About YARN •  The ways to find the answers •  YARN resource consumption prediction •  Conclusion
  • 4. Questions about YARN (Fair Scheduler) •  What is the proper setting for a container? •  What are the characteristics of the jobs that run in the cluster? •  How to properly allocate resources to queues? •  Why does the cluster have free resources but still have pending jobs?
  • 5. The ways to find the answers •  Appropriate configurations for Container •  CPU bound / IO bound •  Queue resource consumption in the cluster •  Predict and allocate resources [Diagram labels: Container Setting, Job Characteristics, Proper Allocate Resource to Queue, Resource Prediction]
  • 6. My Thinking •  Container Setting, Job CPU / IO bound: correct container setting; what are the primary constraints; number of containers in the cluster; memory calculation •  Queue Status: queue status in the cluster; allocate resources by job SLA; pending jobs and unused resources in queue; bottleneck resource •  Prediction: classify job type (CPU bound or IO bound); predict resource consumption; allocate unused resources to queues according to job type
  • 7. Appropriate configurations for Container •  Appropriate configurations for Container •  CPU bound / IO bound •  Queue resource consumption in the cluster •  Predict and allocate resource [Diagram labels: Container Setting, Job Characteristics, Proper Allocate Resource to Queue, Resource Prediction]
  • 8. Appropriate congurations for Container Container •  Total available resource - Available vmems: total memory – reserved memory - Available vcores: total cpu – reserved cpu •  Number of YARN containers - concurrent processing min(vcores, 2 * Disks) •  RAM per container max(2G, total available mem / number of containers) * reserved: for system and HBase YARN Container Node Manager Scheduler Map Reduce AM
  • 9. Appropriate congurations for Container •  yarn.nodemanager.resource.memory-mb = containers * RAM per container = total available vmems •  yarn.nodemanager.resource.cpu-vcores = total cores – reserved cores = total available vcores YARN NodeManager Resource YARN Container Node Manager Scheduler Map Reduce AM
  • 10. Appropriate congurations for Container •  yarn.scheduler.minimum-allocation-mb = RAM per container •  yarn.scheduler.maximum-allocation-mb = containers * RAM per container •  yarn.scheduler.minimum-allocation-vcores = 1 •  yarn.scheduler.maximum-allocation-vcores = total available cores YARN Scheduler YARN Container Node Manager Scheduler Map Reduce AM
  • 11. Appropriate congurations for Container •  mapreduce.map.memory.mb = RAM per container •  mapreduce.map.java.opts = 0.8 * RAM per container •  mapreduce.map.cpu.vcores = 1 •  mapreduce.map.disk = 0.5 Map YARN Container Node Manager Scheduler Map Reduce AM
  • 12. Appropriate congurations for Container •  mapreduce.reduce.memory.mb = 2 * RAM per container •  mapreduce.reduce.java.opts = 0.8 * ( 2 * RAM per container) •  mapreduce.reduce.cpu.vcores = 1 •  mapreduce.reduce.disk = 1.33 Reduce YARN Container Node Manager Scheduler Map Reduce AM
  • 13. Appropriate congurations for Container •  yarn.app.mapreduce.am.resource. mb = 2 * RAM per container •  yarn.app.mapreduce.am.command-opts = 0.8 * ( 2 * RAM per container) •  yarn.app.mapreduce.am.resource.cpu-vcores = 1 AM YARN Container Node Manager Scheduler Map Reduce AM
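
The sizing rules on slides 8 to 13 can be collected into one small calculation. The following is a minimal sketch, not the author's tooling: the function name `container_settings` and the example node (128 GB RAM, 24 cores, 8 disks, 16 GB and 2 cores reserved) are assumptions made for illustration, and the heap values are rendered as `-Xmx` options because that is how the `java.opts` properties are normally written.

```python
def container_settings(total_mem_mb, reserved_mem_mb, total_cores, reserved_cores, disks):
    """Apply the slide formulas to one NodeManager host (all memory values in MB)."""
    avail_mem_mb = total_mem_mb - reserved_mem_mb      # available vmems
    avail_vcores = total_cores - reserved_cores        # available vcores
    containers = min(avail_vcores, 2 * disks)          # concurrent containers
    ram_mb = max(2048, avail_mem_mb // containers)     # RAM per container, at least 2G

    def heap(mb):
        # Heap is 0.8 of the container size, using integer arithmetic.
        return f"-Xmx{mb * 8 // 10}m"

    return {
        "yarn.nodemanager.resource.memory-mb": containers * ram_mb,
        "yarn.nodemanager.resource.cpu-vcores": avail_vcores,
        "yarn.scheduler.minimum-allocation-mb": ram_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_mb,
        "yarn.scheduler.minimum-allocation-vcores": 1,
        "yarn.scheduler.maximum-allocation-vcores": avail_vcores,
        "mapreduce.map.memory.mb": ram_mb,
        "mapreduce.map.java.opts": heap(ram_mb),
        "mapreduce.reduce.memory.mb": 2 * ram_mb,
        "mapreduce.reduce.java.opts": heap(2 * ram_mb),
        "yarn.app.mapreduce.am.resource.mb": 2 * ram_mb,
        "yarn.app.mapreduce.am.command-opts": heap(2 * ram_mb),
    }


if __name__ == "__main__":
    # Hypothetical node: 128 GB RAM, 24 cores, 8 disks.
    for key, value in container_settings(131072, 16384, 24, 2, 8).items():
        print(f"{key} = {value}")
```

Note that when the 2G floor applies, containers * RAM per container can exceed the available memory, so the result should be checked against the real node before use.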
  • 14. Container Size – Memory Calculation •  r = requested memory. The logic works as follows: a. Take the max of the requested resource and the minimum resource: max(768, 512) = 768 b. Round up to a multiple of the step factor: roundUp(768, 512) = ((768 + (512 - 1)) / 512) * 512 = (1279 / 512) * 512 = 2 * 512 = 1024 (integer division) c. Cap at the maximum resource: min(roundUp(768, stepFactor), maximum resource) = min(1024, 1024) = 1024 •  So finally, the allotted memory is 1024 MB, which is what you are getting.
  • 15. Container Size – Memory Calculation •  A map task asks for 1500 MB of memory per map container: mapreduce.map.memory.mb = 1500 •  yarn.scheduler.minimum-allocation-mb = 1024 •  The RM will allocate a 2048 MB container (2 * yarn.scheduler.minimum-allocation-mb) [Diagram labels: Map Container, Map Task]
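
The three steps of the memory calculation (take the max with the minimum, round up to the step factor, cap at the maximum) can be sketched as a single function. This is an illustration of the arithmetic only, not the scheduler's actual code; the name `normalize_memory` is hypothetical, and the 8192 MB maximum in the second example is an assumed value that does not appear on the slides.

```python
def normalize_memory(requested_mb, minimum_mb, step_mb, maximum_mb):
    """Return the container size granted for a memory request (all values in MB)."""
    floor = max(requested_mb, minimum_mb)                    # step a
    rounded = ((floor + step_mb - 1) // step_mb) * step_mb   # step b, integer division
    return min(rounded, maximum_mb)                          # step c


if __name__ == "__main__":
    # Slide 14: request 768 MB with minimum 512, step 512, maximum 1024.
    print(normalize_memory(768, 512, 512, 1024))
    # Slide 15: request 1500 MB with minimum and step 1024 (maximum assumed).
    print(normalize_memory(1500, 1024, 1024, 8192))
```

The second call assumes the step factor equals yarn.scheduler.minimum-allocation-mb, which is what the "2 * minimum" result on slide 15 implies.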
  • 16. How Many Containers Launch •  Map split (HDFS block size) •  Data locality (node-local, rack-local, any other NM) •  The Application Master re-attempts failed tasks: a task that fails 4 times is marked as failed •  The AM requests resources from the Resource Manager •  If the AM stops sending heartbeats, the RM re-attempts it: after 2 failures the whole application fails •  Reducer tasks: mapred.job.reduces parameter •  Reducers can be given resources before all the map tasks complete (mapreduce.job.reduce.slowstart.completedmaps) •  This can waste resources on processes that are waiting for work •  It can also create a deadlock when resources are constrained in a shared environment [Diagram labels: Input file, Map Container, Map Task, Reducer Container, Reducer Task, Application Master Container]
  • 17. Observe the conguration •  Observe which configuration is best for you through TeraGen and TeraSort •  hadoop jar $HADOOP_PATH/hadoop-examples.jar teragen -Dmapreduce.job.maps=$i -Dmapreduce.map.memory.mb=$k -Dmapreduce.map.java.opts.max.heap=$MAP_MB •  hadoop jar $HADOOP_PATH/hadoop-examples.jar terasort -Dmapreduce.job.maps=$i -Dmapreduce.job.reduces=$j -Dmapreduce.map.memory.mb=$k -Dmapreduce.map.java.opts.max.heap=$MAP_MB -Dmapreduce.reduce.memory.mb=$k -Dmapreduce.reduce.java.opts.max.heap=$RED_MB
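
The variables $i and $k in the TeraGen command suggest a sweep over map counts and container sizes. A minimal sketch of generating such a sweep follows; it only builds the command lines and does not run them. The row count and output directories are assumptions (the slide omits the positional arguments that teragen expects), the heap is taken as 0.8 of the container size as on slide 11, and the helper names are hypothetical.

```python
import itertools


def teragen_command(jar, rows, out_dir, maps, map_mb, heap_mb):
    """Build one TeraGen invocation with the -D options shown on the slide."""
    return [
        "hadoop", "jar", jar, "teragen",
        f"-Dmapreduce.job.maps={maps}",
        f"-Dmapreduce.map.memory.mb={map_mb}",
        f"-Dmapreduce.map.java.opts.max.heap={heap_mb}",
        str(rows),   # assumed: number of rows to generate
        out_dir,     # assumed: HDFS output directory
    ]


def teragen_sweep(jar, rows, out_root, map_counts, memory_sizes_mb):
    """Yield one command per (map count, container size) combination."""
    for maps, map_mb in itertools.product(map_counts, memory_sizes_mb):
        heap_mb = map_mb * 8 // 10
        out_dir = f"{out_root}/maps{maps}_mem{map_mb}"
        yield teragen_command(jar, rows, out_dir, maps, map_mb, heap_mb)


if __name__ == "__main__":
    for cmd in teragen_sweep("hadoop-examples.jar", 1000000, "/tmp/teragen", [4, 8], [1024, 2048]):
        print(" ".join(cmd))
```

Each printed line can then be timed on the cluster to compare configurations.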
  • 18. Container Resource Requirement Testing •  Appropriate configurations for Container •  CPU bound / IO bound •  Queue resource consumption in the cluster •  Predict and allocate resource Container SeAing Job Characteristics Proper Allocate Resource to Queue Resource Prediction
  • 19. Job Characteristics •  A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, CPU, etc.). •  Different jobs put different workloads on the cluster, including CPU-bound and I/O-bound workloads. •  So, what are the characteristics of the jobs running in the cluster?
  • 20. Job Characteristics •  Reference: Tian et al., 2009, investigate the characteristics of MapReduce jobs in a practical data center •  They define a classification model that classifies MapReduce jobs as CPU-bound or I/O-bound
  • 21. Job Characteristics •  The Map-Shuffle phase performs five actions: 1) init input data 2) compute map task 3) store output result to local disk 4) shuffle map task result data out 5) shuffle reduce input data in
  • 22. Job Characteristics •  Workloads are classified by their I/O and CPU utilization in the Map-Shuffle phase of MapReduce •  MID: map input data •  MOD: map output data •  SOD: shuffle out data (= MOD) •  SID: shuffle in data •  MTCT: map task completed time •  DIOR: disk I/O rate (DFSIO I/O rate) •  n: number of YARN containers (concurrent processing)
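
The transcript does not include the decision rule that combines these quantities. One plausible reading, offered here purely as an assumption, is to compare the aggregate I/O throughput of n concurrent map tasks against the disk I/O rate: a job whose tasks together demand at least DIOR is treated as I/O-bound, otherwise as CPU-bound. The sketch below encodes that assumed rule; the function name and the example numbers are hypothetical, and the units of the data sizes, MTCT and DIOR must be made consistent by the caller.

```python
def classify_job(mid, mod, sid, mtct, dior, n):
    """Classify a job as CPU-bound or I/O-bound under the assumed throughput rule."""
    sod = mod                                  # shuffle out data equals map output data
    io_demand = n * (mid + mod + sod + sid) / mtct
    return "I/O-bound" if io_demand >= dior else "CPU-bound"


if __name__ == "__main__":
    # Made-up numbers in consistent units (for example MB, seconds, MB/s).
    print(classify_job(mid=100, mod=50, sid=50, mtct=10, dior=1000, n=4))
    print(classify_job(mid=1000, mod=1000, sid=1000, mtct=1, dior=1000, n=1))
```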
  • 24. Job Characteristics •  Data source: Job history log
        Program                          MID         MOD    MTCT
        myspn_top_cve                1395184      620928   15185
        myspn_top_url               54481169    52528135    9867
        aggregate_url              286007534  1155960828  420225
        USandbox Data Statistic     37612436     4921787   45423
        file-solr-daily             75167686  4660452644  224488
        aggregate_url_dedupe       639896245   561632270   73926
        myspn_top_url_by_origin    499348380   506962079   53927
  • 25. Job Characteristics •  Data source: Job history log •  Test data set: 5,942 •  Test mode: split 66% train, remainder test •  Classifier model: RandomForest •  Attributes: MID, MOD, MTCT, n, dior, label
        === Summary ===
        Correlation coefficient          0.9934
        Mean absolute error              0.0099
        Root mean squared error          0.0513
        Relative absolute error          2.4872 %
        Root relative squared error     11.4997 %
        Total Number of Instances     2020
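
For readers unfamiliar with the error figures in the summary, the sketch below shows how such metrics are commonly computed from actual and predicted values. It is a simplified illustration, not the tool that produced the numbers above: the relative errors here use the mean of the actual test values as the baseline predictor, which is an assumption, and the function name is hypothetical.

```python
import math


def regression_summary(actual, predicted):
    """Compute MAE, RMSE and the two relative errors (in percent)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Baseline: always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_absolute_error": sum(abs_err) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_err) / n),
        "relative_absolute_error_pct": 100 * sum(abs_err) / base_abs,
        "root_relative_squared_error_pct": 100 * math.sqrt(sum(sq_err) / base_sq),
    }


if __name__ == "__main__":
    print(regression_summary([1, 2, 3, 4], [1, 2, 3, 5]))
```

A relative error well below 100 % means the model does much better than always predicting the mean.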
  • 27. Queue Type •  I/O Bound: domain_census, myrep, pathcensus •  CPU Bound: alps, census, census-oozie, data_importer, domain_census-oozie, domain_census_ews, hdfs, magicQ, myspn, platinum, platinum-oozie, retroscan, retrosplunk, rnu, spnungle, threatconnect, threathub, threathub-oozie, user
  • 28. Thinking •  Besides allocating resources based on a job's SLA, what other factors should I consider? - Job characteristics? - Queue type?
  • 29. Queue Resource Consumption •  Appropriate configurations for Container •  CPU bound / IO bound •  Queue resource consumption in the cluster •  Predict and allocate resource [Diagram labels: Container Setting, Job Characteristics, Proper Allocate Resource to Queue, Resource Prediction]
  • 30. Cluster Resource Allocation •  YARN fair scheduler - yarn.scheduler.fair.allocation.file: fair-scheduler.xml •  The allocation file is reloaded every 10 seconds, allowing changes to be made on the fly.
  • 31. Cluster Resource Allocation •  Fair Scheduler - default queue: root - Hierarchical queues - placement policy - preemption - resource reserved •  Cluster resource - FairShare memory: x, vcores: y
  • 32. Cluster Resource Allocation •  Queue Properties - minResources (soft limit) - maxResources (hard limit) - weight: <weight>1.0</weight> - maxRunningApps - schedulingPolicy: fifo, fair, drf •  Queues under YARN: Research, Production, Service, Marketing, Report, adhoc
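
Because the allocation file is reloaded on the fly, queue limits can be regenerated by a script. The following is a minimal sketch of producing a fair-scheduler.xml fragment with the queue properties listed above; the queue names and limits are made-up examples, the function name is hypothetical, and a real file would usually carry further elements (placement rules, preemption timeouts) that are omitted here.

```python
import xml.etree.ElementTree as ET


def build_allocations(queues):
    """Render a list of queue dicts as a fair-scheduler allocation document."""
    root = ET.Element("allocations")
    for q in queues:
        el = ET.SubElement(root, "queue", name=q["name"])
        ET.SubElement(el, "minResources").text = f'{q["min_mb"]} mb,{q["min_vcores"]} vcores'
        ET.SubElement(el, "maxResources").text = f'{q["max_mb"]} mb,{q["max_vcores"]} vcores'
        ET.SubElement(el, "weight").text = str(q["weight"])
        ET.SubElement(el, "maxRunningApps").text = str(q["max_running_apps"])
        ET.SubElement(el, "schedulingPolicy").text = q["policy"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical queue definition for illustration.
    example = [{
        "name": "Production", "min_mb": 20480, "min_vcores": 10,
        "max_mb": 81920, "max_vcores": 40, "weight": 2.0,
        "max_running_apps": 50, "policy": "drf",
    }]
    print(build_allocations(example))
```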
  • 33. Analysis Cluster Status •  Retrieve YARN metrics from YARN REST APIs •  FileSystemCounter •  JobCounters •  Task Counters
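
As a minimal sketch of the metrics retrieval, the code below reads the ResourceManager's cluster metrics endpoint (`/ws/v1/cluster/metrics`) and derives memory and vcore usage from it. The ResourceManager address, the function names and the choice of derived fields are assumptions for illustration; the sample payload in the example is fabricated test data, not output from the author's cluster.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """GET the cluster metrics document from a ResourceManager, e.g. http://rm-host:8088."""
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics") as resp:
        return json.load(resp)


def summarize(payload):
    """Derive usage percentages and the tighter resource from a metrics payload."""
    m = payload["clusterMetrics"]
    mem_pct = 100 * m["allocatedMB"] / m["totalMB"]
    vcore_pct = 100 * m["allocatedVirtualCores"] / m["totalVirtualCores"]
    return {
        "appsPending": m["appsPending"],
        "memory_usage_pct": mem_pct,
        "vcores_usage_pct": vcore_pct,
        "bottleneck": "vcores" if vcore_pct > mem_pct else "memory",
    }


if __name__ == "__main__":
    # Fabricated sample payload; replace with fetch_cluster_metrics("http://rm-host:8088").
    sample = {"clusterMetrics": {
        "appsPending": 3, "allocatedMB": 415, "totalMB": 1000,
        "allocatedVirtualCores": 199, "totalVirtualCores": 200,
    }}
    print(summarize(sample))
```

Polling this at a fixed interval and storing the rows gives a time series of pending apps and available resources.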
  • 34. Pending apps and Available Vcore [Chart: appsPending and availableVCores (0% to 100% of vcores) over time, 03:20 to 10:50 in 15-minute steps]
  • 40. Bottleneck Resource •  Vcores become the bottleneck resource: Memory Usage: 41.5%, VCores Usage: 99.5%
  • 41. Over Fair Share •  Cluster still has resources
  • 43. Thinking •  Why can't the cluster's resources be fully utilized? •  Is there any resource limitation (bottleneck)? •  How to reduce pending jobs when the cluster still has resources?
  • 44. Thinking •  Is it possible to predict when the cluster will have pending jobs? •  Can I predict the resource consumption at a specific time and dynamically allocate resources to fully utilize the cluster?
  • 45. Predict Resource Consumption And Allocate Resource •  Appropriate configurations for Container •  CPU bound / IO bound •  Queue resource consumption in the cluster •  Predict and allocate resource [Diagram labels: Container Size, Job Characteristics, Proper Allocate Resource to Queue, Resource Prediction]
  • 46. YARN resource consumption prediction •  Collect Metrics •  Data Processing: pre-processing •  Training Model: train model, evaluate RMSE •  Model Prediction: predict queue consumption
  • 47. Training Data •  Data source: Job history log
        Field                    Description                          Process
        date                     date                                 Ignore
        time                     hour: 0 ~ 23                         feature
        working day              0: working day, 1: non-working day   feature
        weekday                  week day                             feature
        cluster_appsPending      Pending apps in the cluster          feature
        cluster_appsRunning      Running apps in the cluster          feature
        cluster_availableMB      Available vmem in the cluster        feature
        cluster_allocatedMB      Allocated vmem in the cluster        feature
        cluster_availableVcore   Available vcore in the cluster       feature
        cluster_allocatedVcore   Allocated vcore in the cluster       feature
  • 48. Training Data
        Field                    Description                          Process
        queue_name               Queue name                           feature
        minResources_memory      Min vmem for queue                   feature
        minResources_vcores      Min vcore for queue                  feature
        maxResources_memory      Max vmem for queue                   feature
        maxResources_vcores      Max vcore for queue                  feature
        numPendingApps           Pending apps in queue                feature
        numActiveApps            Running apps in queue                feature
        usedResources.memory     Used vmem in queue                   feature
        usedResources.vcore      Used vcore in queue                  feature
        label                    label (predict target)               label
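
The calendar fields in the training data (time, working day, weekday) can be derived from the sample timestamp. A minimal sketch follows, under two stated assumptions: working days are Monday to Friday with public holidays ignored, and weekdays are numbered 0 (Monday) to 6 (Sunday). Neither encoding is specified on the slides, and the function name is hypothetical.

```python
from datetime import datetime


def time_features(ts):
    """Derive the calendar features of one training row from a timestamp."""
    weekday = ts.weekday()                       # 0 = Monday ... 6 = Sunday (assumed encoding)
    return {
        "time": ts.hour,                         # hour: 0 ~ 23
        "working_day": 0 if weekday < 5 else 1,  # 0: working day, 1: non-working day
        "weekday": weekday,
    }


if __name__ == "__main__":
    print(time_features(datetime(2024, 1, 1, 3, 20)))   # a Monday
    print(time_features(datetime(2024, 1, 6, 14, 5)))   # a Saturday
```

These values would be merged with the cluster and queue metrics of the same sample to form one row.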
  • 49. Training Model •  Training Model: Linear Regression •  Predict: vcore
  • 50. Training Model •  Training model: RandomForest •  Predict: vcore •  Data source: Job history log •  Test data set: 109,736 •  Test mode: split 66% train, remainder test •  Attributes: 19
        === Summary ===
        Correlation coefficient          0.999
        Mean absolute error              0.1262
        Root mean squared error          0.8494
        Relative absolute error          1.5905 %
        Root relative squared error      4.5017 %
        Total Number of Instances    37,310
  • 51. Training Model •  Training Model: Linear Regression •  Predict: vmemory
  • 52. Training Model •  Training model: RandomForest •  Predict: vmemory •  Data source: Job history log •  Test data set: 109,736 •  Test mode: split 66% train, remainder test •  Attributes: 19
        === Summary ===
        Correlation coefficient          0.9995
        Mean absolute error              0.0003
        Root mean squared error          0.0019
        Relative absolute error          1.4174 %
        Root relative squared error      3.2014 %
        Total Number of Instances    37,310
  • 53. Training Model •  Training model: RandomForest •  Predict: Pending job •  Data source: Job history log •  Test data set: 122,120 •  Test mode: split 66% train, remainder test •  Attributes: 19
        === Summary ===
        Correlation coefficient          0.9917
        Mean absolute error              0.0002
        Root mean squared error          0.0054
        Relative absolute error          7.9308 %
        Root relative squared error     14.4934 %
        Total Number of Instances    41,521
  • 54. Attribute Evaluation •  Predict: Pending jobs •  Attribute Evaluator: Information Gain •  Ranked attributes:
        Attribute               Score
        maxResource_memory      1.14465
        maxResource_vcore       1.04186
        usedResource_memory     0.53004
        usedResource_vcore      0.51167
        minResource_memory      0.47563
        numActiveApps           0.34418
        minResource_vcore       0.3179
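
Information gain measures how much knowing an attribute reduces the entropy of the target. The sketch below computes it for discrete values only, as an illustration of the idea; it is not the evaluator used for the ranking above, which would also need to discretize numeric attributes, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    pending = [0, 0, 1, 1]                                       # toy target classes
    print(information_gain(["a", "a", "b", "b"], pending))       # attribute separates the classes
    print(information_gain(["a", "b", "a", "b"], pending))       # attribute carries no information
```

Scores above 1 bit, as in the ranking, are possible only when the target has more than two classes.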
  • 55. Experiment Result •  According to the prediction result, we reallocate the resources of the queues that may have pending jobs on specific weekdays. •  Experiment result: pending jobs reduced by 82% •  Pending jobs ratio: Before 0.005, After 0.0009
  • 56. Experiment Result •  Something you should know: - The total of the queues' minResources should be less than the cluster fair share - A queue may not get its minResources immediately - Preemption kills containers in other queues to satisfy minResources, which also means wasted resources
  • 57. Experiment Result •  Something you should know: - Modifying fair-scheduler.xml too frequently may make the ResourceManager behave erratically - A ResourceManager failover causes the jobs submitted by Oozie to be retried - Does a cluster with tight resources need resource prediction?
  • 58. Conclusion •  A deep understanding of the architecture is the key to tuning and management. •  Ask whether other tools, even from a different domain, could help with your daily job. •  Machine learning has been used for prediction in many domains; it can definitely give you a different perspective.
  • 59. Q & A