OPTIMAL RESOURCE
PROVISIONING FOR RUNNING
MAPREDUCE PROGRAMS IN
THE CLOUD
Presented By:
Group Id: 29
Priyanka Sangtani
Anshul Aggarwal
Pooja Jain
PROBLEM STATEMENT
The problem at hand is defining a resource provisioning
framework for MapReduce jobs running in a cloud, keeping in
mind performance goals such as resource utilization with
- an optimal number of map and reduce slots
- improvements in execution time
- a highly scalable solution
This is a design issue in the software frameworks available in
the cloud. Traditional provisioning frameworks provide users
with defaults that do not lend themselves well to MapReduce jobs.
Such jobs are highly parallelizable, and our proposed algorithm
exploits this fact to provide highly optimized resource
provisioning suited to MapReduce.
MAPREDUCE OVERVIEW
 In a typical MapReduce framework, data are
divided into blocks and distributed across many
nodes in a cluster and the MapReduce framework
takes advantage of data locality by shipping
computation to data rather than moving data to
where it is processed.
 Most input data blocks to MapReduce applications
are located on the local node, so they can be
loaded very fast and reading multiple blocks can be
done on multiple nodes in parallel.
 Therefore, MapReduce can achieve very high
aggregate I/O bandwidth and data processing rate.
WHY MAPREDUCE OPTIMIZATION
 The MapReduce programming paradigm lends itself well to most
data-intensive analytics jobs, given its ability to scale out and
leverage several machines to process data in parallel.
 Research has demonstrated that existing approaches to
provisioning other applications in the cloud are not immediately
applicable to MapReduce-based applications.
 MapReduce jobs have over 180 configuration parameters. Setting
a parameter too high can cause resource contention and degrade
overall performance; setting it too low might under-utilize the
resources and, once again, reduce performance.
 Each application has a different bottleneck resource
(CPU : disk : network), and a different bottleneck resource
utilization, and thus needs a different combination of these
parameters so that the bottleneck resource is maximally utilized.
WORK FLOW OF PROPOSED SOLUTION
User application → Signature Matching Algorithm (against a database of stored signatures)
- Match found (Yes) → reuse the stored optimal configuration
- No match (No) → SLO-Based Provisioning
→ Priority Algorithm → Bottleneck Removal
The Resource Provisioning Framework outputs the optimal no. of map / reduce slots.
PROPOSED ALGORITHM
1. Signature Matching
A sample of the input is run on the cloud to generate a resource consumption signature. This
signature is matched against a database. If a match is found, we can reuse the optimal configuration
stored for the matched signature; otherwise we move to SLO-based provisioning.
2. SLO-Based Resource Provisioning
Based on the number of map and reduce tasks, the available slots and the time constraints, we
calculate the optimal number of map and reduce tasks to run in parallel.
3. Priority Assignment
To give users better control over provisioning, we assign priorities in this stage.
4. Skew Mitigation
Managing parallel partitions.
5. Bottleneck Removal
The most common problem in parallel computation is bottlenecks.
6. Deadlock Detection and Removal
This stage deals with deadlock removal to improve execution time.
1 . SIGNATURE MATCHING
MATHEMATICAL MODEL
 The entire job run is split into n (a pre-chosen number) intervals,
each of the same duration.
 For the ith interval, compute the average consumption of each rth
resource. The resource types (us, sy, wa, id, bi, bo, ni, no, sr) are
% of CPU in user time, system time, waiting time and idle time,
disk blocks in, disk blocks out, network in, network out, and slow
ratio, respectively.
 Generate a resource consumption signature set for every rth
resource as
Srm = {Srm1, Srm2, ..., Srmn}
 The signature distance between a generated signature and a
signature from the database is computed as

χ²(S^R1_m, S^R2_m) = Σ_{i=1}^{n} (S^R1_mi − S^R2_mi)² / (S^R1_mi + S^R2_mi)

 χ² represents the vector distance between two signatures for a
particular resource r in the time-interval vector space. We compute
the scalar sum of χ² over all the resource types. A lower value of the
sum of χ² indicates more similar signatures. We choose the
configuration of the application whose signature distance sum is
closest to the new application's.
ALGORITHM
1. Take a sample input IS of appropriate size from the actual input.
2. Take a resource set RS.
3. Take the signature database, with average distance between signatures DAVG.
4. Split the entire job run into n (a pre-chosen number) intervals, each of the
same duration.
5. For each resource type in (us, sy, wa, id, bi, bo, ni, no, sr):
6.     For the ith interval from 1 to n:
7.         Compute the average resource consumption, generating a resource
consumption signature set Srm = {Srm1, Srm2, ..., Srmn} for every rth resource.
8. Set min_distance to a large sentinel value (e.g. 10000).
9. For every signature S in the database:
10.     Find the distance D between the calculated signature and S.
11.     If D < min_distance, set min_distance = D and Signature_matched = S.
12. Set a precision value P.
13. If min_distance > P * DAVG, return "no match found".
14. Else return Signature_matched.
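As a minimal sketch of steps 5–14 above (class and method names are illustrative, not from any existing implementation), the χ² distance and the nearest-signature scan can be written as:

```java
import java.util.List;

public class SignatureMatcher {
    // chi^2 distance between two per-resource signatures over n intervals:
    // sum over i of (s1[i] - s2[i])^2 / (s1[i] + s2[i])
    public static double chiSquare(double[] s1, double[] s2) {
        double sum = 0.0;
        for (int i = 0; i < s1.length; i++) {
            double denom = s1[i] + s2[i];
            if (denom > 0) sum += (s1[i] - s2[i]) * (s1[i] - s2[i]) / denom;
        }
        return sum;
    }

    // Total distance: scalar sum of chi^2 over all resource types (us, sy, wa, ...).
    public static double distance(double[][] a, double[][] b) {
        double total = 0.0;
        for (int r = 0; r < a.length; r++) total += chiSquare(a[r], b[r]);
        return total;
    }

    // Steps 8-14: scan the database for the closest signature; return its index,
    // or -1 when even the best distance exceeds the precision threshold P * DAVG.
    public static int match(double[][] probe, List<double[][]> database,
                            double p, double dAvg) {
        double minDistance = Double.MAX_VALUE;
        int matched = -1;
        for (int i = 0; i < database.size(); i++) {
            double d = distance(probe, database.get(i));
            if (d < minDistance) { minDistance = d; matched = i; }
        }
        return (minDistance > p * dAvg) ? -1 : matched;
    }
}
```

The χ² form keeps each per-interval term scale-free, so resources with large absolute values (e.g. disk blocks) do not dominate the sum.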
2. SLO-BASED PROVISIONING
Given a MapReduce job J with input dataset D, identify minimal combinations (S_M^J, S_R^J)
of map and reduce slots that can be allocated to job J so that it finishes within time T.
Step I: Create a compact job profile that reflects all phases of a given job: the map,
shuffle/sort and reduce phases.
Map stage: (M_min, M_avg, M_max, AvgSize_M^input, Selectivity_M)
Shuffle stage: (Sh_avg^1, Sh_max^1, Sh_avg^typ, Sh_max^typ)
Reduce stage: (R_min, R_avg, Selectivity_R)
Step II: There are three design choices for the completion time:
1) T is targeted as a lower bound on the job completion time. Typically, this leads to
the least amount of resources allocated to the job for finishing within deadline T.
The lower bound corresponds to an ideal computation under the allocated resources and
is rarely achievable in real environments.
2) T is targeted as an upper bound on the job completion time. Typically, this leads to a
more aggressive resource allocation and might lead to a job completion time that is
much smaller than T, because worst-case scenarios are also rare in production
settings.
3) T is targeted as the average of the lower and upper bounds on job
completion time. This more balanced resource allocation might provide a solution that
enables the job to complete within time T.
Mathematical Model
Makespan theorem: the makespan of a greedy assignment of n tasks with average duration avg
and maximum duration max onto k slots is at least n·avg/k and at most (n − 1)·avg/k + max.
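These two bounds can be computed directly; the sketch below is a hypothetical helper for n tasks with average duration avg and maximum duration max on k slots:

```java
public class MakespanBounds {
    // Lower bound: perfectly balanced greedy assignment, n * avg / k.
    public static double lower(int n, double avg, int k) {
        return n * avg / k;
    }

    // Upper bound: the longest task lands after the remaining n - 1 tasks
    // are spread evenly, (n - 1) * avg / k + max.
    public static double upper(int n, double avg, double max, int k) {
        return (n - 1) * avg / k + max;
    }
}
```

For example, 10 tasks averaging 2 s (max 3 s) on 5 slots give a lower bound of 4 s and an upper bound of 6.6 s.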
Suppose the dataset is partitioned into N_M^J map tasks and N_R^J reduce tasks. Let S_M^J and
S_R^J be the number of map and reduce slots.
By the makespan theorem, the lower and upper bounds on the duration of the entire map stage
(denoted T_M^low and T_M^up respectively) are estimated as follows:

T_M^low = N_M^J · M_avg / S_M^J
T_M^up = (N_M^J − 1) · M_avg / S_M^J + M_max

Similarly for the shuffle stage:

T_sh^low = (N_R^J / S_R^J − 1) · Sh_avg^typ
T_sh^up = ((N_R^J − 1) / S_R^J) · Sh_avg^typ + Sh_max^typ

Combining the stages (with T_R^low and T_R^up the analogous reduce-stage bounds):

T_J^low = T_M^low + Sh_avg^1 + T_sh^low + T_R^low
T_J^up = T_M^up + Sh_avg^1 + T_sh^up + T_R^up

Expanding the lower bound:

T_J^low = N_M^J · M_avg / S_M^J + N_R^J · (Sh_avg^typ + R_avg) / S_R^J + Sh_avg^1 − Sh_avg^typ
        = A_J^low · N_M^J / S_M^J + B_J^low · N_R^J / S_R^J + C_J^low

where
A_J^low = M_avg
B_J^low = Sh_avg^typ + R_avg
C_J^low = Sh_avg^1 − Sh_avg^typ

Taking T_J^low as T (the expected completion time):

T = A_J^low · N_M^J / S_M^J + B_J^low · N_R^J / S_R^J + C_J^low
In the algorithm, T is targeted as a lower bound of the job completion time. The algorithm sweeps
through the entire range of map slot allocations and finds the corresponding values of reduce slots that
are needed to complete the job within time T.
Resource allocation algorithm
Input:
    Job profile of J
    (N_M^J, N_R^J) ← number of map and reduce tasks of J
    (S_M, S_R) ← total number of map and reduce slots in the cluster
    T ← deadline by which the job must be completed
Output: P ← set of plausible resource allocations (S_M^J, S_R^J)
Algorithm:
for S_M^J ← min(N_M^J, S_M) down to 1 do
    Solve A_J^low · N_M^J / S_M^J + B_J^low · N_R^J / S_R^J = T − C_J^low for S_R^J
    if 0 < S_R^J ≤ S_R then
        P ← P ∪ (S_M^J, S_R^J)
    else
        // Job cannot be completed within deadline T
        // with the allocated map slots
        break out of the loop
    end if
end for
The complexity of the proposed algorithm is O(min(N_M^J, S_M)), i.e. linear in the number of
map slots.
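The sweep above can be sketched as follows (class and parameter names are illustrative; the solved S_R^J is rounded up to a whole number of slots):

```java
import java.util.ArrayList;
import java.util.List;

public class SloPlanner {
    // Sweep S_M^J from min(N_M^J, S_M) down to 1, solving
    // T = A*N_M/S_M + B*N_R/S_R + C for S_R at each step.
    public static List<int[]> plausibleAllocations(int nMap, int nRed,
            int clusterMapSlots, int clusterRedSlots,
            double mAvg, double shTypAvg, double rAvg,
            double sh1Avg, double deadline) {
        double a = mAvg;               // A_J^low
        double b = shTypAvg + rAvg;    // B_J^low
        double c = sh1Avg - shTypAvg;  // C_J^low
        List<int[]> plans = new ArrayList<>();
        for (int sm = Math.min(nMap, clusterMapSlots); sm >= 1; sm--) {
            // Time remaining for shuffle + reduce once the map stage is paid for.
            double budget = deadline - c - a * nMap / sm;
            if (budget <= 0) break;    // deadline unreachable with this few map slots
            int sr = (int) Math.ceil(b * nRed / budget);  // round up to whole slots
            if (sr > clusterRedSlots) break;  // and for every smaller sm as well
            plans.add(new int[]{sm, sr});
        }
        return plans;
    }
}
```

Because shrinking the map allocation only ever increases the reduce slots required, the loop can stop at the first infeasible point, which is what makes the sweep linear.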
3. PRIORITY ALGORITHM
 Workflow Priority
o prioritizes entire workflows
o increases spending on workflows that are more important
and drops spending on less important workflows
o importance may be implied by proximity to deadline, current
demand for the anticipated output, or whether the application is in a
test or production phase
 Stage Priority
o prioritizes different stages of a single workflow
o the system splits a budget according to user-defined weights
o the budget is split within the workflow across the different stages
o by spending more on phases where resources are more critical,
the overall utility of the workflow may be increased
MATHEMATICAL MODEL
 Workflow priority
o Say we have n workflows with weight vector w, i.e.
w = [w1, w2, ..., wn]
o The total weight of the job is
W = w1 + w2 + ... + wn
o The budget for workflow i is
bwi = bs * wi / W
where bs is the total budget of the job.
 Stage priority
o Say we have m stages with weight vector sw, i.e.
sw = [sw1, sw2, ..., swm]
o The total weight of the workflow is
SW = sw1 + sw2 + ... + swm
o The budget for stage i is
bswi = bw * swi / SW
where bw is the total budget of the workflow.
ALGORITHM
1. Consider a job with n workflows, each workflow
consisting of m stages.
2. Users are asked to input the total budget, the workflow
priorities and the stage priorities.
3. Low priority has value 1 and high priority has value 0.5,
so as to spend double on high priority.
4. Calculate the budget for each workflow: bwi = bs * wi / W.
5. Use bwi to find the resource share for a workflow.
6. Calculate the budget for each stage: bswi = bw * swi / SW.
7. Use bswi to find the resource share for a stage.
8. A higher-priority workflow or stage is given more cost and time
for execution; high-priority tasks thus have a higher
spending rate, i.e. a higher b/d ratio.
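Steps 4 and 6 apply the same proportional rule at both the workflow and stage level, so one helper covers both; this is a minimal sketch with an illustrative class name:

```java
public class PriorityBudget {
    // b_i = total * w_i / W, where W = sum of all weights.
    // Used with (bs, workflow weights) to get per-workflow budgets,
    // then with (bw_i, stage weights) to get per-stage budgets.
    public static double[] split(double total, double[] weights) {
        double sum = 0.0;
        for (double w : weights) sum += w;
        double[] budgets = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            budgets[i] = total * weights[i] / sum;
        }
        return budgets;
    }
}
```

For example, splitting a budget of 100 over weights [1, 0.5, 0.5] yields [50, 25, 25].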
SKEW MITIGATION
 To support parallelism, partitions must be small enough that
several partitions can be processed in parallel. To avoid record skew,
select a partitioning function that keeps each partition roughly the same size.
 On each node, we apply the map operation to a prefix of the records in
each input file stored on that node.
 As the map function produces records, the node records information
about the intermediate data, such as how much larger or smaller it is
than the input and the number of records generated. It also stores
information about each intermediate key and the associated record's
size.
 It sends that metadata to the coordinator. The coordinator merges the
metadata from all the nodes to estimate the intermediate data size.
It then uses this size, and the desired partition size, to compute the
number of partitions.
 Then it performs a streaming merge-sort on the samples from each
node. Once all the sampled data is sorted, partition boundaries are
calculated based on the desired partition sizes. The result is a list of
"boundary keys" that define the edges of each partition.
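The partition-count and boundary-key computations described above can be sketched as follows (illustrative names; assumes the coordinator already holds the globally sorted sample keys):

```java
import java.util.ArrayList;
import java.util.List;

public class SkewPartitioner {
    // Number of partitions = ceil(estimated intermediate size / desired partition size).
    public static int numPartitions(long estimatedSize, long desiredPartitionSize) {
        return (int) ((estimatedSize + desiredPartitionSize - 1) / desiredPartitionSize);
    }

    // Pick numPartitions - 1 boundary keys at evenly spaced ranks in the
    // sorted sample, so each partition covers a roughly equal share of keys.
    public static List<String> boundaryKeys(List<String> sortedSamples, int numPartitions) {
        List<String> boundaries = new ArrayList<>();
        int n = sortedSamples.size();
        for (int p = 1; p < numPartitions; p++) {
            boundaries.add(sortedSamples.get(p * n / numPartitions));
        }
        return boundaries;
    }
}
```

Picking boundaries by rank in the sample, rather than by key range, is what keeps partitions balanced even when the key distribution is skewed.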
BOTTLENECK REMOVAL
 A MapReduce system can simultaneously run multiple jobs
competing for each node's resources and network bandwidth.
 These conflicts cause slowdowns in the execution of tasks. The
duration of each phase, and hence the duration of the job, is determined
by the slowest, or straggler, task.
 The slowdowns of individual tasks are highly correlated with overall
job latencies.
 Moreover, significant task slowdowns tend to indicate bottlenecks in
job execution.
MATHEMATICAL MODEL
Bottleneck detection
 T_i^e is the expected execution time of task i.
 T_i^r is the running time of task i.
 T_i^e > T_i^r means no bottleneck.
 T_i^r − T_i^e > t means a bottleneck is present, where t is a threshold derived from past
data: if a task has been running for t longer than expected, a bottleneck is detected.
Bottleneck elimination
 n_i = number of idle nodes, n_a = number of active nodes, f = boost factor
 To reduce the bottleneck, we distribute tasks such that the total spending equals the
average spending, i.e. b/d.
 Spending at an active node = (b/d) · (1 + (n_i/n_a) · f)
 Spending at an idle node = (b/d) · (1 − f)
 Expected spending over all nodes:
E = n_a/(n_a + n_i) · (b/d) · (1 + (n_i/n_a) · f) + n_i/(n_a + n_i) · (b/d) · (1 − f)
  = b / ((n_a + n_i) · d) · (n_a + n_i·f + n_i − n_i·f)
  = b / ((n_a + n_i) · d) · (n_a + n_i)
  = b/d
  = average spending
ALGORITHM
 Bottleneck avoidance
Step 1: Compute task and node features
1. Run the task on the cloud.
2. Collect the performance traces every 10 minutes and store the results in a file.
Step 2: Compute the slowdown factor
1. Compare the current job trace with already completed jobs.
2. Calculate the slowdown factor, which is the ratio of the current job's parameters
to those of a similar job.
Step 3: Give the slowdown factor of each job to the scheduler
1. The scheduler schedules high-slowdown jobs first.
2. The scheduler does not schedule high-slowdown jobs onto congested hardware nodes.
 Bottleneck detection
Step 1: Estimate the execution time of each job using historical data.
Step 2: Periodically compute the time for which a job has been running.
Step 3: Compare the expected execution time with the running time:
1. If T_i^e > T_i^r, no bottleneck.
2. Else if T_i^r − T_i^e > t, a bottleneck has occurred.
 Bottleneck elimination
To reduce execution time we can carry out the bottleneck elimination
algorithm, which schedules redundant copies of the remaining tasks across
nodes that have no other work to perform.
Bottleneck elimination algorithm
1. idle ← GETIDLENODES(nodes)
2. active ← nodes − idle
3. ni ← SIZE(idle)
4. na ← SIZE(active)
5. for each node ∈ active:
       node.spending ← (b/d) · (1 + (ni/na) · f)
6. for each node ∈ idle:
       node.spending ← (b/d) · (1 − f)
where f is a boost factor between 0 and 1, set by the
user; b is the budget and d the duration.
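The spending rules in steps 5–6, plus a check that they average back to b/d, can be sketched as follows (illustrative names):

```java
public class BottleneckSpending {
    // Spending at an active node: (b/d) * (1 + (ni/na) * f).
    public static double activeSpending(double b, double d, int ni, int na, double f) {
        return b / d * (1 + ((double) ni / na) * f);
    }

    // Spending at an idle node: (b/d) * (1 - f).
    public static double idleSpending(double b, double d, double f) {
        return b / d * (1 - f);
    }

    // Node-weighted average of the two rules; algebraically this
    // collapses back to b/d, the average spending.
    public static double averageSpending(double b, double d, int ni, int na, double f) {
        return (na * activeSpending(b, d, ni, na, f) + ni * idleSpending(b, d, f))
                / (na + ni);
    }
}
```

The boost that active nodes receive is exactly funded by what idle nodes give up, so the budget constraint is preserved for any f in [0, 1].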
DEADLOCK
A deadlock may occur between mappers and reducers, with no progress in the job,
when:
 the initially available map/reduce slots were all allocated to mappers;
 once a few mappers completed, reducers started occupying some of the slots;
 after a while, all slots were occupied by reducers;
 since there were still mapper tasks not yet assigned any slot, the map phase never
completed;
 the system entered a deadlock state where reducers occupy all available slots but
are waiting for the mappers to complete, while mappers cannot move forward because
no slot is available.
Deadlock prevention:
Unlike existing MapReduce systems, which execute map and reduce tasks
concurrently in waves, we can implement the MapReduce programming model in two
phases of operation:
 Phase 1: Map and shuffle
The Reader stage reads records from an input disk and sends them to the Mapper
stage, which applies the map function to each record. As the map function produces
intermediate records, each record's key is hashed to determine the node to which it
should be sent, and the record is placed in a per-destination buffer that is handed to the
sender when it is full.
 Phase 2: Sort and reduce
In phase two, each partition must be sorted by key, and the reduce function must be
applied to groups of records with the same key.
Deadlock Detection:
 The deadlock detector periodically probes workers to see if they are waiting for a
memory allocation request to complete.
 If multiple probe cycles pass in which all workers are waiting for an allocation or are
idle, the deadlock detector informs the memory allocator that a deadlock has
occurred.
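A minimal sketch of this probe-counting logic (names are illustrative, not from any real memory allocator):

```java
public class DeadlockProbe {
    // One probe cycle: if every worker is waiting for a memory allocation or
    // idle, extend the run of "stuck" cycles; any active worker resets it.
    public static int update(boolean[] waitingOrIdle, int stuckCycles) {
        for (boolean w : waitingOrIdle) {
            if (!w) return 0;  // at least one worker is making progress
        }
        return stuckCycles + 1;
    }

    // Deadlock is declared only after multiple consecutive stuck cycles,
    // to avoid flagging a transient allocation stall.
    public static boolean isDeadlocked(int stuckCycles, int threshold) {
        return stuckCycles >= threshold;
    }
}
```

Requiring several consecutive stuck cycles is what distinguishes a genuine deadlock from momentary memory pressure.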
Deadlock Elimination
 Process Termination: One or more processes involved in the deadlock may be
aborted. We can choose to abort all processes involved in the deadlock; this ensures
that the deadlock is resolved with certainty and speed.
 Resource Preemption: Resources allocated to various processes may be
successively preempted and allocated to other processes until the deadlock is broken.
IMPLEMENTATION FRAMEWORK
 Apache Hadoop is an open-source implementation of the MapReduce
programming model, supported by Yahoo and used by Google, Amazon, etc.
 It also includes the underlying Hadoop Distributed File System (HDFS).
 Hadoop has over 180 configuration parameters. Examples include the number of
replicas of input data, the number of parallel map/reduce tasks to run, and the
number of parallel connections for transferring data.
 A Hadoop installation comes with a default set of values for all the parameters in
its configuration.
 Scheduling in Hadoop is performed by a master node.
 Hadoop has a variety of schedulers. The original one schedules all jobs using a
FIFO queue in the master. Another, Hadoop on Demand (HOD), creates
private MapReduce clusters dynamically and manages them using the Torque
batch scheduler.
CHALLENGES IN MAPREDUCE SIMULATIONS
 The right level of abstraction.
 Data layout aware.
 Resource contention aware.
 Heterogeneity modeling.
 Resource heterogeneity is common in large clusters.
 Input dependence.
 Workload aware.
 Verification.
 Performance
Comparison of MapReduce simulators

Simulator      | Based on     | Language | GUI Support | Workload-aware | Resource-contention aware
MRPerf         | Ns-2         | Java     | Yes         | Yes            | Yes
Cardona et al. | GridSim      | C        | No          | Yes            | No
Mumak          | Hadoop       | C        | No          | Yes            | No
SimMR          | From scratch | -        | -           | Yes            | No
HSim           | From scratch | -        | -           | No             | Yes
MRSim          | GridSim      | Java     | Yes         | No             | Yes
SimMapReduce   | GridSim      | Java     | Yes         | No             | Yes
 Prior simulators for evaluating schedulers are trace-driven and
aware of other jobs in a workload, but they are limited in that they
are not aware of resource contention, so task execution times
may not be accurate. Our algorithm optimizes resource
provisioning, so we require a resource-contention-aware
simulator.
 It is almost impractical to set up a very large cluster consisting of
hundreds or thousands of nodes to measure the scalability of
an algorithm. Setting up a Hadoop environment involves altering
a great number of parameters that are crucial to achieving the
best performance. An obvious solution to both problems is a
simulator of the Hadoop environment: on the one hand it lets us
measure the scalability of MapReduce-based applications easily and
quickly; on the other, it lets us determine the effects of different
Hadoop configurations on the speed of MapReduce-based
applications.
 MRPerf is implemented on top of ns-2, a packet-level
network simulator, and its performance is much worse than
that of other simulators. It cannot generate accurate results for
jobs running different types of algorithms or different cluster
configurations.
 No existing implementation of HSim is available, so using it
would require a lot of work starting from scratch.
 Most current work in cloud computing is done on the
CloudSim simulator, but since our problem entails the
MapReduce model and CloudSim provides no implementation
supporting MapReduce, we are not using it.
 MRSim extends the SimJava discrete event engine to
accurately simulate the Hadoop environment. Using SimJava
we simulate the interactions between the different entities within a
cluster; the GridSim package is used for network simulation.
MRSim is written in Java on top of SimJava.
MRSIM ARCHITECTURE
 The MRSim model simulates network topology and
traffic using GridSim, and models the rest of the system
entities using the SimJava discrete event engine. The
system is designed using object-oriented models.
 Each machine is part of the network topology model.
Each machine can host a Job Tracker process and a
Task Tracker process; however, there is only one
Job Tracker per MapReduce cluster. Each Task
Tracker model can launch several map and reduce
tasks, up to the maximum number allowed in the
configuration files.
WHAT IS SIMJAVA?
 SimJava is a discrete event, process oriented simulation
package. It is an API that augments Java with building blocks
for defining and running simulations.
 Each system is considered to be a set of interacting
processes or entities as they are referred to in SimJava. These
entities communicate with each other by passing events. The
simulation time progresses on the basis of these events.
 Progress is recorded as trace messages and saved in a file.
 As of version 2.0, SimJava has been augmented with
considerable statistical and reporting support.
CONSTRUCTING A SIMULATION INVOLVES:
 coding the behavior of simulation entities, done by
extending the sim_entity class and using the body()
method;
 adding instances of these entities to the sim_system
object using sim_system.add(entity);
 linking entities' ports together using
sim_system.link_ports();
 finally, setting the simulation in motion using
sim_system.run().
GRIDSIM
 allows modelling and simulation of entities in parallel and distributed
computing (PDC) systems: users, applications, resources, and resource
brokers (schedulers), for the design and evaluation of scheduling
algorithms.
 provides a comprehensive facility for creating different classes of
heterogeneous resources that can be aggregated using resource
brokers for solving compute- and data-intensive applications. A
resource can be a single processor or a multi-processor with shared or
distributed memory, managed by time-shared or space-shared schedulers.
The processing nodes within a resource can be heterogeneous in
terms of processing capability, configuration, and availability. The
resource brokers use scheduling algorithms or policies for mapping
jobs to resources, optimizing system or user objectives depending on
their goals.
JACKSON MODEL
The Jackson API contains a lot of functionality for reading and
building JSON using Java.
It has very powerful data binding capabilities and provides
a framework to serialize custom Java objects to JSON strings
and deserialize JSON strings back to Java objects.
 JSON written with Jackson can contain embedded class
information that helps in creating the complete object tree
during deserialization.
JACKSON API
// 1. Convert a Java object to JSON
ObjectMapper mapper = new ObjectMapper();
mapper.writeValue(new File("c:\\user.json"), user);

// 2. Convert JSON to a Java object
ObjectMapper mapper = new ObjectMapper();
User user = mapper.readValue(new File("c:\\user.json"), User.class);
JOB TRACKER LAYOUT
 The main component of the simulator is the Job Tracker, which
controls the generation of map and reduce tasks, monitors when
different phases complete, and produces the final results.
 A map task is started by the Job Tracker. The following processes
take place:
• A Java VM is instantiated for the task.
• Data is read from the local disk or requested
remotely.
• Map, sort, and spill operations are performed on the
input data until all of it has been consumed.
• Background file-system mergers merge the
output data to reduce the number of output files to
one or a few files.
• A message indicating the completion of the map
task is returned to the Job Tracker.
DEMO – MRSIM
COMPARISON PARAMETERS
 Number of map and reduce slots
 CPU Usage
 Hard-disk Utilization
 Average Mapper Time
 Average Reducer Time
 Execution Time
JOB PROFILES
Referred from "Resource Provisioning Framework for MapReduce Jobs with
Performance Goals", Abhishek Verma, Ludmila Cherkasova, and
Roy H. Campbell.
TIME DURATION FOR DIFFERENT PHASES
PROFILE                           | NoOfMap, NoOfReduce | Algorithm    | T1   | T2   | T3
Profile1                          | 7,10                | SLO          | 1398 | 1344 | 1357
                                  |                     | SIGN + PRIOR | 1209 | 1207 | 1217
Profile2                          | 7,10                | SLO          | 1367 | 1368 | 1387
                                  |                     | SIGN + PRIOR | 1276 | 1256 | 1273
Profile3                          | 3,12                | SLO          | 1397 | 1380 | 1363
                                  |                     | SIGN + PRIOR | 1245 | 1288 | 1253
Profile4                          | 12,16               | SLO          | 1320 | 1402 | 1409
                                  |                     | SIGN + PRIOR | 1263 | 1285 | 1207
Profile5                          | 46,14               | SLO          | 1316 | 1368 | 1353
                                  |                     | SIGN + PRIOR | 1208 | 1254 | 1256
Profile6                          | 12,2                | SLO          | 1342 | 1376 | 1332
                                  |                     | SIGN + PRIOR | 1267 | 1265 | 1287
Profile7 (job can't be completed) | 22,33               | SLO          | 472  | 450  | 430
                                  |                     | SIGN + PRIOR | 0    | 0    | 0
Profile8                          | 16,12               | SLO          | 1327 | 1396 | 1376
                                  |                     | SIGN + PRIOR | 1233 | 1265 | 1274
MEAN TIME OVERHEADS FOR VARIOUS PHASES
SLO failed (job can't be completed within deadline) | 420
SLO executed                                        | 1334
Signature not found                                 | 1337
Signature found                                     | 937
Priority                                            | 331
COMPARISON OF BASE ALGORITHM VS PROPOSED
ALGORITHM
For each profile, the first row gives the base algorithm and the second row our algorithm.

PROFILE   | Mappers | Reducers | Algorithm | CPU usage   | HDD utilization | Time | Avg mapper time | Avg reducer time
Profile 1 | 60      | 1        | Base      | 0.00001429  | 0.00105         | 1919 | 28.021          | 238.179
          |         |          | Ours      | 0.0000020   | 0.00403         | 2372 | 25.313          | 853.76
Profile 2 | 7       | 10       | Base      | 0.000001653 | 0.001834        | 5200 | 291.21          | 316.163
          |         |          | Ours      | 0.0002732   | 0.003917        | 4095 | 283.891         | 112.045
Profile 3 | 7       | 10       | Base      | 0.000003592 | 0.0031320       | 3044 | 314.459         | 84.322
          |         |          | Ours      | 0.00913784  | 0.01550         | 4108 | 281.432         | 114.249
Profile 4 | 3       | 12       | Base      | 0.0000023   | 0.03093         | 4259 | 1143.458        | 69.098
          |         |          | Ours      | 0.0008095   | 0.01197         | 4066 | 425.292         | 108.949
Profile 5 | 12      | 16       | Base      | 0.000015307 | 0.002802        | 5239 | 164.185         | 204.315
          |         |          | Ours      | 0.001846    | 0.022107        | 4240 | 286.6           | 124.45
Profile 6 | 46      | 14       | Base      | 0.000036771 | 0.0024045       | 4163 | 426.536         | 117.796
          |         |          | Ours      | 0.0010386   | 0.01082         | 3171 | 44.416          | 105.881
Profile 7 | 12      | 2        | Base      | 0.00021723  | 0.005321        | 3986 | 205.405         | 137.099
          |         |          | Ours      | 0.0003971   | 0.007538        | 2739 | 426.411         | 100.124
Profile 8 | 16      | 12       | Base      | 0.00010813  | 0.0028452       | 4136 | 426.987         | 75.338
          |         |          | Ours      | 0.00478452  | 0.0093604       | 2863 | 122.479         | 114.748
CPU UTILIZATION
[Chart: CPU utilization per profile, Base Algorithm vs Proposed Algorithm]

HARD-DISK UTILIZATION
[Chart: hard-disk utilization per profile, Base Algorithm vs Proposed Algorithm]

EXECUTION TIME
[Chart: execution time per profile, Base Algorithm vs Proposed Algorithm]

AVERAGE MAPPER TIME
[Chart: average mapper time per profile, Base Algorithm vs Proposed Algorithm]

AVERAGE REDUCER TIME
[Chart: average reducer time per profile, Base Algorithm vs Proposed Algorithm]
RESULTS FOR JOB PROFILE 1
GRAPHS FOR PROFILE 1
RESULTS FOR JOB PROFILE 2
GRAPHICAL COMPARISON FOR PROFILE 2
TRACE FOR EXECUTION
 INFO GUISimulator:114 - <init>- done
 Initialising...
 INFO HTopology:112 - initGridSim- Initializing GridSim package
 Initialising...
 INFO HSimulator:64 - initSimulator- creat new Result dir /home/hadoop/workspace/work/hadoop.simulator/results/26-27-
Apr-2010 19:57:55
 INFO HJobTracker:311 - createEntities- create topology
 INFO HJobTracker:314 - createEntities- config.Heartbeat:1.0, read topology.getName:rack 0
 INFO HJobTracker:318 - createEntities- init NetEnd from rack
 INFO GUISimulator:389 - mnuSimStartActionPerformed- simulator has started simulator
 INFO HSimulator:106 - startSimulator- Starting simulator version
 INFO HSimulator:117 - startSimulator- trace level200
 INFO HSimulator:120 - startSimulator- graph file: /home/hadoop/workspace/work/hadoop.simulator/results/26-27-Apr-
2010 19:57:55/graph.sjg
 INFO HSimulator:125 - startSimulator- going to call Sim_system.run()
 Entities started.
 Entity huser has no body().
 INFO HJobTracker:129 - body- start entity
 INFO SimoTreeCollector:94 - body- add rack {m1=m1}
 INFO GUISimulator:394 - mnuSimStopActionPerformed- going to stop simulator
 INFO HTopology:252 - stopSimulation- Stopping NetEnd Simulation
TRACE CONTINUED…
 INFO HJobTracker:622 - stopSimulation- send end of simualtion 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0
 INFO CPU:148 - body- cpu_m1 END_OF_SIMULATION 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0
 INFO HDD:148 - body- hdd_m1 END_OF_SIMULATION 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0
 INFO NetEnd:100 - body- m1 end simulation at time 10.0
 INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0
 INFO SimoTreeCollector:78 - body- simotree END_OF_SIMULATION 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0
OUTPUT SNAPSHOTS FOR PROPOSED
ALGORITHM
REFERENCES
[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, “Predicting execution bottlenecks in map-reduce
clusters” In Proc. of the 4th USENIX conference on Hot Topics in Cloud computing, 2012.
[2] R. Buyya, S. K. Garg, and R. N. Calheiros, “SLA-Oriented Resource Provisioning for Cloud
Computing: Challenges, Architecture, and Solutions” In International Conference on Cloud and
Service Computing, 2011.
[3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, “Optimization of Resource Provisioning Cost in Cloud
Computing” in Transactions On Service Computing, Vol. 5, No. 2, IEEE, April-June 2012
[4] A. Verma, L. Cherkasova, and R.H. Campbell, “Resource Provisioning Framework for MapReduce Jobs
with Performance Goals”, in Middleware 2011, LNCS 7049, pp. 165–186, 2011
[5] J. Dean, and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”,
Communications of the ACM, Jan 2008
[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, “Resource Provisioning for Cloud Computing” In Proc.
of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009.
[7] K. Kambatla, A. Pathak, and H. Pucha, “Towards optimizing hadoop provisioning in the cloud”, in
Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009
[8] Kuyoro S. O., Ibikunle F. and Awodele O., “Cloud Computing Security Issues and Challenges” in
International Journal of Computer Networks (IJCN), Vol. 3, Issue 5, 2011
[9] R. Lammel, “Google’s MapReduce programming model – Revisited” in Journal of Science of
Computer Programming, Oct 2007
[10] R. P. Padhy, “Big Data Processing with Hadoop-MapReduce in Cloud Systems” In International
Journal of Cloud Computing and Services Science, vol. 2, Feb 2013.
[11] B. Palanisamy, A. Singh, L. Liu and B. Langston, "Cura: A Cost-Optimized Model for MapReduce
in a Cloud", Proc. of 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS
2013)
[12] A. Rasmussen, M. Conley, R. Kapoor, V. T. Lam, G. Porter, and A. Vahdat, “Themis: An I/O-
Efficient MapReduce”, Communications of the ACM, Oct 2012
[13] V. K. Reddy, B. T. Rao, L.S.S. Reddy, and P. S. Kiran, “Research Issues in Cloud Computing”,
in Global Journal of Computer Science and Technology, vol. 11, Jul 2011
[14] T. Sandholm and K. Lai, “MapReduce Optimization Using Regulated Dynamic Prioritization” in
Social Computing Laboratory, Hewlett-Packard Laboratories, 2011
[15] F. Tian, K. Chen,”Towards Optimal Resource Provisioning for Running MapReduce Programs in
Public Clouds”, in 4th Intl. Conference on Cloud Computing, IEEE, 2011
[16] Hadoop. http://hadoop.apache.org.
[17] Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/
A Novel Technique in Software Engineering for Building Scalable Large Paralle...A Novel Technique in Software Engineering for Building Scalable Large Paralle...
A Novel Technique in Software Engineering for Building Scalable Large Paralle...
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola Peric
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
 
Masters Report 3
Masters Report 3Masters Report 3
Masters Report 3
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Pearson1e ch14 appendix_14_1
Pearson1e ch14 appendix_14_1Pearson1e ch14 appendix_14_1
Pearson1e ch14 appendix_14_1
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...
AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...
AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
Parallel programming
Parallel programmingParallel programming
Parallel programming
 
RTS
RTSRTS
RTS
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Data science
Data scienceData science
Data science
 

Semelhante a OPTIMAL MAPREDUCE RESOURCE PROVISIONING IN THE CLOUD

Scheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic AlgorithmScheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic Algorithmiosrjce
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsLEGATO project
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.pptCheeWeiTan10
 
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...IRJET Journal
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSArchana Gopinath
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 
Performance analysis and randamized agoritham
Performance analysis and randamized agorithamPerformance analysis and randamized agoritham
Performance analysis and randamized agorithamlilyMalar1
 
An Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemAn Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemIRJET Journal
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptxShimoFcis
 
Wei's notes on MapReduce Scheduling
Wei's notes on MapReduce SchedulingWei's notes on MapReduce Scheduling
Wei's notes on MapReduce SchedulingLu Wei
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 

Semelhante a OPTIMAL MAPREDUCE RESOURCE PROVISIONING IN THE CLOUD (20)

M017327378
M017327378M017327378
M017327378
 
Scheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic AlgorithmScheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic Algorithm
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Unit 2
Unit 2Unit 2
Unit 2
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Resource management
Resource managementResource management
Resource management
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
Skyline queries
Skyline queriesSkyline queries
Skyline queries
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICS
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
Performance analysis and randamized agoritham
Performance analysis and randamized agorithamPerformance analysis and randamized agoritham
Performance analysis and randamized agoritham
 
An Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemAn Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing System
 
Unit3 MapReduce
Unit3 MapReduceUnit3 MapReduce
Unit3 MapReduce
 
Green scheduling
Green schedulingGreen scheduling
Green scheduling
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Wei's notes on MapReduce Scheduling
Wei's notes on MapReduce SchedulingWei's notes on MapReduce Scheduling
Wei's notes on MapReduce Scheduling
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 

Mais de Deanna Kosaraju

Speak Out and Change the World! Voices 2015
Speak Out and Change the World!   Voices 2015Speak Out and Change the World!   Voices 2015
Speak Out and Change the World! Voices 2015Deanna Kosaraju
 
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Deanna Kosaraju
 
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...Deanna Kosaraju
 
Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Deanna Kosaraju
 
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Deanna Kosaraju
 
The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015Deanna Kosaraju
 
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Deanna Kosaraju
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Deanna Kosaraju
 
Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Deanna Kosaraju
 
Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Deanna Kosaraju
 
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Deanna Kosaraju
 
ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015Deanna Kosaraju
 
Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Deanna Kosaraju
 
Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Deanna Kosaraju
 
Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Deanna Kosaraju
 
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...Deanna Kosaraju
 
Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Deanna Kosaraju
 
Agility and cloud computing
Agility and cloud computingAgility and cloud computing
Agility and cloud computingDeanna Kosaraju
 

Mais de Deanna Kosaraju (20)

Speak Out and Change the World! Voices 2015
Speak Out and Change the World!   Voices 2015Speak Out and Change the World!   Voices 2015
Speak Out and Change the World! Voices 2015
 
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
 
Change IT! Voices 2015
Change IT! Voices 2015Change IT! Voices 2015
Change IT! Voices 2015
 
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
 
Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015
 
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
 
The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015
 
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
 
Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015
 
Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015
 
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
 
ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015
 
Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015
 
Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015
 
Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015
 
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
 
Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015
 
Agility and cloud computing
Agility and cloud computingAgility and cloud computing
Agility and cloud computing
 

Último

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 

Último (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 

  • 5. WHY MAPREDUCE OPTIMIZATION  The MapReduce programming paradigm lends itself well to most data-intensive analytics jobs, given its ability to scale out and leverage several machines to process data in parallel.  Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately applicable to MapReduce-based applications.  MapReduce jobs have over 180 configuration parameters. Setting a value too high can cause resource contention and degrade overall performance; setting it too low might under-utilize the resources and, once again, reduce performance.  Each application has a different bottleneck resource (CPU : disk : network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters so that the bottleneck resource is maximally utilized.
  • 6. WORK FLOW OF PROPOSED SOLUTION [Flow diagram] User application → signature matching algorithm (against a database of signatures): if a match is found (Yes), go directly to the resource provisioning framework; if not (No), run SLO-based provisioning → priority algorithm → bottleneck removal → resource provisioning framework → optimal number of map/reduce slots.
  • 7. PROPOSED ALGORITHM 1. Signature matching: A sample of the input is run on the cloud to generate a resource consumption signature, which is matched against a database. If a match is found, we use the optimal configuration stored for the matched signature; otherwise we move to SLO-based provisioning. 2. SLO-based resource provisioning: Based on the number of map and reduce tasks, the available slots, and the time constraints, we calculate the optimal number of map and reduce tasks to run in parallel. 3. Priority assignment: To give users better control over provisioning, priorities are assigned in this stage. 4. Skew mitigation: Managing parallel partitions. 5. Bottleneck removal: The most common problem in parallel computation is the bottleneck. 6. Deadlock detection and removal: This stage removes deadlocks to improve execution time.
  • 8. 1. SIGNATURE MATCHING
  • 9. MATHEMATICAL MODEL  The entire job run is split into n (a pre-chosen number) intervals of equal duration.  For the ith interval, compute the average consumption of each resource r. The resource types (us, sy, wa, id, bi, bo, ni, no, sr) are % CPU in user time, system time, waiting time, idle time, disk blocks in, disk blocks out, network in, network out, and slow ratio, respectively.  Generate a resource consumption signature set Sr for every resource r as Srm = {Srm1, Srm2, ..., Srmn}.  The distance between a generated signature and a signature in the database is computed as χ²(S_R1_m, S_R2_m) = Σ_{i=1..n} (S_R1_mi − S_R2_mi)² / (S_R1_mi + S_R2_mi)  χ² represents the vector distance between two signatures for a particular resource r in the time-interval vector space. We compute the scalar sum of χ² over all resource types; a lower sum indicates more similar signatures. We choose the configuration of the application whose signature distance sum to the new application is smallest.
  • 10. ALGORITHM 1. Take a sample input IS of appropriate size from the actual input. 2. Take a resource set RS. 3. Take the signature database with average distance between signatures DAVG. 4. Split the entire job run into n (a pre-chosen number) intervals of equal duration. 5. For each resource type in (us, sy, wa, id, bi, bo, ni, no, sr): 6. For the ith interval from 1 to n: 7. Compute the average resource consumption, generating a resource consumption signature set Sr for every resource r as Srm = {Srm1, Srm2, ..., Srmn}. 8. Set min_distance = ∞ (a sufficiently large sentinel). 9. For every signature S in the database: 10. Find the distance D between the calculated signature and S. 11. If D < min_distance, set min_distance = D and Signature_matched = S. 12. Set a precision value P. 13. If min_distance > P*DAVG, return "no match found". 14. Else return Signature_matched.
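The matching step above can be sketched as follows; class and method names are illustrative (not from an existing codebase), and the signature database is modeled as an in-memory list of per-resource interval vectors.

```java
import java.util.*;

// Sketch of signature matching: a signature is one vector of per-interval
// averages per resource type (us, sy, wa, id, bi, bo, ni, no, sr); the
// distance between two applications is the scalar sum of the chi-square
// vector distances over all resources.
public class SignatureMatcher {

    // Chi-square distance between two n-interval signatures of one resource.
    static double chiSquare(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double sum = a[i] + b[i];
            if (sum > 0) {                      // skip empty intervals
                double diff = a[i] - b[i];
                d += diff * diff / sum;
            }
        }
        return d;
    }

    // Total distance: sum of chi-square distances over all resource types.
    static double distance(double[][] s1, double[][] s2) {
        double total = 0.0;
        for (int r = 0; r < s1.length; r++) total += chiSquare(s1[r], s2[r]);
        return total;
    }

    // Index of the closest stored signature, or -1 when even the best match
    // exceeds precision * avgDistance (i.e., "no match found").
    static int match(double[][] probe, List<double[][]> db,
                     double precision, double avgDistance) {
        int best = -1;
        double min = Double.MAX_VALUE;          // the "large sentinel" of step 8
        for (int k = 0; k < db.size(); k++) {
            double d = distance(probe, db.get(k));
            if (d < min) { min = d; best = k; }
        }
        return (min > precision * avgDistance) ? -1 : best;
    }
}
```

An identical signature yields distance 0, so a stored copy of the probe always matches itself before any threshold is applied.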
  • 11. 2. SLO-BASED PROVISIONING Given a MapReduce job J with input dataset D, identify minimal combinations (S_M_J, S_R_J) of map and reduce slots that can be allocated to job J so that it finishes within time T. Step I: Create a compact job profile that reflects all phases of the job: the map, shuffle/sort, and reduce phases. Map stage: (Mmin, Mavg, Mmax, AvgSizeM_input, SelectivityM) Shuffle stage: (Sh1_avg, Sh1_max, ShTyp_avg, ShTyp_max) Reduce stage: (Rmin, Ravg, SelectivityR) Step II: There are three design choices with respect to the completion time: 1) T is targeted as a lower bound on the job completion time. Typically, this leads to the least amount of resources allocated to the job for finishing within deadline T. The lower bound corresponds to an ideal computation under the allocated resources and is rarely achievable in real environments. 2) T is targeted as an upper bound on the job completion time. Typically, this leads to a more aggressive resource allocation and may yield a completion time much smaller than T, because worst-case scenarios are also rare in production settings. 3) T is targeted as the average of the lower and upper bounds on job completion time. This more balanced resource allocation may provide a solution that enables the job to complete within time T.
  • 12. MATHEMATICAL MODEL – MAKESPAN Makespan theorem: the makespan of the greedy assignment of n tasks to k slots is at least n*avg/k and at most (n − 1)*avg/k + max. Suppose the dataset is partitioned into N_M_J map tasks and N_R_J reduce tasks, and let S_M_J and S_R_J be the numbers of map and reduce slots. By the theorem, the lower and upper bounds on the duration of the entire map stage (denoted T_M_low and T_M_up) are estimated as: T_M_low = N_M_J * Mavg / S_M_J T_M_up = (N_M_J − 1) * Mavg / S_M_J + Mmax T_Sh_low = (N_R_J / S_R_J − 1) * ShTyp_avg T_Sh_up = ((N_R_J − 1) / S_R_J − 1) * ShTyp_avg + ShTyp_max The job bounds combine the stages, with the first shuffle wave Sh1_avg counted separately: T_J_low = T_M_low + Sh1_avg + T_Sh_low + T_R_low T_J_up = T_M_up + Sh1_avg + T_Sh_up + T_R_up Expanding the lower bound (with T_R_low = N_R_J * Ravg / S_R_J): T_J_low = N_M_J*Mavg / S_M_J + N_R_J*(ShTyp_avg + Ravg) / S_R_J + Sh1_avg − ShTyp_avg = A_J_low * N_M_J / S_M_J + B_J_low * N_R_J / S_R_J + C_J_low where A_J_low = Mavg, B_J_low = ShTyp_avg + Ravg, C_J_low = Sh1_avg − ShTyp_avg. Taking T_J_low as T (the expected completion time): T = A_J_low * N_M_J / S_M_J + B_J_low * N_R_J / S_R_J + C_J_low
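The bound T_J_low and the map-stage upper bound can be transcribed directly from the formulas above; the variable names below are illustrative, not from an existing codebase.

```java
// Completion-time bounds from the compact job profile (illustrative names).
public class MakespanBounds {

    // T_J_low = A * N_M / S_M + B * N_R / S_R + C, where
    // A = Mavg, B = ShTypAvg + Ravg, C = Sh1Avg - ShTypAvg.
    static double jobLowerBound(int nMap, int nRed, int sMap, int sRed,
                                double mAvg, double rAvg,
                                double sh1Avg, double shTypAvg) {
        double A = mAvg;
        double B = shTypAvg + rAvg;
        double C = sh1Avg - shTypAvg;
        return A * nMap / sMap + B * nRed / sRed + C;
    }

    // Map-stage upper bound from the Makespan Theorem:
    // T_M_up = (N_M - 1) * Mavg / S_M + Mmax.
    static double mapUpperBound(int nMap, int sMap, double mAvg, double mMax) {
        return (nMap - 1) * mAvg / sMap + mMax;
    }
}
```

With e.g. 10 map tasks on 2 slots and an average map time of 4, the upper bound is 9*4/2 + Mmax, matching the theorem term by term.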
  • 13. In the algorithm, T is targeted as a lower bound of the job completion time. The algorithm sweeps through the entire range of map slot allocations and finds the corresponding number of reduce slots needed to complete the job within time T. Resource allocation algorithm Input: job profile of J; (N_M_J, N_R_J) ← number of map and reduce tasks of J; (S_M, S_R) ← total number of map and reduce slots in the cluster; T ← deadline by which the job must be completed. Output: P ← set of plausible resource allocations (S_M_J, S_R_J). Algorithm: for S_M_J ← MIN(N_M_J, S_M) down to 1 do Solve A_J_low·N_M_J / S_M_J + B_J_low·N_R_J / S_R_J = T − C_J_low for S_R_J if 0 < S_R_J ≤ S_R then P ← P ∪ (S_M_J, S_R_J) else // Job cannot be completed within deadline T with the allocated map slots Break out of the loop end if end for The complexity of the proposed algorithm is O(min(N_M_J, S_M)) and thus linear in the number of map slots.
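A minimal sketch of the sweep, assuming the lower-bound model T = A·N_M/S_M + B·N_R/S_R + C from the previous slide; names are illustrative, and the reduce-slot count is obtained by solving for S_R and rounding up to the next integer.

```java
import java.util.*;

// For each feasible number of map slots (largest first), solve the
// lower-bound equation for the reduce slots and keep the (S_M, S_R) pairs
// that fit within the cluster; stop once the deadline cannot be met.
public class AllocationSweep {

    static List<int[]> plausibleAllocations(int nMap, int nRed,
                                            int clusterMapSlots, int clusterRedSlots,
                                            double A, double B, double C, double T) {
        List<int[]> result = new ArrayList<>();
        for (int sM = Math.min(nMap, clusterMapSlots); sM >= 1; sM--) {
            double remaining = T - C - A * nMap / sM;   // time left for reduce side
            if (remaining <= 0) break;                  // map stage alone exceeds T
            int sR = (int) Math.ceil(B * nRed / remaining);
            if (sR >= 1 && sR <= clusterRedSlots) {
                result.add(new int[]{sM, sR});
            } else if (sR > clusterRedSlots) {
                break;  // needs more reduce slots than the cluster has
            }
        }
        return result;
    }
}
```

Since shrinking the map allocation only increases the reduce slots required, breaking out of the loop on the first infeasible point is safe, which is what keeps the sweep linear.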
  • 14. 3. PRIORITY ALGORITHM  Workflow Priority o prioritizes entire workflows o increase spending on all workflows that are more important and drop spending on less important workflows o Importance may be implied by proximity to deadline, current demand of anticipated output or whether the application is in a test or production phase.  Stage Priority o Prioritizes different stages of a single workflow o system splits a budget according to user-defined weights o budget is split within the workflow across the different stages o Spending more on phases where resources are more critical, the overall utility of the workflow may be increased
  • 15. MATHEMATICAL MODEL  Workflow priority o Suppose we have n workflows with weight vector w = [w1, w2, ..., wn]. o The total weight of the job is W = w1 + w2 + ... + wn. o The budget for workflow i is bwi = bs * wi / W, where bs is the total budget of the job.  Stage priority o Suppose we have m stages with weight vector sw = [sw1, sw2, ..., swm]. o The total weight of the workflow is SW = sw1 + sw2 + ... + swm. o The budget for stage i is bswi = bw * swi / SW, where bw is the total budget of the workflow.
  • 16. ALGORITHM 1. Consider a job with n workflows, each consisting of m stages. 2. Users are asked to input the total budget, workflow priorities, and stage priorities. 3. Low priority has value 1 and high priority has value 0.5, so as to spend double on high-priority work. 4. Calculate the budget for each workflow: bwi = bs * wi / W. 5. Use bwi to find the resource share of the workflow. 6. Calculate the budget for each stage: bswi = bw * swi / SW. 7. Use bswi to find the resource share of the stage. 8. A high-priority workflow or stage is given more cost and time for execution, and thus has a higher spending rate, i.e., a higher b/d ratio.
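The proportional rule from steps 4 and 6 can be sketched as a single helper, since the same split is applied twice: the job budget across workflows, then each workflow budget across its stages. This is an illustrative fragment, not code from an existing system.

```java
// Budget split by user-assigned weights: b_i = total * w_i / sum(w).
public class PriorityBudget {

    static double[] split(double totalBudget, double[] weights) {
        double sum = 0.0;
        for (double w : weights) sum += w;       // W (or SW for stages)
        double[] budgets = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            budgets[i] = totalBudget * weights[i] / sum;
        }
        return budgets;
    }
}
```

For a stage split, the same call is made with the workflow's budget bwi as the total and the stage weights sw as the vector.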
  • 17. SKEW MITIGATION  To support parallelism, partitions must be small enough that several can be processed in parallel. To avoid record skew, a partitioning function is selected that keeps each partition roughly the same size.  On each node, the map operation is applied to a prefix of the records in each input file stored on that node.  As the map function produces records, the node records information about the intermediate data, such as how much larger or smaller it is than the input and the number of records generated. It also stores information about each intermediate key and the size of its associated record.  It sends this metadata to the coordinator, which merges the metadata from all nodes to estimate the intermediate data size. The coordinator then uses this size, and the desired partition size, to compute the number of partitions.  Then it performs a streaming merge-sort on the samples from each node. Once all the sampled data is sorted, partition boundaries are calculated based on the desired partition sizes. The result is a list of "boundary keys" that define the edges of each partition.
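The coordinator's final step, deriving boundary keys from the merged sorted sample, can be sketched as follows. This is an illustrative simplification (even spacing over an in-memory sorted sample), not the actual streaming implementation.

```java
import java.util.*;

// Pick numPartitions - 1 evenly spaced keys out of a sorted sample of
// intermediate keys; these become the edges of the partitions.
public class PartitionPlanner {

    static <K> List<K> boundaryKeys(List<K> sortedSample, int numPartitions) {
        List<K> boundaries = new ArrayList<>();
        for (int p = 1; p < numPartitions; p++) {
            int idx = p * sortedSample.size() / numPartitions;  // p-th quantile index
            boundaries.add(sortedSample.get(idx));
        }
        return boundaries;
    }
}
```

Because the sample approximates the key distribution, quantile-based boundaries yield partitions of roughly equal record volume even when the key space itself is skewed.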
  • 18. BOTTLENECK REMOVAL  A map-reduce system can simultaneously run multiple jobs competing for the node’s resources and traffic bandwidth.  These conflicts cause slowdown in the execution of tasks. The duration of each phase, and hence the duration of the job is determined by the slowest, or straggler task.  The slowdowns of individual tasks are highly correlated with overall job latencies.  However, significant task slowdowns tend to indicate bottlenecks in job execution as well.
  • 19. MATHEMATICAL MODEL Bottleneck detection  Te_i is the expected execution time of task i.  Tr_i is the running time of task i.  Te_i > Tr_i means no bottleneck.  Tr_i − Te_i > t means a bottleneck is present, where t is a threshold derived from past data: if a task has been running for more than t beyond its expected time, a bottleneck is detected. Bottleneck elimination  ni = number of idle nodes, na = number of active nodes, f = boost factor.  To reduce the bottleneck, we distribute tasks such that the mean spending equals the average spending, i.e., b/d.  Spending at an active node = (b/d) * (1 + (ni/na) * f)  Spending at an idle node = (b/d) * (1 − f)  E = na/(na+ni) * (b/d) * (1 + (ni/na) * f) + ni/(na+ni) * (b/d) * (1 − f) = b/((na+ni)*d) * (na + ni*f + ni − ni*f) = b/((na+ni)*d) * (na + ni) = b/d = avg. spending
  • 20. ALGORITHM  Bottleneck avoidance Step 1: Compute task and node features 1. Run the task on the cloud. 2. Collect performance traces every 10 minutes and store the results in a file. Step 2: Compute the slowdown factor 1. Compare the current job trace with already completed jobs. 2. Calculate the slowdown factor, which is the ratio of the current job's parameters to those of a similar job. Step 3: Give the slowdown factor of each job to the scheduler 1. The scheduler schedules jobs with a high slowdown factor first. 2. The scheduler does not schedule high-slowdown jobs onto congested hardware nodes.  Bottleneck detection Step 1: Estimate the execution time of each job using historical data. Step 2: Periodically compute the time for which the job has been running. Step 3: Compare the expected execution time and the running time 1. If Te_i > Tr_i, there is no bottleneck. 2. Else if Tr_i − Te_i > t, a bottleneck has occurred.
  • 21.  Bottleneck elimination To reduce execution time, we can run a bottleneck-elimination algorithm that schedules redundant copies of the remaining tasks on nodes that have no other work to perform. Bottleneck elimination algorithm 1. idle ← GETIDLENODES(nodes) 2. active ← nodes − idle 3. ni ← SIZE(idle) 4. na ← SIZE(active) 5. for each node ∈ active: node.spending ← b/d ∗ (1 + (ni/na) ∗ f) 6. for each node ∈ idle: node.spending ← b/d ∗ (1 − f) where f is a boost factor between 0 and 1 set by the user, b is the budget, and d is the duration.
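The spending rules in the pseudocode above can be transcribed directly; the helper below also checks the property derived on slide 19, that the mean spending over all nodes stays at b/d. Names are illustrative.

```java
// Spending assignment for bottleneck elimination: boost active nodes,
// cut idle ones, keeping the mean spending across all nodes at b/d.
public class BottleneckElimination {

    // Returns {activeSpending, idleSpending} for budget b, duration d,
    // na active nodes, ni idle nodes, and boost factor 0 <= f <= 1.
    static double[] spending(double b, double d, int na, int ni, double f) {
        double base = b / d;
        double active = base * (1 + ((double) ni / na) * f);
        double idle = base * (1 - f);
        return new double[]{active, idle};
    }

    // Mean spending over all nodes; equals b/d by the derivation above,
    // since na*(1 + (ni/na)f) + ni*(1 - f) = na + ni.
    static double meanSpending(double b, double d, int na, int ni, double f) {
        double[] s = spending(b, d, na, ni, f);
        return (na * s[0] + ni * s[1]) / (na + ni);
    }
}
```

With b = 100, d = 10, 3 active nodes, 2 idle nodes and f = 0.5, active nodes spend 10·(4/3) and idle nodes spend 5, and the mean works out to exactly b/d = 10.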
  • 22. DEADLOCK A deadlock may occur between mappers and reducers, with no progress in the job, when:  The initially available map/reduce slots are allocated to mappers.  Once a few mappers complete, reducers start occupying some of the slots.  After a while, all slots are occupied by reducers.  Since some mapper tasks have not yet been assigned a slot, the map phase never completes.  The system enters a deadlock state where reducers occupy all available slots but are waiting for mappers to complete, while mappers cannot move forward because no slot is available. Deadlock prevention: Unlike existing MapReduce systems, which execute map and reduce tasks concurrently in waves, we can implement the MapReduce programming model in two phases of operation:  Phase 1: Map and shuffle The Reader stage reads records from an input disk and sends them to the Mapper stage, which applies the map function to each record. As the map function produces intermediate records, each record's key is hashed to determine the node to which it should be sent, and the record is placed in a per-destination buffer that is handed to the sender when it is full.
  • 23.  Phase 2: Sort and reduce In phase two, each partition is sorted by key, and the reduce function is applied to groups of records with the same key. Deadlock detection:  The deadlock detector periodically probes workers to see if they are waiting for a memory allocation request to complete.  If multiple probe cycles pass in which all workers are waiting for an allocation or are idle, the deadlock detector informs the memory allocator that a deadlock has occurred. Deadlock elimination  Process termination: One or more processes involved in the deadlock may be aborted. We can choose to abort all processes involved in the deadlock, which resolves it with certainty and speed.  Resource preemption: Resources allocated to various processes may be successively preempted and allocated to other processes until the deadlock is broken.
  • 24. IMPLEMENTATION FRAMEWORK  Apache Hadoop is an open-source implementation of the MapReduce programming model, supported by Yahoo! and used by companies such as Google and Amazon.  It also includes the underlying Hadoop Distributed File System (HDFS).  Hadoop has over 180 configuration parameters. Examples include the number of replicas of input data, the number of parallel map/reduce tasks to run, and the number of parallel connections for transferring data.  A Hadoop installation comes with a default value for every parameter in its configuration.  Scheduling in Hadoop is performed by a master node.  Hadoop has a variety of schedulers. The original one schedules all jobs using a FIFO queue in the master. Another, Hadoop on Demand (HOD), creates private MapReduce clusters dynamically and manages them using the Torque batch scheduler.
  • 25. CHALLENGES IN MAPREDUCE SIMULATIONS  The right level of abstraction.  Data layout aware.  Resource contention aware.  Heterogeneity modeling.  Resource heterogeneity is common in large clusters.  Input dependence.  Workload aware.  Verification.  Performance
  • 26. COMPARISON OF MAPREDUCE SIMULATORS
  Simulator      | Based on     | Language | GUI support | Workload-aware | Resource-contention aware
  MRPerf         | ns-2         | Java     | Yes         | Yes            | Yes
  Cardona et al. | GridSim      | C        | No          | Yes            | No
  Mumak          | Hadoop       | C        | No          | Yes            | No
  SimMR          | From scratch | -        | -           | Yes            | No
  HSim           | From scratch | -        | -           | No             | Yes
  MRSim          | GridSim      | Java     | Yes         | No             | Yes
  SimMapReduce   | GridSim      | Java     | Yes         | No             | Yes
  • 27.  Prior simulators for evaluating schedulers are trace-driven and aware of other jobs in a workload, but they are not aware of resource contention, so simulated task execution times may be inaccurate. Our algorithm optimizes resource provisioning, so we require a resource-contention-aware simulator.  It is almost impractical to set up a very large cluster consisting of hundreds or thousands of nodes to measure the scalability of an algorithm, and setting up a Hadoop environment involves altering a great number of parameters that are crucial to achieving the best performance. An obvious solution to both problems is a simulator of the Hadoop environment: it allows us to measure the scalability of MapReduce-based applications easily and quickly, and it lets us determine the effect of different Hadoop configurations on the behavior of MapReduce-based applications in terms of speed.
  • 28.  MRPerf is implemented on top of ns-2, a packet-level network simulator, and its performance is much worse than that of other simulators. It cannot generate accurate results for jobs with different types of algorithms or different cluster configurations.  No implementation of HSim is publicly available, so using it would require starting from scratch.  Most current work in cloud computing uses the CloudSim simulator, but since our problem entails the MapReduce model and CloudSim provides no MapReduce support, we are not using it.  MRSim extends the SimJava discrete event engine to accurately simulate the Hadoop environment. Using SimJava, we simulate interactions between different entities within the cluster. The GridSim package is also used for network simulation. MRSim is written in Java on top of SimJava.
  • 30.  MRSim simulates network topology and traffic using GridSim and models the rest of the system entities using the SimJava discrete event engine. The system is designed using object-oriented models.  Each machine is part of the network topology model. Each machine can host a Job Tracker process and a Task Tracker process; however, there is only one Job Tracker per MapReduce cluster. Each Task Tracker model can launch several map and reduce tasks, up to the maximum number allowed in the configuration files.
  • 31. WHAT IS SIMJAVA?  SimJava is a discrete event, process oriented simulation package. It is an API that augments Java with building blocks for defining and running simulations.  Each system is considered to be a set of interacting processes or entities as they are referred to in SimJava. These entities communicate with each other by passing events. The simulation time progresses on the basis of these events.  Progress is recorded as trace messages and saved in a file.  As of version 2.0, SimJava has been augmented with considerable statistical and reporting support.
  • 32. CONSTRUCTING A SIMULATION INVOLVES:  Coding the behavior of simulation entities, done by extending the sim_entity class and using the body() method.  Adding instances of these entities to the sim_system object using sim_system.add(entity).  Linking entities' ports together using sim_system.link_ports().  Finally, setting the simulation in motion using sim_system.run().
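Running the real API above requires the SimJava library. As a self-contained stand-in, the minimal discrete-event loop below mirrors the same structure: entities exchange timestamped events, the clock jumps from event to event, and each processed event is recorded as a trace message. It is an analogue for illustration, not the SimJava API.

```java
import java.util.*;

// Minimal discrete-event engine: events are ordered by timestamp in a
// priority queue; run() drains them, advancing the simulation clock and
// appending one trace line per event (as SimJava records trace messages).
public class MiniSim {
    static class Event implements Comparable<Event> {
        final double time; final String target; final String tag;
        Event(double time, String target, String tag) {
            this.time = time; this.target = target; this.tag = tag;
        }
        public int compareTo(Event o) { return Double.compare(time, o.time); }
    }

    final PriorityQueue<Event> queue = new PriorityQueue<>();
    final List<String> trace = new ArrayList<>();

    void schedule(double time, String target, String tag) {
        queue.add(new Event(time, target, tag));
    }

    // Process all events in timestamp order; returns the final clock value.
    double run() {
        double clock = 0.0;
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            clock = e.time;
            trace.add(clock + " " + e.target + " " + e.tag);
        }
        return clock;
    }
}
```

Scheduling events out of order still produces a time-ordered trace, which is the essential property the SimJava kernel provides to its entities.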
  • 33. GRIDSIM  Allows modelling and simulation of entities in parallel and distributed computing (PDC) systems (users, applications, resources, and resource brokers/schedulers) for the design and evaluation of scheduling algorithms.  Provides a comprehensive facility for creating different classes of heterogeneous resources that can be aggregated using resource brokers for solving compute- and data-intensive applications. A resource can be a single processor or a multi-processor with shared or distributed memory, managed by time- or space-shared schedulers. The processing nodes within a resource can be heterogeneous in terms of processing capability, configuration, and availability. The resource brokers use scheduling algorithms or policies for mapping jobs to resources to optimize system or user objectives, depending on their goals.
  • 34. JACKSON MODEL The Jackson API contains extensive functionality for reading and building JSON in Java. It has powerful data binding capabilities and provides a framework to serialize custom Java objects to JSON strings and to deserialize JSON strings back to Java objects.  JSON written with Jackson can contain embedded class information that helps in creating the complete object tree during deserialization.
  • 35. JACKSON API
  // 1. Convert a Java object to JSON
  ObjectMapper mapper = new ObjectMapper();
  mapper.writeValue(new File("c:\\user.json"), user);
  // 2. Convert JSON back to a Java object
  ObjectMapper mapper = new ObjectMapper();
  User user = mapper.readValue(new File("c:\\user.json"), User.class);
  • 37.  The main component of the simulator is the Job Tracker, which controls the generation of map and reduce tasks, monitors when the different phases complete, and produces the final results.  A map task is started by the Job Tracker; the following steps take place: • A Java VM is instantiated for the task. • Data is read from the local disk or requested remotely. • Map, sort, and spill operations are performed on the input data until all of it has been consumed. • Background file system mergers merge the output data to reduce the number of output files to one or a few. • A message indicating completion of the map task is returned to the Job Tracker.
  • 39. COMPARISON PARAMETERS  Number of map and reduce slots  CPU Usage  Hard-disk Utilization  Average Mapper Time  Average Reducer Time  Execution Time
  • 40. JOB PROFILES Referred from: A. Verma, L. Cherkasova, and R. H. Campbell, "Resource Provisioning Framework for MapReduce Jobs with Performance Goals".
  • 41. TIME DURATION FOR DIFFERENT PHASES
  Profile                       | Maps,Reduces | Method       | T1   | T2   | T3
  Profile1                      | 7,10         | SLO          | 1398 | 1344 | 1357
                                |              | SIGN + PRIOR | 1209 | 1207 | 1217
  Profile2                      | 7,10         | SLO          | 1367 | 1368 | 1387
                                |              | SIGN + PRIOR | 1276 | 1256 | 1273
  Profile3                      | 3,12         | SLO          | 1397 | 1380 | 1363
                                |              | SIGN + PRIOR | 1245 | 1288 | 1253
  Profile4                      | 12,16        | SLO          | 1320 | 1402 | 1409
                                |              | SIGN + PRIOR | 1263 | 1285 | 1207
  Profile5                      | 46,14        | SLO          | 1316 | 1368 | 1353
                                |              | SIGN + PRIOR | 1208 | 1254 | 1256
  Profile6                      | 12,2         | SLO          | 1342 | 1376 | 1332
                                |              | SIGN + PRIOR | 1267 | 1265 | 1287
  Profile7 (job can't complete) | 22,33        | SLO          | 472  | 450  | 430
                                |              | SIGN + PRIOR | 0    | 0    | 0
  Profile8                      | 16,12        | SLO          | 1327 | 1396 | 1376
                                |              | SIGN + PRIOR | 1233 | 1265 | 1274
  • 42. MEAN TIME OVERHEADS FOR VARIOUS PHASES
  SLO failed (job can't be completed within deadline) | 420
  SLO executed                                        | 1334
  Signature not found                                 | 1337
  Signature found                                     | 937
  Priority                                            | 331
  • 43. COMPARISON OF BASE ALGORITHM VS PROPOSED ALGORITHM
  Profile   | Mappers | Reducers | Algorithm | CPU usage   | HDD utilization | Time | Avg mapper time | Avg reducer time
  Profile 1 | 60      | 1        | Base      | 0.00001429  | 0.00105   | 1919 | 28.021   | 238.179
            |         |          | Proposed  | 0.0000020   | 0.00403   | 2372 | 25.313   | 853.76
  Profile 2 | 7       | 10       | Base      | 0.000001653 | 0.001834  | 5200 | 291.21   | 316.163
            |         |          | Proposed  | 0.0002732   | 0.003917  | 4095 | 283.891  | 112.045
  Profile 3 | 7       | 10       | Base      | 0.000003592 | 0.0031320 | 3044 | 314.459  | 84.322
            |         |          | Proposed  | 0.00913784  | 0.01550   | 4108 | 281.432  | 114.249
  Profile 4 | 3       | 12       | Base      | 0.0000023   | 0.03093   | 4259 | 1143.458 | 69.098
            |         |          | Proposed  | 0.0008095   | 0.01197   | 4066 | 425.292  | 108.949
  • 44. CONTD.
  Profile 5 | 12      | 16       | Base      | 0.000015307 | 0.002802  | 5239 | 164.185  | 204.315
            |         |          | Proposed  | 0.001846    | 0.022107  | 4240 | 286.6    | 124.45
  Profile 6 | 46      | 14       | Base      | 0.000036771 | 0.0024045 | 4163 | 426.536  | 117.796
            |         |          | Proposed  | 0.0010386   | 0.01082   | 3171 | 44.416   | 105.881
  Profile 7 | 12      | 2        | Base      | 0.00021723  | 0.005321  | 3986 | 205.405  | 137.099
            |         |          | Proposed  | 0.0003971   | 0.007538  | 2739 | 426.411  | 100.124
  Profile 8 | 16      | 12       | Base      | 0.00010813  | 0.0028452 | 4136 | 426.987  | 75.338
            |         |          | Proposed  | 0.00478452  | 0.0093604 | 2863 | 122.479  | 114.748
  • 46. HARD-DISK UTILIZATION [Bar chart: hard-disk utilization (y-axis, 0 to 0.035) for the base algorithm vs. the proposed algorithm across job profiles]
  • 50. RESULTS FOR JOB PROFILE 1
  • 52. RESULTS FOR JOB PROFILE 2
  • 54. TRACE FOR EXECUTION  INFO GUISimulator:114 - <init>- done  Initialising...  INFO HTopology:112 - initGridSim- Initializing GridSim package  Initialising...  INFO HSimulator:64 - initSimulator- creat new Result dir /home/hadoop/workspace/work/hadoop.simulator/results/26-27- Apr-2010 19:57:55  INFO HJobTracker:311 - createEntities- create topology  INFO HJobTracker:314 - createEntities- config.Heartbeat:1.0, read topology.getName:rack 0  INFO HJobTracker:318 - createEntities- init NetEnd from rack  INFO GUISimulator:389 - mnuSimStartActionPerformed- simulator has started simulator  INFO HSimulator:106 - startSimulator- Starting simulator version  INFO HSimulator:117 - startSimulator- trace level200  INFO HSimulator:120 - startSimulator- graph file: /home/hadoop/workspace/work/hadoop.simulator/results/26-27-Apr- 2010 19:57:55/graph.sjg  INFO HSimulator:125 - startSimulator- going to call Sim_system.run()  Entities started.  Entity huser has no body().  INFO HJobTracker:129 - body- start entity  INFO SimoTreeCollector:94 - body- add rack {m1=m1}  INFO GUISimulator:394 - mnuSimStopActionPerformed- going to stop simulator  INFO HTopology:252 - stopSimulation- Stopping NetEnd Simulation
  • 55. TRACE CONTINUED…  INFO HJobTracker:622 - stopSimulation- send end of simualtion 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0  INFO CPU:148 - body- cpu_m1 END_OF_SIMULATION 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0  INFO HDD:148 - body- hdd_m1 END_OF_SIMULATION 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0  INFO NetEnd:100 - body- m1 end simulation at time 10.0  INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0  INFO SimoTreeCollector:78 - body- simotree END_OF_SIMULATION 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0
  • 56. OUTPUT SNAPSHOTS FOR PROPOSED ALGORITHM
  • 60. REFERENCES [1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting execution bottlenecks in map-reduce clusters", in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012. [2] R. Buyya, S. K. Garg, and R. N. Calheiros, "SLA-Oriented Resource Provisioning for Cloud Computing: Challenges, Architecture, and Solutions", in International Conference on Cloud and Service Computing, 2011. [3] S. Chaisiri, B.-S. Lee, and D. Niyato, "Optimization of Resource Provisioning Cost in Cloud Computing", IEEE Transactions on Services Computing, vol. 5, no. 2, Apr.–Jun. 2012. [4] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource Provisioning Framework for MapReduce Jobs with Performance Goals", in Middleware 2011, LNCS 7049, pp. 165–186, 2011. [5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Jan. 2008. [6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, "Resource Provisioning for Cloud Computing", in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009. [7] K. Kambatla, A. Pathak, and H. Pucha, "Towards Optimizing Hadoop Provisioning in the Cloud", in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009. [8] S. O. Kuyoro, F. Ibikunle, and O. Awodele, "Cloud Computing Security Issues and Challenges", International Journal of Computer Networks (IJCN), vol. 3, issue 5, 2011.
  • 61. [9] R. Lammel, "Google's MapReduce Programming Model – Revisited", Science of Computer Programming, Oct. 2007. [10] R. P. Padhy, "Big Data Processing with Hadoop-MapReduce in Cloud Systems", International Journal of Cloud Computing and Services Science, vol. 2, Feb. 2013. [11] B. Palanisamy, A. Singh, L. Liu, and B. Langston, "Cura: A Cost-Optimized Model for MapReduce in a Cloud", in Proc. of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2013). [12] A. Rasmussen, M. Conley, R. Kapoor, V. T. Lam, G. Porter, and A. Vahdat, "Themis: An I/O-Efficient MapReduce", Communications of the ACM, Oct. 2012. [13] V. K. Reddy, B. T. Rao, L. S. S. Reddy, and P. S. Kiran, "Research Issues in Cloud Computing", Global Journal of Computer Science and Technology, vol. 11, Jul. 2011. [14] T. Sandholm and K. Lai, "MapReduce Optimization Using Regulated Dynamic Prioritization", Social Computing Laboratory, Hewlett-Packard Laboratories, 2011. [15] F. Tian and K. Chen, "Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds", in Proc. of the 4th International Conference on Cloud Computing, IEEE, 2011. [16] Hadoop. http://hadoop.apache.org. [17] Amazon Elastic MapReduce. http://aws.amazon.com/elasticmapreduce/.