OPTIMAL RESOURCE
PROVISIONING FOR RUNNING
MAPREDUCE PROGRAMS IN
THE CLOUD
Presented By:
Group Id: 29
Priyanka Sangtani
Anshul Aggarwal
Pooja Jain
PROBLEM STATEMENT
The problem at hand is defining a resource provisioning
framework for MapReduce jobs running in a cloud, keeping in
mind performance goals such as resource utilization with
- an optimal number of map and reduce slots
- improvements in execution time
- a highly scalable solution
This is a design issue in the software frameworks available in
the cloud. Traditional provisioning frameworks provide users
with defaults that do not lend themselves well to MapReduce jobs.
Such jobs are highly parallelizable, and our proposed algorithm
exploits this fact to provide highly optimized resource
provisioning suited to MapReduce.
MAPREDUCE OVERVIEW
 In a typical MapReduce framework, data are
divided into blocks and distributed across many
nodes in a cluster and the MapReduce framework
takes advantage of data locality by shipping
computation to data rather than moving data to
where it is processed.
 Most input data blocks to MapReduce applications
are located on the local node, so they can be
loaded very fast and reading multiple blocks can be
done on multiple nodes in parallel.
 Therefore, MapReduce can achieve very high
aggregate I/O bandwidth and data processing rate.
WHY MAPREDUCE OPTIMIZATION
 The MapReduce programming paradigm lends itself well to most
data-intensive analytics jobs, given its ability to scale out and
leverage several machines to process data in parallel.
 Research has demonstrated that existing approaches to
provisioning other applications in the cloud are not immediately
applicable to MapReduce-based applications.
 MapReduce jobs have over 180 configuration parameters. Setting
a parameter too high can cause resource contention and degrade
overall performance; setting it too low might under-utilize the
resources and, once again, reduce performance.
 Each application has a different bottleneck resource
(CPU : disk : network), and a different bottleneck resource
utilization, and thus needs a different combination of these
parameters so that the bottleneck resource is maximally utilized.
WORK FLOW OF PROPOSED SOLUTION
User application → Signature Matching Algorithm (against a database of stored signatures)
- Match found (Yes) → reuse the stored optimal configuration
- No match (No) → SLO-Based Provisioning
→ Priority Algorithm → Bottleneck Removal
The Resource Provisioning Framework outputs the optimal no. of map / reduce slots.
PROPOSED ALGORITHM
1. Signature Matching
A sample of the input is run on the cloud to generate a resource consumption signature. This
signature is matched against a database. If a match is found, we can reuse the optimal configuration
stored for the matched signature; otherwise we move to SLO-based provisioning.
2. SLO-Based Resource Provisioning
Based on the number of map and reduce tasks, the available slots and the time constraints, we
calculate the optimal number of map and reduce tasks to run in parallel.
3. Priority Assignment
To give users better control over provisioning, we assign priorities in this stage.
4. Skew Mitigation
Managing parallel partitions.
5. Bottleneck Removal
The most common problem in parallel computation is bottlenecks.
6. Deadlock Detection and Removal
This stage deals with deadlock removal to improve execution time.
1 . SIGNATURE MATCHING
MATHEMATICAL MODEL
 The entire job run is split into n (a pre-chosen number) intervals,
each of the same duration.
 For the ith interval, compute the average consumption of each rth
resource. The resource types (us, sy, wa, id, bi, bo, ni, no, sr) are
% of CPU in user time, system time, waiting time and idle time,
disk blocks in, disk blocks out, network in, network out, and slow
ratio, respectively.
 Generate a resource consumption signature set for every rth
resource as
Srm = {Srm1, Srm2, ..., Srmn}
 The signature distance between a generated signature and a
signature from the database is computed as

χ²(S^R1_m, S^R2_m) = Σ_{i=1}^{n} (S^R1_mi − S^R2_mi)² / (S^R1_mi + S^R2_mi)

 χ² represents the vector distance between two signatures for a
particular resource r in the time-interval vector space. We compute
the scalar sum of χ² over all the resource types. A lower value of the
sum of χ² indicates more similar signatures. We choose the
configuration of the application whose signature distance sum is
closest to the new application's.
ALGORITHM
1. Take a sample input IS of appropriate size from the actual input.
2. Take a resource set RS.
3. Take the signature database, with average distance between signatures DAVG.
4. Split the entire job run into n (a pre-chosen number) intervals, each of the
same duration.
5. For each resource type in (us, sy, wa, id, bi, bo, ni, no, sr):
6.     For the ith interval from 1 to n:
7.         Compute the average resource consumption, generating a resource
consumption signature set Srm = {Srm1, Srm2, ..., Srmn} for every rth resource.
8. Set min_distance to a large sentinel value (e.g. 10000).
9. For every signature S in the database:
10.     Find the distance D between the calculated signature and S.
11.     If D < min_distance, set min_distance = D and Signature_matched = S.
12. Set a precision value P.
13. If min_distance > P * DAVG, return "no match found".
14. Else return Signature_matched.
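As a minimal sketch of steps 5–14 above (class and method names are illustrative, not from any existing implementation), the χ² distance and the nearest-signature scan can be written as:

```java
import java.util.List;

public class SignatureMatcher {
    // chi^2 distance between two per-resource signatures over n intervals:
    // sum over i of (s1[i] - s2[i])^2 / (s1[i] + s2[i])
    public static double chiSquare(double[] s1, double[] s2) {
        double sum = 0.0;
        for (int i = 0; i < s1.length; i++) {
            double denom = s1[i] + s2[i];
            if (denom > 0) sum += (s1[i] - s2[i]) * (s1[i] - s2[i]) / denom;
        }
        return sum;
    }

    // Total distance: scalar sum of chi^2 over all resource types (us, sy, wa, ...).
    public static double distance(double[][] a, double[][] b) {
        double total = 0.0;
        for (int r = 0; r < a.length; r++) total += chiSquare(a[r], b[r]);
        return total;
    }

    // Steps 8-14: scan the database for the closest signature; return its index,
    // or -1 when even the best distance exceeds the precision threshold P * DAVG.
    public static int match(double[][] probe, List<double[][]> database,
                            double p, double dAvg) {
        double minDistance = Double.MAX_VALUE;
        int matched = -1;
        for (int i = 0; i < database.size(); i++) {
            double d = distance(probe, database.get(i));
            if (d < minDistance) { minDistance = d; matched = i; }
        }
        return (minDistance > p * dAvg) ? -1 : matched;
    }
}
```

The χ² form keeps each per-interval term scale-free, so resources with large absolute values (e.g. disk blocks) do not dominate the sum.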
2. SLO-BASED PROVISIONING
Given a MapReduce job J with input dataset D, identify minimal combinations (S_M^J, S_R^J)
of map and reduce slots that can be allocated to job J so that it finishes within time T.
Step I: Create a compact job profile that reflects all phases of a given job: the map,
shuffle/sort and reduce phases.
Map stage: (M_min, M_avg, M_max, AvgSize_M^input, Selectivity_M)
Shuffle stage: (Sh_avg^1, Sh_max^1, Sh_avg^typ, Sh_max^typ)
Reduce stage: (R_min, R_avg, Selectivity_R)
Step II: There are three design choices for the completion time:
1) T is targeted as a lower bound on the job completion time. Typically, this leads to
the least amount of resources allocated to the job for finishing within deadline T.
The lower bound corresponds to an ideal computation under the allocated resources and
is rarely achievable in real environments.
2) T is targeted as an upper bound on the job completion time. Typically, this leads to a
more aggressive resource allocation and might lead to a job completion time that is
much smaller than T, because worst-case scenarios are also rare in production
settings.
3) T is targeted as the average of the lower and upper bounds on job
completion time. This more balanced resource allocation might provide a solution that
enables the job to complete within time T.
Mathematical Model
Makespan theorem: the makespan of a greedy assignment of n tasks with average duration avg
and maximum duration max onto k slots is at least n·avg/k and at most (n − 1)·avg/k + max.
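These two bounds can be computed directly; the sketch below is a hypothetical helper for n tasks with average duration avg and maximum duration max on k slots:

```java
public class MakespanBounds {
    // Lower bound: perfectly balanced greedy assignment, n * avg / k.
    public static double lower(int n, double avg, int k) {
        return n * avg / k;
    }

    // Upper bound: the longest task lands after the remaining n - 1 tasks
    // are spread evenly, (n - 1) * avg / k + max.
    public static double upper(int n, double avg, double max, int k) {
        return (n - 1) * avg / k + max;
    }
}
```

For example, 10 tasks averaging 2 s (max 3 s) on 5 slots give a lower bound of 4 s and an upper bound of 6.6 s.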
Suppose the dataset is partitioned into N_M^J map tasks and N_R^J reduce tasks. Let S_M^J and
S_R^J be the number of map and reduce slots.
By the makespan theorem, the lower and upper bounds on the duration of the entire map stage
(denoted T_M^low and T_M^up respectively) are estimated as follows:

T_M^low = N_M^J · M_avg / S_M^J
T_M^up = (N_M^J − 1) · M_avg / S_M^J + M_max

Similarly for the shuffle stage:

T_sh^low = (N_R^J / S_R^J − 1) · Sh_avg^typ
T_sh^up = ((N_R^J − 1) / S_R^J) · Sh_avg^typ + Sh_max^typ

Combining the stages (with T_R^low and T_R^up the analogous reduce-stage bounds):

T_J^low = T_M^low + Sh_avg^1 + T_sh^low + T_R^low
T_J^up = T_M^up + Sh_avg^1 + T_sh^up + T_R^up

Expanding the lower bound:

T_J^low = N_M^J · M_avg / S_M^J + N_R^J · (Sh_avg^typ + R_avg) / S_R^J + Sh_avg^1 − Sh_avg^typ
        = A_J^low · N_M^J / S_M^J + B_J^low · N_R^J / S_R^J + C_J^low

where
A_J^low = M_avg
B_J^low = Sh_avg^typ + R_avg
C_J^low = Sh_avg^1 − Sh_avg^typ

Taking T_J^low as T (the expected completion time):

T = A_J^low · N_M^J / S_M^J + B_J^low · N_R^J / S_R^J + C_J^low
In the algorithm, T is targeted as a lower bound of the job completion time. The algorithm sweeps
through the entire range of map slot allocations and finds the corresponding values of reduce slots that
are needed to complete the job within time T.
Resource allocation algorithm
Input:
    Job profile of J
    (N_M^J, N_R^J) ← number of map and reduce tasks of J
    (S_M, S_R) ← total number of map and reduce slots in the cluster
    T ← deadline by which the job must be completed
Output: P ← set of plausible resource allocations (S_M^J, S_R^J)
Algorithm:
for S_M^J ← min(N_M^J, S_M) down to 1 do
    Solve A_J^low · N_M^J / S_M^J + B_J^low · N_R^J / S_R^J = T − C_J^low for S_R^J
    if 0 < S_R^J ≤ S_R then
        P ← P ∪ (S_M^J, S_R^J)
    else
        // Job cannot be completed within deadline T
        // with the allocated map slots
        break out of the loop
    end if
end for
The complexity of the proposed algorithm is O(min(N_M^J, S_M)), i.e. linear in the number of
map slots.
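The sweep above can be sketched as follows (class and parameter names are illustrative; the solved S_R^J is rounded up to a whole number of slots):

```java
import java.util.ArrayList;
import java.util.List;

public class SloPlanner {
    // Sweep S_M^J from min(N_M^J, S_M) down to 1, solving
    // T = A*N_M/S_M + B*N_R/S_R + C for S_R at each step.
    public static List<int[]> plausibleAllocations(int nMap, int nRed,
            int clusterMapSlots, int clusterRedSlots,
            double mAvg, double shTypAvg, double rAvg,
            double sh1Avg, double deadline) {
        double a = mAvg;               // A_J^low
        double b = shTypAvg + rAvg;    // B_J^low
        double c = sh1Avg - shTypAvg;  // C_J^low
        List<int[]> plans = new ArrayList<>();
        for (int sm = Math.min(nMap, clusterMapSlots); sm >= 1; sm--) {
            // Time remaining for shuffle + reduce once the map stage is paid for.
            double budget = deadline - c - a * nMap / sm;
            if (budget <= 0) break;    // deadline unreachable with this few map slots
            int sr = (int) Math.ceil(b * nRed / budget);  // round up to whole slots
            if (sr > clusterRedSlots) break;  // and for every smaller sm as well
            plans.add(new int[]{sm, sr});
        }
        return plans;
    }
}
```

Because shrinking the map allocation only ever increases the reduce slots required, the loop can stop at the first infeasible point, which is what makes the sweep linear.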
3. PRIORITY ALGORITHM
 Workflow Priority
o prioritizes entire workflows
o increases spending on workflows that are more important
and drops spending on less important workflows
o importance may be implied by proximity to deadline, current
demand for the anticipated output, or whether the application is in a
test or production phase
 Stage Priority
o prioritizes different stages of a single workflow
o the system splits a budget according to user-defined weights
o the budget is split within the workflow across the different stages
o by spending more on phases where resources are more critical,
the overall utility of the workflow may be increased
MATHEMATICAL MODEL
 Workflow priority
o Say we have n workflows with weight vector w, i.e.
w = [w1, w2, ..., wn]
o The total weight of the job is
W = w1 + w2 + ... + wn
o The budget for workflow i is
bwi = bs * wi / W
where bs is the total budget of the job.
 Stage priority
o Say we have m stages with weight vector sw, i.e.
sw = [sw1, sw2, ..., swm]
o The total weight of the workflow is
SW = sw1 + sw2 + ... + swm
o The budget for stage i is
bswi = bw * swi / SW
where bw is the total budget of the workflow.
ALGORITHM
1. Consider a job with n workflows, each workflow
consisting of m stages.
2. Users are asked to input the total budget, the workflow
priorities and the stage priorities.
3. Low priority has value 1 and high priority has value 0.5,
so as to spend double on high priority.
4. Calculate the budget for each workflow: bwi = bs * wi / W.
5. Use bwi to find the resource share for a workflow.
6. Calculate the budget for each stage: bswi = bw * swi / SW.
7. Use bswi to find the resource share for a stage.
8. A higher-priority workflow or stage is given more cost and time
for execution; high-priority tasks thus have a higher
spending rate, i.e. a higher b/d ratio.
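Steps 4 and 6 apply the same proportional rule at both the workflow and stage level, so one helper covers both; this is a minimal sketch with an illustrative class name:

```java
public class PriorityBudget {
    // b_i = total * w_i / W, where W = sum of all weights.
    // Used with (bs, workflow weights) to get per-workflow budgets,
    // then with (bw_i, stage weights) to get per-stage budgets.
    public static double[] split(double total, double[] weights) {
        double sum = 0.0;
        for (double w : weights) sum += w;
        double[] budgets = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            budgets[i] = total * weights[i] / sum;
        }
        return budgets;
    }
}
```

For example, splitting a budget of 100 over weights [1, 0.5, 0.5] yields [50, 25, 25].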
SKEW MITIGATION
 To support parallelism, partitions must be small enough that
several partitions can be processed in parallel. To avoid record skew,
select a partitioning function that keeps each partition roughly the same size.
 On each node, we apply the map operation to a prefix of the records in
each input file stored on that node.
 As the map function produces records, the node records information
about the intermediate data, such as how much larger or smaller it is
than the input and the number of records generated. It also stores
information about each intermediate key and the associated record's
size.
 It sends that metadata to the coordinator. The coordinator merges the
metadata from all the nodes to estimate the intermediate data size.
It then uses this size, and the desired partition size, to compute the
number of partitions.
 Then it performs a streaming merge-sort on the samples from each
node. Once all the sampled data is sorted, partition boundaries are
calculated based on the desired partition sizes. The result is a list of
"boundary keys" that define the edges of each partition.
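The partition-count and boundary-key computations described above can be sketched as follows (illustrative names; assumes the coordinator already holds the globally sorted sample keys):

```java
import java.util.ArrayList;
import java.util.List;

public class SkewPartitioner {
    // Number of partitions = ceil(estimated intermediate size / desired partition size).
    public static int numPartitions(long estimatedSize, long desiredPartitionSize) {
        return (int) ((estimatedSize + desiredPartitionSize - 1) / desiredPartitionSize);
    }

    // Pick numPartitions - 1 boundary keys at evenly spaced ranks in the
    // sorted sample, so each partition covers a roughly equal share of keys.
    public static List<String> boundaryKeys(List<String> sortedSamples, int numPartitions) {
        List<String> boundaries = new ArrayList<>();
        int n = sortedSamples.size();
        for (int p = 1; p < numPartitions; p++) {
            boundaries.add(sortedSamples.get(p * n / numPartitions));
        }
        return boundaries;
    }
}
```

Picking boundaries by rank in the sample, rather than by key range, is what keeps partitions balanced even when the key distribution is skewed.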
BOTTLENECK REMOVAL
 A MapReduce system can simultaneously run multiple jobs
competing for each node's resources and network bandwidth.
 These conflicts cause slowdowns in the execution of tasks. The
duration of each phase, and hence the duration of the job, is determined
by the slowest, or straggler, task.
 The slowdowns of individual tasks are highly correlated with overall
job latencies.
 Moreover, significant task slowdowns tend to indicate bottlenecks in
job execution.
MATHEMATICAL MODEL
Bottleneck detection
 T_i^e is the expected execution time of task i.
 T_i^r is the running time of task i.
 T_i^e > T_i^r means no bottleneck.
 T_i^r − T_i^e > t means a bottleneck is present, where t is a threshold derived from past
data: if a task has been running for t longer than expected, a bottleneck is detected.
Bottleneck elimination
 n_i = number of idle nodes, n_a = number of active nodes, f = boost factor
 To reduce the bottleneck, we distribute tasks such that the total spending equals the
average spending, i.e. b/d.
 Spending at an active node = (b/d) · (1 + (n_i/n_a) · f)
 Spending at an idle node = (b/d) · (1 − f)
 Expected spending over all nodes:
E = n_a/(n_a + n_i) · (b/d) · (1 + (n_i/n_a) · f) + n_i/(n_a + n_i) · (b/d) · (1 − f)
  = b / ((n_a + n_i) · d) · (n_a + n_i·f + n_i − n_i·f)
  = b / ((n_a + n_i) · d) · (n_a + n_i)
  = b/d
  = average spending
ALGORITHM
 Bottleneck avoidance
Step 1: Compute task and node features
1. Run the task on the cloud.
2. Collect the performance traces every 10 minutes and store the results in a file.
Step 2: Compute the slowdown factor
1. Compare the current job trace with already completed jobs.
2. Calculate the slowdown factor, which is the ratio of the current job's parameters
to those of a similar job.
Step 3: Give the slowdown factor of each job to the scheduler
1. The scheduler schedules high-slowdown jobs first.
2. The scheduler does not schedule high-slowdown jobs onto congested hardware nodes.
 Bottleneck detection
Step 1: Estimate the execution time of each job using historical data.
Step 2: Periodically compute the time for which a job has been running.
Step 3: Compare the expected execution time with the running time:
1. If T_i^e > T_i^r, no bottleneck.
2. Else if T_i^r − T_i^e > t, a bottleneck has occurred.
 Bottleneck elimination
To reduce execution time we can carry out the bottleneck elimination
algorithm, which schedules redundant copies of the remaining tasks across
nodes that have no other work to perform.
Bottleneck elimination algorithm
1. idle ← GETIDLENODES(nodes)
2. active ← nodes − idle
3. ni ← SIZE(idle)
4. na ← SIZE(active)
5. for each node ∈ active:
       node.spending ← (b/d) · (1 + (ni/na) · f)
6. for each node ∈ idle:
       node.spending ← (b/d) · (1 − f)
where f is a boost factor between 0 and 1, set by the
user; b is the budget and d the duration.
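The spending rules in steps 5–6, plus a check that they average back to b/d, can be sketched as follows (illustrative names):

```java
public class BottleneckSpending {
    // Spending at an active node: (b/d) * (1 + (ni/na) * f).
    public static double activeSpending(double b, double d, int ni, int na, double f) {
        return b / d * (1 + ((double) ni / na) * f);
    }

    // Spending at an idle node: (b/d) * (1 - f).
    public static double idleSpending(double b, double d, double f) {
        return b / d * (1 - f);
    }

    // Node-weighted average of the two rules; algebraically this
    // collapses back to b/d, the average spending.
    public static double averageSpending(double b, double d, int ni, int na, double f) {
        return (na * activeSpending(b, d, ni, na, f) + ni * idleSpending(b, d, f))
                / (na + ni);
    }
}
```

The boost that active nodes receive is exactly funded by what idle nodes give up, so the budget constraint is preserved for any f in [0, 1].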
DEADLOCK
A deadlock may occur between mappers and reducers, with no progress in the job,
when:
 the initially available map/reduce slots were all allocated to mappers;
 once a few mappers completed, reducers started occupying some of the slots;
 after a while, all slots were occupied by reducers;
 since there were still mapper tasks not yet assigned any slot, the map phase never
completed;
 the system entered a deadlock state where reducers occupy all available slots but
are waiting for the mappers to complete, while mappers cannot move forward because
no slot is available.
Deadlock prevention:
Unlike existing MapReduce systems, which execute map and reduce tasks
concurrently in waves, we can implement the MapReduce programming model in two
phases of operation:
 Phase 1: Map and shuffle
The Reader stage reads records from an input disk and sends them to the Mapper
stage, which applies the map function to each record. As the map function produces
intermediate records, each record's key is hashed to determine the node to which it
should be sent, and the record is placed in a per-destination buffer that is handed to the
sender when it is full.
 Phase 2: Sort and reduce
In phase two, each partition must be sorted by key, and the reduce function must be
applied to groups of records with the same key.
Deadlock Detection:
 The deadlock detector periodically probes workers to see if they are waiting for a
memory allocation request to complete.
 If multiple probe cycles pass in which all workers are waiting for an allocation or are
idle, the deadlock detector informs the memory allocator that a deadlock has
occurred.
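A minimal sketch of this probe-counting logic (names are illustrative, not from any real memory allocator):

```java
public class DeadlockProbe {
    // One probe cycle: if every worker is waiting for a memory allocation or
    // idle, extend the run of "stuck" cycles; any active worker resets it.
    public static int update(boolean[] waitingOrIdle, int stuckCycles) {
        for (boolean w : waitingOrIdle) {
            if (!w) return 0;  // at least one worker is making progress
        }
        return stuckCycles + 1;
    }

    // Deadlock is declared only after multiple consecutive stuck cycles,
    // to avoid flagging a transient allocation stall.
    public static boolean isDeadlocked(int stuckCycles, int threshold) {
        return stuckCycles >= threshold;
    }
}
```

Requiring several consecutive stuck cycles is what distinguishes a genuine deadlock from momentary memory pressure.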
Deadlock Elimination
 Process Termination: One or more processes involved in the deadlock may be
aborted. We can choose to abort all processes involved in the deadlock; this ensures
that the deadlock is resolved with certainty and speed.
 Resource Preemption: Resources allocated to various processes may be
successively preempted and allocated to other processes until the deadlock is broken.
IMPLEMENTATION FRAMEWORK
 Apache Hadoop is an open-source implementation of the MapReduce
programming model, supported by Yahoo and used by Google, Amazon, etc.
 It also includes the underlying Hadoop Distributed File System (HDFS).
 Hadoop has over 180 configuration parameters. Examples include the number of
replicas of input data, the number of parallel map/reduce tasks to run, and the
number of parallel connections for transferring data.
 A Hadoop installation comes with a default set of values for all the parameters in
its configuration.
 Scheduling in Hadoop is performed by a master node.
 Hadoop has a variety of schedulers. The original one schedules all jobs using a
FIFO queue in the master. Another, Hadoop on Demand (HOD), creates
private MapReduce clusters dynamically and manages them using the Torque
batch scheduler.
CHALLENGES IN MAPREDUCE SIMULATIONS
 The right level of abstraction.
 Data layout aware.
 Resource contention aware.
 Heterogeneity modeling.
 Resource heterogeneity is common in large clusters.
 Input dependence.
 Workload aware.
 Verification.
 Performance
Comparison of MapReduce simulators

Simulator      | Based on     | Language | GUI Support | Workload-aware | Resource-contention aware
MRPerf         | Ns-2         | Java     | Yes         | Yes            | Yes
Cardona et al. | GridSim      | C        | No          | Yes            | No
Mumak          | Hadoop       | C        | No          | Yes            | No
SimMR          | From scratch | -        | -           | Yes            | No
HSim           | From scratch | -        | -           | No             | Yes
MRSim          | GridSim      | Java     | Yes         | No             | Yes
SimMapReduce   | GridSim      | Java     | Yes         | No             | Yes
 Prior simulators for evaluating schedulers are trace-driven and
aware of other jobs in a workload, but they are limited in that they
are not aware of resource contention, so task execution times
may not be accurate. Our algorithm optimizes resource
provisioning, so we require a resource-contention-aware
simulator.
 It is almost impractical to set up a very large cluster consisting of
hundreds or thousands of nodes to measure the scalability of
an algorithm. Setting up a Hadoop environment involves altering
a great number of parameters that are crucial to achieving the
best performance. An obvious solution to both problems is a
simulator of the Hadoop environment: on the one hand it lets us
measure the scalability of MapReduce-based applications easily and
quickly; on the other, it lets us determine the effects of different
Hadoop configurations on the speed of MapReduce-based
applications.
 MRPerf is implemented on top of ns-2, a packet-level
network simulator, and its performance is much worse than
that of other simulators. It cannot generate accurate results for
jobs running different types of algorithms or different cluster
configurations.
 No existing implementation of HSim is available, so using it
would require a lot of work starting from scratch.
 Most current work in cloud computing is done on the
CloudSim simulator, but since our problem entails the
MapReduce model and CloudSim provides no implementation
supporting MapReduce, we are not using it.
 MRSim extends the SimJava discrete event engine to
accurately simulate the Hadoop environment. Using SimJava
we simulate the interactions between the different entities within a
cluster; the GridSim package is used for network simulation.
MRSim is written in Java on top of SimJava.
MRSIM ARCHITECTURE
 The MRSim model simulates network topology and
traffic using GridSim, and models the rest of the system
entities using the SimJava discrete event engine. The
system is designed using object-oriented models.
 Each machine is part of the network topology model.
Each machine can host a Job Tracker process and a
Task Tracker process; however, there is only one
Job Tracker per MapReduce cluster. Each Task
Tracker model can launch several map and reduce
tasks, up to the maximum number allowed in the
configuration files.
WHAT IS SIMJAVA?
 SimJava is a discrete event, process oriented simulation
package. It is an API that augments Java with building blocks
for defining and running simulations.
 Each system is considered to be a set of interacting
processes or entities as they are referred to in SimJava. These
entities communicate with each other by passing events. The
simulation time progresses on the basis of these events.
 Progress is recorded as trace messages and saved in a file.
 As of version 2.0, SimJava has been augmented with
considerable statistical and reporting support.
CONSTRUCTING A SIMULATION INVOLVES:
 coding the behavior of simulation entities, done by
extending the sim_entity class and using the body()
method;
 adding instances of these entities to the sim_system
object using sim_system.add(entity);
 linking entities' ports together using
sim_system.link_ports();
 finally, setting the simulation in motion using
sim_system.run().
GRIDSIM
 allows modelling and simulation of entities in parallel and distributed
computing (PDC) systems: users, applications, resources, and resource
brokers (schedulers), for the design and evaluation of scheduling
algorithms.
 provides a comprehensive facility for creating different classes of
heterogeneous resources that can be aggregated using resource
brokers for solving compute- and data-intensive applications. A
resource can be a single processor or a multi-processor with shared or
distributed memory, managed by time-shared or space-shared schedulers.
The processing nodes within a resource can be heterogeneous in
terms of processing capability, configuration, and availability. The
resource brokers use scheduling algorithms or policies for mapping
jobs to resources, optimizing system or user objectives depending on
their goals.
JACKSON MODEL
The Jackson API contains a lot of functionality for reading and
building JSON using Java.
It has very powerful data binding capabilities and provides
a framework to serialize custom Java objects to JSON strings
and deserialize JSON strings back to Java objects.
 JSON written with Jackson can contain embedded class
information that helps in creating the complete object tree
during deserialization.
JACKSON API
// 1. Convert a Java object to JSON
ObjectMapper mapper = new ObjectMapper();
mapper.writeValue(new File("c:\\user.json"), user);

// 2. Convert JSON to a Java object
ObjectMapper mapper = new ObjectMapper();
User user = mapper.readValue(new File("c:\\user.json"), User.class);
JOB TRACKER LAYOUT
 The main component of the simulator is the Job Tracker, which
controls the generation of map and reduce tasks, monitors when
different phases complete, and produces the final results.
 A map task is started by the Job Tracker. The following processes
take place:
• A Java VM is instantiated for the task.
• Data is read from the local disk or requested
remotely.
• Map, sort, and spill operations are performed on the
input data until all of it has been consumed.
• Background file-system mergers merge the
output data to reduce the number of output files to
one or a few files.
• A message indicating the completion of the map
task is returned to the Job Tracker.
DEMO – MRSIM
COMPARISON PARAMETERS
 Number of map and reduce slots
 CPU Usage
 Hard-disk Utilization
 Average Mapper Time
 Average Reducer Time
 Execution Time
JOB PROFILES
Referred from "Resource Provisioning Framework for MapReduce Jobs with
Performance Goals", Abhishek Verma, Ludmila Cherkasova, and
Roy H. Campbell.
TIME DURATION FOR DIFFERENT PHASES
PROFILE                           | NoOfMap, NoOfReduce | Algorithm    | T1   | T2   | T3
Profile1                          | 7,10                | SLO          | 1398 | 1344 | 1357
                                  |                     | SIGN + PRIOR | 1209 | 1207 | 1217
Profile2                          | 7,10                | SLO          | 1367 | 1368 | 1387
                                  |                     | SIGN + PRIOR | 1276 | 1256 | 1273
Profile3                          | 3,12                | SLO          | 1397 | 1380 | 1363
                                  |                     | SIGN + PRIOR | 1245 | 1288 | 1253
Profile4                          | 12,16               | SLO          | 1320 | 1402 | 1409
                                  |                     | SIGN + PRIOR | 1263 | 1285 | 1207
Profile5                          | 46,14               | SLO          | 1316 | 1368 | 1353
                                  |                     | SIGN + PRIOR | 1208 | 1254 | 1256
Profile6                          | 12,2                | SLO          | 1342 | 1376 | 1332
                                  |                     | SIGN + PRIOR | 1267 | 1265 | 1287
Profile7 (job can't be completed) | 22,33               | SLO          | 472  | 450  | 430
                                  |                     | SIGN + PRIOR | 0    | 0    | 0
Profile8                          | 16,12               | SLO          | 1327 | 1396 | 1376
                                  |                     | SIGN + PRIOR | 1233 | 1265 | 1274
MEAN TIME OVERHEADS FOR VARIOUS PHASES
SLO failed (job can't be completed within deadline) | 420
SLO executed                                        | 1334
Signature not found                                 | 1337
Signature found                                     | 937
Priority                                            | 331
COMPARISON OF BASE ALGORITHM VS PROPOSED
ALGORITHM
For each profile, the first row gives the base algorithm and the second row our algorithm.

PROFILE   | Mappers | Reducers | Algorithm | CPU usage   | HDD utilization | Time | Avg mapper time | Avg reducer time
Profile 1 | 60      | 1        | Base      | 0.00001429  | 0.00105         | 1919 | 28.021          | 238.179
          |         |          | Ours      | 0.0000020   | 0.00403         | 2372 | 25.313          | 853.76
Profile 2 | 7       | 10       | Base      | 0.000001653 | 0.001834        | 5200 | 291.21          | 316.163
          |         |          | Ours      | 0.0002732   | 0.003917        | 4095 | 283.891         | 112.045
Profile 3 | 7       | 10       | Base      | 0.000003592 | 0.0031320       | 3044 | 314.459         | 84.322
          |         |          | Ours      | 0.00913784  | 0.01550         | 4108 | 281.432         | 114.249
Profile 4 | 3       | 12       | Base      | 0.0000023   | 0.03093         | 4259 | 1143.458        | 69.098
          |         |          | Ours      | 0.0008095   | 0.01197         | 4066 | 425.292         | 108.949
Profile 5 | 12      | 16       | Base      | 0.000015307 | 0.002802        | 5239 | 164.185         | 204.315
          |         |          | Ours      | 0.001846    | 0.022107        | 4240 | 286.6           | 124.45
Profile 6 | 46      | 14       | Base      | 0.000036771 | 0.0024045       | 4163 | 426.536         | 117.796
          |         |          | Ours      | 0.0010386   | 0.01082         | 3171 | 44.416          | 105.881
Profile 7 | 12      | 2        | Base      | 0.00021723  | 0.005321        | 3986 | 205.405         | 137.099
          |         |          | Ours      | 0.0003971   | 0.007538        | 2739 | 426.411         | 100.124
Profile 8 | 16      | 12       | Base      | 0.00010813  | 0.0028452       | 4136 | 426.987         | 75.338
          |         |          | Ours      | 0.00478452  | 0.0093604       | 2863 | 122.479         | 114.748
CPU UTILIZATION
[Chart: CPU utilization per profile, Base Algorithm vs Proposed Algorithm]

HARD-DISK UTILIZATION
[Chart: hard-disk utilization per profile, Base Algorithm vs Proposed Algorithm]

EXECUTION TIME
[Chart: execution time per profile, Base Algorithm vs Proposed Algorithm]

AVERAGE MAPPER TIME
[Chart: average mapper time per profile, Base Algorithm vs Proposed Algorithm]

AVERAGE REDUCER TIME
[Chart: average reducer time per profile, Base Algorithm vs Proposed Algorithm]
RESULTS FOR JOB PROFILE 1
GRAPHS FOR PROFILE 1
RESULTS FOR JOB PROFILE 2
GRAPHICAL COMPARISON FOR PROFILE 2
TRACE FOR EXECUTION
 INFO GUISimulator:114 - <init>- done
 Initialising...
 INFO HTopology:112 - initGridSim- Initializing GridSim package
 Initialising...
 INFO HSimulator:64 - initSimulator- creat new Result dir /home/hadoop/workspace/work/hadoop.simulator/results/26-27-
Apr-2010 19:57:55
 INFO HJobTracker:311 - createEntities- create topology
 INFO HJobTracker:314 - createEntities- config.Heartbeat:1.0, read topology.getName:rack 0
 INFO HJobTracker:318 - createEntities- init NetEnd from rack
 INFO GUISimulator:389 - mnuSimStartActionPerformed- simulator has started simulator
 INFO HSimulator:106 - startSimulator- Starting simulator version
 INFO HSimulator:117 - startSimulator- trace level200
 INFO HSimulator:120 - startSimulator- graph file: /home/hadoop/workspace/work/hadoop.simulator/results/26-27-Apr-
2010 19:57:55/graph.sjg
 INFO HSimulator:125 - startSimulator- going to call Sim_system.run()
 Entities started.
 Entity huser has no body().
 INFO HJobTracker:129 - body- start entity
 INFO SimoTreeCollector:94 - body- add rack {m1=m1}
 INFO GUISimulator:394 - mnuSimStopActionPerformed- going to stop simulator
 INFO HTopology:252 - stopSimulation- Stopping NetEnd Simulation
TRACE CONTINUED…
 INFO HJobTracker:622 - stopSimulation- send end of simualtion 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0
 INFO CPU:148 - body- cpu_m1 END_OF_SIMULATION 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0
 INFO HDD:148 - body- hdd_m1 END_OF_SIMULATION 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0
 INFO NetEnd:100 - body- m1 end simulation at time 10.0
 INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0
 INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0
 INFO SimoTreeCollector:78 - body- simotree END_OF_SIMULATION 10.0
 INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0
OUTPUT SNAPSHOTS FOR PROPOSED
ALGORITHM
REFERENCES
[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, “Predicting execution bottlenecks in map-reduce
clusters” In Proc. of the 4th USENIX conference on Hot Topics in Cloud computing, 2012.
[2] R. Buyya, S. K. Garg, and R. N. Calheiros, “SLA-Oriented Resource Provisioning for Cloud
Computing: Challenges, Architecture, and Solutions” In International Conference on Cloud and
Service Computing, 2011.
[3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, “Optimization of Resource Provisioning Cost in Cloud
Computing” in Transactions On Service Computing, Vol. 5, No. 2, IEEE, April-June 2012
[4] A. Verma, L. Cherkasova, and R.H. Campbell, “Resource Provisioning Framework for MapReduce Jobs
with Performance Goals”, in Middleware 2011, LNCS 7049, pp. 165–186, 2011
[5] J. Dean, and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”,
Communications of the ACM, Jan 2008
[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, “Resource Provisioning for Cloud Computing” In Proc.
of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009.
[7] K. Kambatla, A. Pathak, and H. Pucha, “Towards optimizing hadoop provisioning in the cloud”, in
Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009
[8] Kuyoro S. O., Ibikunle F. and Awodele O., “Cloud Computing Security Issues and Challenges” in
International Journal of Computer Networks (IJCN), Vol. 3, Issue 5, 2011
[9] R. Lammel, “Google’s MapReduce programming model – Revisited” in Journal of Science of
Computer Programming, Oct 2007
[10] R. P. Padhy, “Big Data Processing with Hadoop-MapReduce in Cloud Systems” In International
Journal of Cloud Computing and Services Science, vol. 2, Feb 2013.
[11] B. Palanisamy, A. Singh, L. Liu and B. Langston, "Cura: A Cost-Optimized Model for MapReduce
in a Cloud", Proc. of 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS
2013)
[12] A. Rasmussen, M. Conley, R. Kapoor, V. T. Lam, G. Porter, and A. Vahdat, “Themis: An I/O-
Efficient MapReduce”, Communications of the ACM, Oct 2012
[13] V. K. Reddy, B. T. Rao, L.S.S. Reddy, and P. S. Kiran, “Research Issues in Cloud Computing”,
in Global Journal of Computer Science and Technology, vol. 11, Jul 2011
[14] T. Sandholm and K. Lai, “MapReduce Optimization Using Regulated Dynamic Prioritization” in
Social Computing Laboratory, Hewlett-Packard Laboratories, 2011
[15] F. Tian, K. Chen,”Towards Optimal Resource Provisioning for Running MapReduce Programs in
Public Clouds”, in 4th Intl. Conference on Cloud Computing, IEEE, 2011
[16] Hadoop. http://hadoop.apache.org.
[17] Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/
A Novel Technique in Software Engineering for Building Scalable Large Paralle...A Novel Technique in Software Engineering for Building Scalable Large Paralle...
A Novel Technique in Software Engineering for Building Scalable Large Paralle...
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
My mapreduce1 presentation
My mapreduce1 presentationMy mapreduce1 presentation
My mapreduce1 presentation
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola Peric
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
 
Masters Report 3
Masters Report 3Masters Report 3
Masters Report 3
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Pearson1e ch14 appendix_14_1
Pearson1e ch14 appendix_14_1Pearson1e ch14 appendix_14_1
Pearson1e ch14 appendix_14_1
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...
AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...
AN OPEN SHOP APPROACH IN APPROXIMATING OPTIMAL DATA TRANSMISSION DURATION IN ...
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
Parallel programming
Parallel programmingParallel programming
Parallel programming
 
RTS
RTSRTS
RTS
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Data science
Data scienceData science
Data science
 

Semelhante a OPTIMAL MAPREDUCE RESOURCE PROVISIONING IN THE CLOUD

Scheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic AlgorithmScheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic Algorithmiosrjce
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsLEGATO project
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.pptCheeWeiTan10
 
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...IRJET Journal
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSArchana Gopinath
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 
Performance analysis and randamized agoritham
Performance analysis and randamized agorithamPerformance analysis and randamized agoritham
Performance analysis and randamized agorithamlilyMalar1
 
An Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemAn Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemIRJET Journal
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptxShimoFcis
 
Wei's notes on MapReduce Scheduling
Wei's notes on MapReduce SchedulingWei's notes on MapReduce Scheduling
Wei's notes on MapReduce SchedulingLu Wei
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 

Semelhante a OPTIMAL MAPREDUCE RESOURCE PROVISIONING IN THE CLOUD (20)

M017327378
M017327378M017327378
M017327378
 
Scheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic AlgorithmScheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic Algorithm
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Unit 2
Unit 2Unit 2
Unit 2
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Resource management
Resource managementResource management
Resource management
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
Skyline queries
Skyline queriesSkyline queries
Skyline queries
 
Map reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICSMap reduce in Hadoop BIG DATA ANALYTICS
Map reduce in Hadoop BIG DATA ANALYTICS
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
Performance analysis and randamized agoritham
Performance analysis and randamized agorithamPerformance analysis and randamized agoritham
Performance analysis and randamized agoritham
 
An Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing SystemAn Algorithm for Optimized Cost in a Distributed Computing System
An Algorithm for Optimized Cost in a Distributed Computing System
 
Unit3 MapReduce
Unit3 MapReduceUnit3 MapReduce
Unit3 MapReduce
 
Green scheduling
Green schedulingGreen scheduling
Green scheduling
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Wei's notes on MapReduce Scheduling
Wei's notes on MapReduce SchedulingWei's notes on MapReduce Scheduling
Wei's notes on MapReduce Scheduling
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 

Mais de Deanna Kosaraju

Speak Out and Change the World! Voices 2015
Speak Out and Change the World!   Voices 2015Speak Out and Change the World!   Voices 2015
Speak Out and Change the World! Voices 2015Deanna Kosaraju
 
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Deanna Kosaraju
 
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...Deanna Kosaraju
 
Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Deanna Kosaraju
 
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Deanna Kosaraju
 
The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015Deanna Kosaraju
 
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Deanna Kosaraju
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Deanna Kosaraju
 
Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Deanna Kosaraju
 
Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Deanna Kosaraju
 
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Deanna Kosaraju
 
ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015Deanna Kosaraju
 
Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Deanna Kosaraju
 
Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Deanna Kosaraju
 
Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Deanna Kosaraju
 
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...Deanna Kosaraju
 
Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Deanna Kosaraju
 
Agility and cloud computing
Agility and cloud computingAgility and cloud computing
Agility and cloud computingDeanna Kosaraju
 

Mais de Deanna Kosaraju (20)

Speak Out and Change the World! Voices 2015
Speak Out and Change the World!   Voices 2015Speak Out and Change the World!   Voices 2015
Speak Out and Change the World! Voices 2015
 
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
 
Change IT! Voices 2015
Change IT! Voices 2015Change IT! Voices 2015
Change IT! Voices 2015
 
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
 
Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015
 
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
 
The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015
 
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
 
Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015
 
Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015
 
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
 
ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015
 
Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015
 
Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015
 
Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015
 
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
 
Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015
 
Agility and cloud computing
Agility and cloud computingAgility and cloud computing
Agility and cloud computing
 

Último

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 

Último (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 

  • 5. WHY MAPREDUCE OPTIMIZATION  The MapReduce programming paradigm lends itself well to most data-intensive analytics jobs, given its ability to scale out and leverage several machines to process data in parallel.  Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately applicable to MapReduce-based applications.  MapReduce jobs have over 180 configuration parameters. Setting a value too high can cause resource contention and degrade overall performance; setting it too low might under-utilize the resources and, once again, reduce performance.  Each application has a different bottleneck resource (CPU : disk : network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters so that the bottleneck resource is maximally utilized.
  • 6. WORK FLOW OF PROPOSED SOLUTION [Flow diagram] User application → signature matching algorithm (against a database of signatures): if a match is found (Yes), go directly to the resource provisioning framework; if not (No), run SLO-based provisioning → priority algorithm → bottleneck removal → resource provisioning framework → optimal number of map/reduce slots.
  • 7. PROPOSED ALGORITHM 1. Signature matching: A sample of the input is run on the cloud to generate a resource consumption signature, which is matched against a database. If a match is found, we use the optimal configuration stored for the matched signature; otherwise we move to SLO-based provisioning. 2. SLO-based resource provisioning: Based on the number of map and reduce tasks, the available slots, and the time constraints, we calculate the optimal number of map and reduce tasks to run in parallel. 3. Priority assignment: To give users better control over provisioning, priorities are assigned in this stage. 4. Skew mitigation: Managing parallel partitions. 5. Bottleneck removal: The most common problem in parallel computation is the bottleneck. 6. Deadlock detection and removal: This stage removes deadlocks to improve execution time.
  • 8. 1. SIGNATURE MATCHING
  • 9. MATHEMATICAL MODEL  The entire job run is split into n (a pre-chosen number) intervals of equal duration.  For the ith interval, compute the average consumption of each resource r. The resource types (us, sy, wa, id, bi, bo, ni, no, sr) are % CPU in user time, system time, waiting time, idle time, disk blocks in, disk blocks out, network in, network out, and slow ratio, respectively.  Generate a resource consumption signature set Sr for every resource r as Srm = {Srm1, Srm2, ..., Srmn}.  The distance between a generated signature and a signature in the database is computed as χ²(S_R1_m, S_R2_m) = Σ_{i=1..n} (S_R1_mi − S_R2_mi)² / (S_R1_mi + S_R2_mi)  χ² represents the vector distance between two signatures for a particular resource r in the time-interval vector space. We compute the scalar sum of χ² over all resource types; a lower sum indicates more similar signatures. We choose the configuration of the application whose signature distance sum to the new application is smallest.
  • 10. ALGORITHM 1. Take a sample input IS of appropriate size from the actual input. 2. Take a resource set RS. 3. Take the signature database with average distance between signatures DAVG. 4. Split the entire job run into n (a pre-chosen number) intervals of equal duration. 5. For each resource type in (us, sy, wa, id, bi, bo, ni, no, sr): 6. For the ith interval from 1 to n: 7. Compute the average resource consumption, generating a resource consumption signature set Sr for every resource r as Srm = {Srm1, Srm2, ..., Srmn}. 8. Set min_distance = ∞ (a sufficiently large sentinel). 9. For every signature S in the database: 10. Find the distance D between the calculated signature and S. 11. If D < min_distance, set min_distance = D and Signature_matched = S. 12. Set a precision value P. 13. If min_distance > P*DAVG, return "no match found". 14. Else return Signature_matched.
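The matching step above can be sketched as follows; class and method names are illustrative (not from an existing codebase), and the signature database is modeled as an in-memory list of per-resource interval vectors.

```java
import java.util.*;

// Sketch of signature matching: a signature is one vector of per-interval
// averages per resource type (us, sy, wa, id, bi, bo, ni, no, sr); the
// distance between two applications is the scalar sum of the chi-square
// vector distances over all resources.
public class SignatureMatcher {

    // Chi-square distance between two n-interval signatures of one resource.
    static double chiSquare(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double sum = a[i] + b[i];
            if (sum > 0) {                      // skip empty intervals
                double diff = a[i] - b[i];
                d += diff * diff / sum;
            }
        }
        return d;
    }

    // Total distance: sum of chi-square distances over all resource types.
    static double distance(double[][] s1, double[][] s2) {
        double total = 0.0;
        for (int r = 0; r < s1.length; r++) total += chiSquare(s1[r], s2[r]);
        return total;
    }

    // Index of the closest stored signature, or -1 when even the best match
    // exceeds precision * avgDistance (i.e., "no match found").
    static int match(double[][] probe, List<double[][]> db,
                     double precision, double avgDistance) {
        int best = -1;
        double min = Double.MAX_VALUE;          // the "large sentinel" of step 8
        for (int k = 0; k < db.size(); k++) {
            double d = distance(probe, db.get(k));
            if (d < min) { min = d; best = k; }
        }
        return (min > precision * avgDistance) ? -1 : best;
    }
}
```

An identical signature yields distance 0, so a stored copy of the probe always matches itself before any threshold is applied.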
  • 11. 2. SLO-BASED PROVISIONING Given a MapReduce job J with input dataset D, identify minimal combinations (S_M_J, S_R_J) of map and reduce slots that can be allocated to job J so that it finishes within time T. Step I: Create a compact job profile that reflects all phases of the job: the map, shuffle/sort, and reduce phases. Map stage: (Mmin, Mavg, Mmax, AvgSizeM_input, SelectivityM) Shuffle stage: (Sh1_avg, Sh1_max, ShTyp_avg, ShTyp_max) Reduce stage: (Rmin, Ravg, SelectivityR) Step II: There are three design choices with respect to the completion time: 1) T is targeted as a lower bound on the job completion time. Typically, this leads to the least amount of resources allocated to the job for finishing within deadline T. The lower bound corresponds to an ideal computation under the allocated resources and is rarely achievable in real environments. 2) T is targeted as an upper bound on the job completion time. Typically, this leads to a more aggressive resource allocation and may yield a completion time much smaller than T, because worst-case scenarios are also rare in production settings. 3) T is targeted as the average of the lower and upper bounds on job completion time. This more balanced resource allocation may provide a solution that enables the job to complete within time T.
  • 12. MATHEMATICAL MODEL – MAKESPAN Makespan theorem: the makespan of the greedy assignment of n tasks to k slots is at least n*avg/k and at most (n − 1)*avg/k + max. Suppose the dataset is partitioned into N_M_J map tasks and N_R_J reduce tasks, and let S_M_J and S_R_J be the numbers of map and reduce slots. By the theorem, the lower and upper bounds on the duration of the entire map stage (denoted T_M_low and T_M_up) are estimated as: T_M_low = N_M_J * Mavg / S_M_J T_M_up = (N_M_J − 1) * Mavg / S_M_J + Mmax T_Sh_low = (N_R_J / S_R_J − 1) * ShTyp_avg T_Sh_up = ((N_R_J − 1) / S_R_J − 1) * ShTyp_avg + ShTyp_max The job bounds combine the stages, with the first shuffle wave Sh1_avg counted separately: T_J_low = T_M_low + Sh1_avg + T_Sh_low + T_R_low T_J_up = T_M_up + Sh1_avg + T_Sh_up + T_R_up Expanding the lower bound (with T_R_low = N_R_J * Ravg / S_R_J): T_J_low = N_M_J*Mavg / S_M_J + N_R_J*(ShTyp_avg + Ravg) / S_R_J + Sh1_avg − ShTyp_avg = A_J_low * N_M_J / S_M_J + B_J_low * N_R_J / S_R_J + C_J_low where A_J_low = Mavg, B_J_low = ShTyp_avg + Ravg, C_J_low = Sh1_avg − ShTyp_avg. Taking T_J_low as T (the expected completion time): T = A_J_low * N_M_J / S_M_J + B_J_low * N_R_J / S_R_J + C_J_low
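The bound T_J_low and the map-stage upper bound can be transcribed directly from the formulas above; the variable names below are illustrative, not from an existing codebase.

```java
// Completion-time bounds from the compact job profile (illustrative names).
public class MakespanBounds {

    // T_J_low = A * N_M / S_M + B * N_R / S_R + C, where
    // A = Mavg, B = ShTypAvg + Ravg, C = Sh1Avg - ShTypAvg.
    static double jobLowerBound(int nMap, int nRed, int sMap, int sRed,
                                double mAvg, double rAvg,
                                double sh1Avg, double shTypAvg) {
        double A = mAvg;
        double B = shTypAvg + rAvg;
        double C = sh1Avg - shTypAvg;
        return A * nMap / sMap + B * nRed / sRed + C;
    }

    // Map-stage upper bound from the Makespan Theorem:
    // T_M_up = (N_M - 1) * Mavg / S_M + Mmax.
    static double mapUpperBound(int nMap, int sMap, double mAvg, double mMax) {
        return (nMap - 1) * mAvg / sMap + mMax;
    }
}
```

With e.g. 10 map tasks on 2 slots and an average map time of 4, the upper bound is 9*4/2 + Mmax, matching the theorem term by term.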
  • 13. In the algorithm, T is targeted as a lower bound of the job completion time. The algorithm sweeps through the entire range of map slot allocations and finds the corresponding number of reduce slots needed to complete the job within time T. Resource allocation algorithm Input: job profile of J; (N_M_J, N_R_J) ← number of map and reduce tasks of J; (S_M, S_R) ← total number of map and reduce slots in the cluster; T ← deadline by which the job must be completed. Output: P ← set of plausible resource allocations (S_M_J, S_R_J). Algorithm: for S_M_J ← MIN(N_M_J, S_M) down to 1 do Solve A_J_low·N_M_J / S_M_J + B_J_low·N_R_J / S_R_J = T − C_J_low for S_R_J if 0 < S_R_J ≤ S_R then P ← P ∪ (S_M_J, S_R_J) else // Job cannot be completed within deadline T with the allocated map slots Break out of the loop end if end for The complexity of the proposed algorithm is O(min(N_M_J, S_M)) and thus linear in the number of map slots.
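A minimal sketch of the sweep, assuming the lower-bound model T = A·N_M/S_M + B·N_R/S_R + C from the previous slide; names are illustrative, and the reduce-slot count is obtained by solving for S_R and rounding up to the next integer.

```java
import java.util.*;

// For each feasible number of map slots (largest first), solve the
// lower-bound equation for the reduce slots and keep the (S_M, S_R) pairs
// that fit within the cluster; stop once the deadline cannot be met.
public class AllocationSweep {

    static List<int[]> plausibleAllocations(int nMap, int nRed,
                                            int clusterMapSlots, int clusterRedSlots,
                                            double A, double B, double C, double T) {
        List<int[]> result = new ArrayList<>();
        for (int sM = Math.min(nMap, clusterMapSlots); sM >= 1; sM--) {
            double remaining = T - C - A * nMap / sM;   // time left for reduce side
            if (remaining <= 0) break;                  // map stage alone exceeds T
            int sR = (int) Math.ceil(B * nRed / remaining);
            if (sR >= 1 && sR <= clusterRedSlots) {
                result.add(new int[]{sM, sR});
            } else if (sR > clusterRedSlots) {
                break;  // needs more reduce slots than the cluster has
            }
        }
        return result;
    }
}
```

Since shrinking the map allocation only increases the reduce slots required, breaking out of the loop on the first infeasible point is safe, which is what keeps the sweep linear.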
  • 14. 3. PRIORITY ALGORITHM  Workflow Priority o prioritizes entire workflows o increase spending on all workflows that are more important and drop spending on less important workflows o Importance may be implied by proximity to deadline, current demand of anticipated output or whether the application is in a test or production phase.  Stage Priority o Prioritizes different stages of a single workflow o system splits a budget according to user-defined weights o budget is split within the workflow across the different stages o Spending more on phases where resources are more critical, the overall utility of the workflow may be increased
  • 15. MATHEMATICAL MODEL  Workflow priority o Suppose we have n workflows with weight vector w = [w1, w2, ..., wn]. o The total weight of the job is W = w1 + w2 + ... + wn. o The budget for workflow i is bwi = bs * wi / W, where bs is the total budget of the job.  Stage priority o Suppose we have m stages with weight vector sw = [sw1, sw2, ..., swm]. o The total weight of the workflow is SW = sw1 + sw2 + ... + swm. o The budget for stage i is bswi = bw * swi / SW, where bw is the total budget of the workflow.
  • 16. ALGORITHM 1. Consider a job with n workflows, each consisting of m stages. 2. Users are asked to input the total budget, workflow priorities, and stage priorities. 3. Low priority has value 1 and high priority has value 0.5, so as to spend double on high-priority work. 4. Calculate the budget for each workflow: bwi = bs * wi / W. 5. Use bwi to find the resource share of the workflow. 6. Calculate the budget for each stage: bswi = bw * swi / SW. 7. Use bswi to find the resource share of the stage. 8. A high-priority workflow or stage is given more cost and time for execution, and thus has a higher spending rate, i.e., a higher b/d ratio.
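The proportional rule from steps 4 and 6 can be sketched as a single helper, since the same split is applied twice: the job budget across workflows, then each workflow budget across its stages. This is an illustrative fragment, not code from an existing system.

```java
// Budget split by user-assigned weights: b_i = total * w_i / sum(w).
public class PriorityBudget {

    static double[] split(double totalBudget, double[] weights) {
        double sum = 0.0;
        for (double w : weights) sum += w;       // W (or SW for stages)
        double[] budgets = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            budgets[i] = totalBudget * weights[i] / sum;
        }
        return budgets;
    }
}
```

For a stage split, the same call is made with the workflow's budget bwi as the total and the stage weights sw as the vector.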
  • 17. SKEW MITIGATION  To support parallelism, partitions must be small enough that several can be processed in parallel. To avoid record skew, a partitioning function is selected that keeps each partition roughly the same size.  On each node, the map operation is applied to a prefix of the records in each input file stored on that node.  As the map function produces records, the node records information about the intermediate data, such as how much larger or smaller it is than the input and the number of records generated. It also stores information about each intermediate key and the size of its associated record.  It sends this metadata to the coordinator, which merges the metadata from all nodes to estimate the intermediate data size. The coordinator then uses this size, and the desired partition size, to compute the number of partitions.  Then it performs a streaming merge-sort on the samples from each node. Once all the sampled data is sorted, partition boundaries are calculated based on the desired partition sizes. The result is a list of "boundary keys" that define the edges of each partition.
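The coordinator's final step, deriving boundary keys from the merged sorted sample, can be sketched as follows. This is an illustrative simplification (even spacing over an in-memory sorted sample), not the actual streaming implementation.

```java
import java.util.*;

// Pick numPartitions - 1 evenly spaced keys out of a sorted sample of
// intermediate keys; these become the edges of the partitions.
public class PartitionPlanner {

    static <K> List<K> boundaryKeys(List<K> sortedSample, int numPartitions) {
        List<K> boundaries = new ArrayList<>();
        for (int p = 1; p < numPartitions; p++) {
            int idx = p * sortedSample.size() / numPartitions;  // p-th quantile index
            boundaries.add(sortedSample.get(idx));
        }
        return boundaries;
    }
}
```

Because the sample approximates the key distribution, quantile-based boundaries yield partitions of roughly equal record volume even when the key space itself is skewed.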
  • 18. BOTTLENECK REMOVAL  A map-reduce system can simultaneously run multiple jobs competing for the node’s resources and traffic bandwidth.  These conflicts cause slowdown in the execution of tasks. The duration of each phase, and hence the duration of the job is determined by the slowest, or straggler task.  The slowdowns of individual tasks are highly correlated with overall job latencies.  However, significant task slowdowns tend to indicate bottlenecks in job execution as well.
  • 19. MATHEMATICAL MODEL Bottleneck detection  Te_i is the expected execution time of task i.  Tr_i is the running time of task i.  Te_i > Tr_i means no bottleneck.  Tr_i − Te_i > t means a bottleneck is present, where t is a threshold derived from past data: if a task has been running for more than t beyond its expected time, a bottleneck is detected. Bottleneck elimination  ni = number of idle nodes, na = number of active nodes, f = boost factor.  To reduce the bottleneck, we distribute tasks such that the mean spending equals the average spending, i.e., b/d.  Spending at an active node = (b/d) * (1 + (ni/na) * f)  Spending at an idle node = (b/d) * (1 − f)  E = na/(na+ni) * (b/d) * (1 + (ni/na) * f) + ni/(na+ni) * (b/d) * (1 − f) = b/((na+ni)*d) * (na + ni*f + ni − ni*f) = b/((na+ni)*d) * (na + ni) = b/d = avg. spending
  • 20. ALGORITHM  Bottleneck avoidance Step 1: Compute task and node features 1. Run the task on the cloud. 2. Collect performance traces every 10 minutes and store the results in a file. Step 2: Compute the slowdown factor 1. Compare the current job trace with already completed jobs. 2. Calculate the slowdown factor, which is the ratio of the current job's parameters to those of a similar job. Step 3: Give the slowdown factor of each job to the scheduler 1. The scheduler schedules jobs with a high slowdown factor first. 2. The scheduler does not schedule high-slowdown jobs onto congested hardware nodes.  Bottleneck detection Step 1: Estimate the execution time of each job using historical data. Step 2: Periodically compute the time for which the job has been running. Step 3: Compare the expected execution time and the running time 1. If Te_i > Tr_i, there is no bottleneck. 2. Else if Tr_i − Te_i > t, a bottleneck has occurred.
  • 21.  Bottleneck elimination To reduce execution time, we can run a bottleneck-elimination algorithm that schedules redundant copies of the remaining tasks on nodes that have no other work to perform. Bottleneck elimination algorithm 1. idle ← GETIDLENODES(nodes) 2. active ← nodes − idle 3. ni ← SIZE(idle) 4. na ← SIZE(active) 5. for each node ∈ active: node.spending ← b/d ∗ (1 + (ni/na) ∗ f) 6. for each node ∈ idle: node.spending ← b/d ∗ (1 − f) where f is a boost factor between 0 and 1 set by the user, b is the budget, and d is the duration.
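The spending rules in the pseudocode above can be transcribed directly; the helper below also checks the property derived on slide 19, that the mean spending over all nodes stays at b/d. Names are illustrative.

```java
// Spending assignment for bottleneck elimination: boost active nodes,
// cut idle ones, keeping the mean spending across all nodes at b/d.
public class BottleneckElimination {

    // Returns {activeSpending, idleSpending} for budget b, duration d,
    // na active nodes, ni idle nodes, and boost factor 0 <= f <= 1.
    static double[] spending(double b, double d, int na, int ni, double f) {
        double base = b / d;
        double active = base * (1 + ((double) ni / na) * f);
        double idle = base * (1 - f);
        return new double[]{active, idle};
    }

    // Mean spending over all nodes; equals b/d by the derivation above,
    // since na*(1 + (ni/na)f) + ni*(1 - f) = na + ni.
    static double meanSpending(double b, double d, int na, int ni, double f) {
        double[] s = spending(b, d, na, ni, f);
        return (na * s[0] + ni * s[1]) / (na + ni);
    }
}
```

With b = 100, d = 10, 3 active nodes, 2 idle nodes and f = 0.5, active nodes spend 10·(4/3) and idle nodes spend 5, and the mean works out to exactly b/d = 10.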
  • 22. DEADLOCK A deadlock may occur between mappers and reducers, with no progress in the job, when:  The initially available map/reduce slots are allocated to mappers.  Once a few mappers complete, reducers start occupying some of the slots.  After a while, all slots are occupied by reducers.  Since some mapper tasks have not yet been assigned a slot, the map phase never completes.  The system enters a deadlock state where reducers occupy all available slots but are waiting for mappers to complete, while mappers cannot move forward because no slot is available. Deadlock prevention: Unlike existing MapReduce systems, which execute map and reduce tasks concurrently in waves, we can implement the MapReduce programming model in two phases of operation:  Phase 1: Map and shuffle The Reader stage reads records from an input disk and sends them to the Mapper stage, which applies the map function to each record. As the map function produces intermediate records, each record's key is hashed to determine the node to which it should be sent, and the record is placed in a per-destination buffer that is handed to the sender when it is full.
  • 23.  Phase 2: Sort and reduce In phase two, each partition is sorted by key, and the reduce function is applied to groups of records with the same key. Deadlock detection:  The deadlock detector periodically probes workers to see if they are waiting for a memory allocation request to complete.  If multiple probe cycles pass in which all workers are waiting for an allocation or are idle, the deadlock detector informs the memory allocator that a deadlock has occurred. Deadlock elimination  Process termination: One or more processes involved in the deadlock may be aborted. We can choose to abort all processes involved in the deadlock, which resolves it with certainty and speed.  Resource preemption: Resources allocated to various processes may be successively preempted and allocated to other processes until the deadlock is broken.
  • 24. IMPLEMENTATION FRAMEWORK  Apache Hadoop is an open-source implementation of the MapReduce programming model, supported by Yahoo! and used by companies such as Google and Amazon.  It also includes the underlying Hadoop Distributed File System (HDFS).  Hadoop has over 180 configuration parameters. Examples include the number of replicas of input data, the number of parallel map/reduce tasks to run, and the number of parallel connections for transferring data.  A Hadoop installation comes with a default value for every parameter in its configuration.  Scheduling in Hadoop is performed by a master node.  Hadoop has a variety of schedulers. The original one schedules all jobs using a FIFO queue in the master. Another, Hadoop on Demand (HOD), creates private MapReduce clusters dynamically and manages them using the Torque batch scheduler.
  • 25. CHALLENGES IN MAPREDUCE SIMULATIONS  The right level of abstraction.  Data layout aware.  Resource contention aware.  Heterogeneity modeling.  Resource heterogeneity is common in large clusters.  Input dependence.  Workload aware.  Verification.  Performance
  • 26. COMPARISON OF MAPREDUCE SIMULATORS
  Simulator      | Based on     | Language | GUI support | Workload-aware | Resource-contention aware
  MRPerf         | ns-2         | Java     | Yes         | Yes            | Yes
  Cardona et al. | GridSim      | C        | No          | Yes            | No
  Mumak          | Hadoop       | C        | No          | Yes            | No
  SimMR          | From scratch | -        | -           | Yes            | No
  HSim           | From scratch | -        | -           | No             | Yes
  MRSim          | GridSim      | Java     | Yes         | No             | Yes
  SimMapReduce   | GridSim      | Java     | Yes         | No             | Yes
  • 27.  Prior simulators for evaluating schedulers are trace-driven and aware of other jobs in a workload, but they are not aware of resource contention, so simulated task execution times may be inaccurate. Our algorithm optimizes resource provisioning, so we require a resource-contention-aware simulator.  It is almost impractical to set up a very large cluster consisting of hundreds or thousands of nodes to measure the scalability of an algorithm, and setting up a Hadoop environment involves altering a great number of parameters that are crucial to achieving the best performance. An obvious solution to both problems is a simulator of the Hadoop environment: it allows us to measure the scalability of MapReduce-based applications easily and quickly, and it lets us determine the effect of different Hadoop configurations on the behavior of MapReduce-based applications in terms of speed.
  • 28.  MRPerf is implemented on top of ns-2, a packet-level network simulator, and its performance is much worse than that of other simulators. It cannot generate accurate results for jobs with different types of algorithms or different cluster configurations.  No implementation of HSim is publicly available, so using it would require starting from scratch.  Most current work in cloud computing uses the CloudSim simulator, but since our problem entails the MapReduce model and CloudSim provides no MapReduce support, we are not using it.  MRSim extends the SimJava discrete event engine to accurately simulate the Hadoop environment. Using SimJava, we simulate interactions between different entities within the cluster. The GridSim package is also used for network simulation. MRSim is written in Java on top of SimJava.
  • 30.  MRSim simulates network topology and traffic using GridSim and models the rest of the system entities using the SimJava discrete event engine. The system is designed using object-oriented models.  Each machine is part of the network topology model. Each machine can host a Job Tracker process and a Task Tracker process; however, there is only one Job Tracker per MapReduce cluster. Each Task Tracker model can launch several map and reduce tasks, up to the maximum number allowed in the configuration files.
  • 31. WHAT IS SIMJAVA?  SimJava is a discrete event, process oriented simulation package. It is an API that augments Java with building blocks for defining and running simulations.  Each system is considered to be a set of interacting processes or entities as they are referred to in SimJava. These entities communicate with each other by passing events. The simulation time progresses on the basis of these events.  Progress is recorded as trace messages and saved in a file.  As of version 2.0, SimJava has been augmented with considerable statistical and reporting support.
  • 32. CONSTRUCTING A SIMULATION INVOLVES:  Coding the behavior of simulation entities, done by extending the sim_entity class and using the body() method.  Adding instances of these entities to the sim_system object using sim_system.add(entity).  Linking entities' ports together using sim_system.link_ports().  Finally, setting the simulation in motion using sim_system.run().
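Running the real API above requires the SimJava library. As a self-contained stand-in, the minimal discrete-event loop below mirrors the same structure: entities exchange timestamped events, the clock jumps from event to event, and each processed event is recorded as a trace message. It is an analogue for illustration, not the SimJava API.

```java
import java.util.*;

// Minimal discrete-event engine: events are ordered by timestamp in a
// priority queue; run() drains them, advancing the simulation clock and
// appending one trace line per event (as SimJava records trace messages).
public class MiniSim {
    static class Event implements Comparable<Event> {
        final double time; final String target; final String tag;
        Event(double time, String target, String tag) {
            this.time = time; this.target = target; this.tag = tag;
        }
        public int compareTo(Event o) { return Double.compare(time, o.time); }
    }

    final PriorityQueue<Event> queue = new PriorityQueue<>();
    final List<String> trace = new ArrayList<>();

    void schedule(double time, String target, String tag) {
        queue.add(new Event(time, target, tag));
    }

    // Process all events in timestamp order; returns the final clock value.
    double run() {
        double clock = 0.0;
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            clock = e.time;
            trace.add(clock + " " + e.target + " " + e.tag);
        }
        return clock;
    }
}
```

Scheduling events out of order still produces a time-ordered trace, which is the essential property the SimJava kernel provides to its entities.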
  • 33. GRIDSIM  Allows modelling and simulation of entities in parallel and distributed computing (PDC) systems (users, applications, resources, and resource brokers/schedulers) for the design and evaluation of scheduling algorithms.  Provides a comprehensive facility for creating different classes of heterogeneous resources that can be aggregated using resource brokers for solving compute- and data-intensive applications. A resource can be a single processor or a multi-processor with shared or distributed memory, managed by time- or space-shared schedulers. The processing nodes within a resource can be heterogeneous in terms of processing capability, configuration, and availability. The resource brokers use scheduling algorithms or policies for mapping jobs to resources to optimize system or user objectives, depending on their goals.
  • 34. JACKSON MODEL The Jackson API contains extensive functionality for reading and building JSON in Java. It has powerful data binding capabilities and provides a framework to serialize custom Java objects to JSON strings and to deserialize JSON strings back to Java objects.  JSON written with Jackson can contain embedded class information that helps in creating the complete object tree during deserialization.
  • 35. JACKSON API
  // 1. Convert a Java object to JSON
  ObjectMapper mapper = new ObjectMapper();
  mapper.writeValue(new File("c:\\user.json"), user);
  // 2. Convert JSON back to a Java object
  ObjectMapper mapper = new ObjectMapper();
  User user = mapper.readValue(new File("c:\\user.json"), User.class);
  • 37.  The main component of the simulator is the Job Tracker, which controls the generation of map and reduce tasks, monitors when the different phases complete, and produces the final results.  A map task is started by the Job Tracker; the following steps take place: • A Java VM is instantiated for the task. • Data is read from the local disk or requested remotely. • Map, sort, and spill operations are performed on the input data until all of it has been consumed. • Background file system mergers merge the output data to reduce the number of output files to one or a few. • A message indicating completion of the map task is returned to the Job Tracker.
  • 39. COMPARISON PARAMETERS  Number of map and reduce slots  CPU Usage  Hard-disk Utilization  Average Mapper Time  Average Reducer Time  Execution Time
  • 40. JOB PROFILES Referred from: A. Verma, L. Cherkasova, and R. H. Campbell, "Resource Provisioning Framework for MapReduce Jobs with Performance Goals".
  • 41. TIME DURATION FOR DIFFERENT PHASES
  Profile                       | Maps,Reduces | Method       | T1   | T2   | T3
  Profile1                      | 7,10         | SLO          | 1398 | 1344 | 1357
                                |              | SIGN + PRIOR | 1209 | 1207 | 1217
  Profile2                      | 7,10         | SLO          | 1367 | 1368 | 1387
                                |              | SIGN + PRIOR | 1276 | 1256 | 1273
  Profile3                      | 3,12         | SLO          | 1397 | 1380 | 1363
                                |              | SIGN + PRIOR | 1245 | 1288 | 1253
  Profile4                      | 12,16        | SLO          | 1320 | 1402 | 1409
                                |              | SIGN + PRIOR | 1263 | 1285 | 1207
  Profile5                      | 46,14        | SLO          | 1316 | 1368 | 1353
                                |              | SIGN + PRIOR | 1208 | 1254 | 1256
  Profile6                      | 12,2         | SLO          | 1342 | 1376 | 1332
                                |              | SIGN + PRIOR | 1267 | 1265 | 1287
  Profile7 (job can't complete) | 22,33        | SLO          | 472  | 450  | 430
                                |              | SIGN + PRIOR | 0    | 0    | 0
  Profile8                      | 16,12        | SLO          | 1327 | 1396 | 1376
                                |              | SIGN + PRIOR | 1233 | 1265 | 1274
  • 42. MEAN TIME OVERHEADS FOR VARIOUS PHASES
  SLO failed (job can't be completed within deadline) | 420
  SLO executed                                        | 1334
  Signature not found                                 | 1337
  Signature found                                     | 937
  Priority                                            | 331
  • 43. COMPARISON OF BASE ALGORITHM VS PROPOSED ALGORITHM
  Profile   | Mappers | Reducers | Algorithm | CPU usage   | HDD utilization | Time | Avg mapper time | Avg reducer time
  Profile 1 | 60      | 1        | Base      | 0.00001429  | 0.00105   | 1919 | 28.021   | 238.179
            |         |          | Proposed  | 0.0000020   | 0.00403   | 2372 | 25.313   | 853.76
  Profile 2 | 7       | 10       | Base      | 0.000001653 | 0.001834  | 5200 | 291.21   | 316.163
            |         |          | Proposed  | 0.0002732   | 0.003917  | 4095 | 283.891  | 112.045
  Profile 3 | 7       | 10       | Base      | 0.000003592 | 0.0031320 | 3044 | 314.459  | 84.322
            |         |          | Proposed  | 0.00913784  | 0.01550   | 4108 | 281.432  | 114.249
  Profile 4 | 3       | 12       | Base      | 0.0000023   | 0.03093   | 4259 | 1143.458 | 69.098
            |         |          | Proposed  | 0.0008095   | 0.01197   | 4066 | 425.292  | 108.949
  • 44. CONTD.
  Profile 5 | 12      | 16       | Base      | 0.000015307 | 0.002802  | 5239 | 164.185  | 204.315
            |         |          | Proposed  | 0.001846    | 0.022107  | 4240 | 286.6    | 124.45
  Profile 6 | 46      | 14       | Base      | 0.000036771 | 0.0024045 | 4163 | 426.536  | 117.796
            |         |          | Proposed  | 0.0010386   | 0.01082   | 3171 | 44.416   | 105.881
  Profile 7 | 12      | 2        | Base      | 0.00021723  | 0.005321  | 3986 | 205.405  | 137.099
            |         |          | Proposed  | 0.0003971   | 0.007538  | 2739 | 426.411  | 100.124
  Profile 8 | 16      | 12       | Base      | 0.00010813  | 0.0028452 | 4136 | 426.987  | 75.338
            |         |          | Proposed  | 0.00478452  | 0.0093604 | 2863 | 122.479  | 114.748
  • 46. HARD-DISK UTILIZATION [Bar chart: hard-disk utilization (y-axis, 0 to 0.035) for the base algorithm vs. the proposed algorithm across job profiles]
  • 50. RESULTS FOR JOB PROFILE 1
  • 52. RESULTS FOR JOB PROFILE 2
  • 54. TRACE FOR EXECUTION  INFO GUISimulator:114 - <init>- done  Initialising...  INFO HTopology:112 - initGridSim- Initializing GridSim package  Initialising...  INFO HSimulator:64 - initSimulator- creat new Result dir /home/hadoop/workspace/work/hadoop.simulator/results/26-27- Apr-2010 19:57:55  INFO HJobTracker:311 - createEntities- create topology  INFO HJobTracker:314 - createEntities- config.Heartbeat:1.0, read topology.getName:rack 0  INFO HJobTracker:318 - createEntities- init NetEnd from rack  INFO GUISimulator:389 - mnuSimStartActionPerformed- simulator has started simulator  INFO HSimulator:106 - startSimulator- Starting simulator version  INFO HSimulator:117 - startSimulator- trace level200  INFO HSimulator:120 - startSimulator- graph file: /home/hadoop/workspace/work/hadoop.simulator/results/26-27-Apr- 2010 19:57:55/graph.sjg  INFO HSimulator:125 - startSimulator- going to call Sim_system.run()  Entities started.  Entity huser has no body().  INFO HJobTracker:129 - body- start entity  INFO SimoTreeCollector:94 - body- add rack {m1=m1}  INFO GUISimulator:394 - mnuSimStopActionPerformed- going to stop simulator  INFO HTopology:252 - stopSimulation- Stopping NetEnd Simulation
  • 55. TRACE CONTINUED…  INFO HJobTracker:622 - stopSimulation- send end of simualtion 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0  INFO CPU:148 - body- cpu_m1 END_OF_SIMULATION 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-0-inMemFSMergeThread END_OF_SIMULATION 10.0  INFO HDD:148 - body- hdd_m1 END_OF_SIMULATION 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0  INFO NetEnd:100 - body- m1 end simulation at time 10.0  INFO HTask:166 - body- m1-map-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-1 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-2 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-map-3 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-0 END_OF_SIMULATION 10.0  INFO HTask:166 - body- m1-reduce-1 END_OF_SIMULATION 10.0  INFO SimoTreeCollector:78 - body- simotree END_OF_SIMULATION 10.0  INFO InMemFSMergeThread:71 - body- m1-reduce-1-inMemFSMergeThread END_OF_SIMULATION 10.0
  • 56. OUTPUT SNAPSHOTS FOR PROPOSED ALGORITHM
  • 60. REFERENCES [1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting execution bottlenecks in map-reduce clusters", in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012. [2] R. Buyya, S. K. Garg, and R. N. Calheiros, "SLA-Oriented Resource Provisioning for Cloud Computing: Challenges, Architecture, and Solutions", in International Conference on Cloud and Service Computing, 2011. [3] S. Chaisiri, B.-S. Lee, and D. Niyato, "Optimization of Resource Provisioning Cost in Cloud Computing", IEEE Transactions on Services Computing, vol. 5, no. 2, Apr.–Jun. 2012. [4] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource Provisioning Framework for MapReduce Jobs with Performance Goals", in Middleware 2011, LNCS 7049, pp. 165–186, 2011. [5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Jan. 2008. [6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, "Resource Provisioning for Cloud Computing", in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009. [7] K. Kambatla, A. Pathak, and H. Pucha, "Towards Optimizing Hadoop Provisioning in the Cloud", in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009. [8] S. O. Kuyoro, F. Ibikunle, and O. Awodele, "Cloud Computing Security Issues and Challenges", International Journal of Computer Networks (IJCN), vol. 3, issue 5, 2011.
  • 61. [9] R. Lammel, "Google's MapReduce Programming Model – Revisited", Science of Computer Programming, Oct. 2007. [10] R. P. Padhy, "Big Data Processing with Hadoop-MapReduce in Cloud Systems", International Journal of Cloud Computing and Services Science, vol. 2, Feb. 2013. [11] B. Palanisamy, A. Singh, L. Liu, and B. Langston, "Cura: A Cost-Optimized Model for MapReduce in a Cloud", in Proc. of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2013). [12] A. Rasmussen, M. Conley, R. Kapoor, V. T. Lam, G. Porter, and A. Vahdat, "Themis: An I/O-Efficient MapReduce", Communications of the ACM, Oct. 2012. [13] V. K. Reddy, B. T. Rao, L. S. S. Reddy, and P. S. Kiran, "Research Issues in Cloud Computing", Global Journal of Computer Science and Technology, vol. 11, Jul. 2011. [14] T. Sandholm and K. Lai, "MapReduce Optimization Using Regulated Dynamic Prioritization", Social Computing Laboratory, Hewlett-Packard Laboratories, 2011. [15] F. Tian and K. Chen, "Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds", in Proc. of the 4th International Conference on Cloud Computing, IEEE, 2011. [16] Hadoop. http://hadoop.apache.org. [17] Amazon Elastic MapReduce. http://aws.amazon.com/elasticmapreduce/.