Scheduling Shared Scans of Large Data Files

Parag Agrawal (Stanford University), Daniel Kifer (Yahoo! Research), Christopher Olston (Yahoo! Research)
ABSTRACT

We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files.

As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand.
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. INTRODUCTION

As disk seeks become increasingly expensive relative to sequential access, data processing systems are being architected to favor bulk sequential scans of large files. Database, warehouse and mining systems have incorporated scan-centric access methods for a long time, but at the moment the most prominent example of scan-centric architectures is Map-Reduce [4]. Map-Reduce systems execute UDF-enhanced group-by programs over extremely large, distributed files. Other architectures in this space include Dryad [10] and River [1].

Large Map-Reduce installations handle tens of thousands of jobs daily, where a job consists of a scan of a large file accompanied by some processing and perhaps communication work. In many cases the processing is relatively light (e.g., count the number of times Britney Spears is mentioned on the web), and the communication is minimal (distributive and algebraic aggregation functions enable early aggregation on the Map side of the job, and the data transmitted to the Reduce side is small). Many jobs even disable the Reduce component, because they do not require global processing (e.g., generate a hash-based synopsis of every document in a large collection).

The execution time of these jobs is dominated by scanning the input file. If the number of unique input files is small relative to the number of daily jobs (e.g., in a search engine company many jobs process the web crawl, user click log, and search query log), then it is desirable to amortize the work of scanning one of these files across multiple jobs. Unfortunately, caching is not good enough because often these data sets are so large that they do not fit in memory, even if spread across a large cluster of machines.

Cooperative scans [6, 8, 21] can help here: multiple jobs that require scanning the same file can be executed simultaneously, with the scanning performed once and the scanned data fed into each job's processing component. The work on cooperative scans has focused on mechanisms to realize IO savings across multiple co-executing jobs. However there is another opportunity here: in the Map-Reduce context jobs tend to run for a long time, and users do not expect quick turnaround. It is acceptable to reorder pending jobs, within a reasonable limit on delaying individual jobs, if doing so can improve the total amount of useful work performed by the system.

In this paper we study how to schedule jobs that can benefit from shared scans over a common set of files. To our knowledge this scheduling problem has not been posed before. Existing scheduling techniques such as shortest-job-first do not necessarily work well in the presence of sharable jobs, and it is not obvious how to design ones that do work well. We illustrate these points via a series of informal examples (rigorous formal analysis follows).

1.1 Motivating Examples

Example 1

Suppose the system's work queue contains two pending jobs, J1 and J2, which are unrelated (i.e., they scan different files), and hence there is no benefit in executing them jointly. Therefore we execute them sequentially, and we must decide which one to execute first. We might consider executing them in order of arrival (FIFO), or perhaps in order of expected running time (a policy known as shortest-job-first scheduling, which aims for low average response time in nonsharable workloads).
If J1 arrived slightly earlier and has a slightly shorter execution time than J2, then both FIFO and shortest-job-first would schedule J1 first. This decision, which is made without taking sharing into account, seems reasonable because J1 and J2 are unrelated.

However, one might want to consider the fact that additional jobs may arrive in the queue while J1 and J2 are being executed. Since future jobs may be sharable with J1 or J2, they can influence the optimal execution order of J1 and J2. Even if one does not anticipate the exact arrival schedule of future jobs, a simple stochastic model of future job arrivals can influence the decision of which of J1 or J2 to execute first.

Suppose J1 scans file F1, and J2 scans file F2. Let λi denote the frequency with which jobs that scan Fi are submitted. In our example, if λ1 > λ2, then all else being equal it might make sense to schedule J2 first. While J2 is executing, new jobs that are sharable with J1 may arrive, permitting us to amortize J1's work across multiple jobs. This amortization of work, in turn, can lead to lower average job response times going forward. The schedule we produced by considering future job arrival rates differs from the one produced by FIFO and shortest-job-first.

Example 2

In a more subtle scenario, suppose instead that λ1 = λ2. Suppose F1 is 1 TB in size, and F2 is 10 TB. Assume each job's execution time is dominated by scanning the file. Hence, J2 takes about ten times as long to execute as J1.

Now, which one of J1 and J2 should we execute first? Perhaps J1 should be executed first because J2 can benefit more from sharing, and postponing J2's execution permits additional, sharable F2 jobs to accumulate in the queue. On the other hand, perhaps J2 ought to be executed first since it takes roughly ten times as long as J1, thereby allowing ten times as many F1 jobs to accumulate for future joint execution with J1.

Which of these opposing factors dominates in this case? How can we reason about these issues in general, in order to maximize system productivity or minimize average job response time?

1.2 Contributions and Outline

In this paper we formalize and study the problem of scheduling sharable jobs, using a combination of analytical and empirical techniques. We demonstrate that scheduling policies that work well in the traditional context of nonsharable jobs can yield poor schedules in the presence of sharing. We identify simple policies that do work well in the presence of sharing, and are robust to fluctuations in the workload such as bursts of job arrivals.

The remainder of this paper is structured as follows. We discuss related work in Section 2, and give our formal model of scheduling jobs with shared scans in Section 3. Then in Section 4 we derive a family of scheduling policies, which have some convenient properties that make them practical as we discuss in Section 5. We perform some initial empirical analysis of our policies in Section 6. Then in Section 7 we extend our family of policies to include hybrid ones that balance multiple scheduling objectives. We present our final empirical evaluation in Section 8.

2. RELATED WORK

We are not aware of any prior work that addresses the problem studied in this paper. That said, there is a tremendous amount of work, in both the database and scheduling theory communities, that is peripherally related. We survey this work below.

2.1 Database Literature

Prior work on cooperative scans [6, 8, 21] focused on mechanisms for sharing scans across jobs or queries that get executed at the same time. Our work is complementary: we consider how to schedule a queue of pending jobs to ensure that sharable jobs get executed together and can benefit from cooperative scan techniques.

Gupta et al. [7] study how to select an execution order for enqueued jobs, to maximize the chance that data cached on behalf of one job can be reused for a subsequent job. That work only takes into account jobs that are already in the queue, whereas our work focuses on scheduling in view of anticipated future jobs.

2.2 Scheduling Literature

Scheduling theory is a vast field with countless variations on the scheduling problem, including various performance metrics, machine environments (such as single machine, parallel machines, and shop), and constraints (such as release times, deadlines, precedence constraints, and preemption) [11]. Some of the earliest complexity results for scheduling problems are given in [13]. In particular, the problem of minimizing the sum of completion times on a single processor in the presence of release dates (i.e., job arrival times) is NP-hard. On the other hand, minimizing the maximum absolute or relative wait times can be done in polynomial time using the algorithm proposed in [12]. Both of these problems are special cases of the problem considered in this paper when all of the shared costs are zero.

In practice, the quality of a schedule depends on several factors (such as maximum completion time, average completion time, maximum earliness, maximum lateness). Optimizing schedules with respect to several performance metrics is known as multicriteria scheduling [9].

Online scheduling algorithms [18, 20] make scheduling decisions without knowledge of future jobs. In non-clairvoyant scheduling [16], the characteristics of the jobs (such as running time) are not known until the job finishes. Online algorithms are typically evaluated using competitive analysis [18, 20]: if C(I) is the cost of an online schedule on instance I and C_opt(I) is the cost of the optimal schedule, then the online algorithm is c-competitive if C(I) ≤ c · C_opt(I) + b for all instances I and for some constant b.

Divikaran and Saks [5] studied the online scheduling problem with setup times. In this scenario, jobs belong to job families and a setup cost is incurred whenever the processor switches between jobs of different families. For example, jobs in the same family can perform independent scans of the same file, in which case the setup cost is the time it takes to load a file into memory. The problem considered in this paper differs in two ways: all jobs executed in one batch have the same completion time since the scans occur concurrently instead of serially; also, once a batch has been processed, the next batch still has a shared cost even if it is from the same job family (for example, if the entire file does not fit into memory).

Stochastic scheduling [15] considers another variation on the scheduling problem: the processing time of a job is a random variable, usually with finite mean and variance, and typically only the distribution or some of its moments are known. Online versions of these problems for minimizing expected weighted completion time have also been considered [3, 14, 19] in cases where there is no sharing of work among jobs.
3. MODEL

Map-Reduce and related systems execute jobs on large clusters, over data files that are spread across many nodes (each node serves a dual storage and computation role). Large files (e.g., a web crawl, or a multi-day search query and result log) are spread across essentially all nodes, whereas smaller files may only occupy a subset of nodes. Correspondingly, jobs that access large files are spread onto the entire cluster, and jobs over small files generally only use a subset of nodes.

In this paper we focus on the issue of ordering jobs to maximize shared scans, rather than the issue of how to allocate data and jobs onto individual cluster nodes. Hence for the purpose of this paper we abstract away the per-node details and model the cluster as a single unit of storage and execution. For workloads dominated by large data sets and jobs that get spread across the full cluster, this abstraction is appropriate.

Our model of a data processing engine has two parts: an executor module that processes jobs, and an input queue that holds pending jobs. Each job Ji requires a scan over a (large) input file Fi, and performs some custom processing over the content of the file. Jobs can be categorized based on their input file into job families, where all jobs that access file Fi belong to family Fi. It is useful to think of the input queue as being divided into a set of smaller queues, one per job family, as shown in Figure 1.

[Figure 1: Model: input queues and job executor.]

The executor is capable of executing a batch of multiple jobs from the same family, in which case the input file is scanned once and each job's custom processing is applied over the stream of data generated by scanning the file. For simplicity we assume that one batch is executed at a time, although our techniques can easily be extended to the case of k simultaneous batches.

The time to execute a batch consisting of n jobs from family Fi equals t^s_i + n · t^n_i, where t^s_i represents the cost of scanning the input file Fi (i.e., the sharable execution cost), and t^n_i represents the custom processing cost incurred by each job (i.e., the nonsharable cost). We assume that t^s_i is large relative to t^n_i, i.e., the jobs are IO-bound as discussed in Section 1.

Given that t^s_i is the dominant cost, for simplicity we treat the nonshared execution cost t^n_i as being the same for all jobs in a batch, even though in reality each job may incur a different cost in its custom processing. We verify empirically in Section 6 that nonuniform within-batch processing costs do not throw off our results.

3.1 System Workload

For the purpose of our analysis we model job arrival as a stationary process (in Section 8.2.2 we study the effect of bursty job arrivals empirically). In our model, for each job family Fi, jobs arrive according to a Poisson process with rate parameter λi.

Obviously, a high enough aggregate job arrival rate can overwhelm a given system, regardless of the scheduling policy. To reason about what job workload a system is capable of handling, it is instructive to consider what happens if jobs are executed in extremely large batches. In the asymptote, as batch sizes approach infinity, the t^n values dominate and the t^s values become insignificant, so system load converges to Σ_i λi · t^n_i. If this quantity exceeds the system's intrinsic processing capacity, then it is impossible to keep queue lengths from growing without bound, and the system can never "catch up" with pending work under any scheduling regime. Hence we impose a workload feasibility condition:

    asymptotic load = Σ_i λi · t^n_i < 1

3.2 Scheduling Objectives

The performance metric we use in this paper is average perceived wait time. The perceived wait time (PWT) of job J is the difference between the system's response time in handling J, and the minimum possible response time t(J). (Response time is the total delay between submission and completion of a job.)

As stated in Section 1, the class of systems we consider is geared toward maximizing overall system productivity, rather than committing to response time targets for individual jobs. This stance would seem to suggest optimizing for system throughput. However, in our context maximizing throughput means maximizing batch sizes, which leads to indefinite job wait times. While these systems may find it acceptable to delay some jobs in order to improve overall throughput, it does not make sense to delay all jobs.

Optimizing for average PWT still gives an incentive to batch multiple jobs together when the sharing opportunity is large (thereby improving throughput), but not so much that the queues grow indefinitely. Furthermore, PWT seems like an appropriate metric because it corresponds to users' end-to-end view of system performance. Informally, average PWT can be thought of as an indicator of how unhappy users are, on average, due to job processing delays. Another consideration is the maximum PWT across all jobs, which indicates how unhappy the least happy user is.

Our aim is to minimize average PWT, while keeping maximum PWT from being excessively high. We focus on steady-state behavior, rather than a fixed time period such as one day, to avoid knapsack-style tactics that "squeeze" short jobs in at the end of the period. Knapsack-style behavior only makes sense in the context of real-time scheduling, which is not a concern in the class of systems we study.
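To make the cost model and the feasibility condition concrete, here is a minimal sketch (ours, not from the paper; Python, with made-up numbers):

    # Sketch (ours, not from the paper): the batch cost model of Section 3
    # and the workload feasibility condition of Section 3.1.

    def batch_time(ts, tn, n):
        # One shared scan of the input file, plus per-job custom processing.
        return ts + n * tn

    # (ts_i, tn_i, lambda_i) per job family; values are illustrative only.
    families = [(10.0, 1.0, 0.2), (100.0, 2.0, 0.05), (1.0, 0.1, 3.0)]

    # As batches grow, the scan cost is amortized away and only
    # sum(lambda_i * tn_i) remains; it must stay below 1.
    load = sum(lam * tn for ts, tn, lam in families)
    assert load < 1, "infeasible: queues grow without bound"

    print(batch_time(10.0, 1.0, 5))   # 15.0 = one scan + 5 jobs' processing
    print("asymptotic load =", load)  # 0.2*1 + 0.05*2 + 3*0.1 = 0.6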
For a given job J, PWT can either be measured on an absolute scale as the difference between the system's response time and the minimum possible response time (e.g., 10 minutes), or on a relative scale as the ratio of the system's response time to the minimum possible response time (e.g., 1.5 × t(J)). (Relative PWT is also known as stretch [17].)

The space of PWT metric variants is shown in Figure 2. For convenience we adopt the abbreviations AA, MA, AR and MR to refer to the four variants.

[Figure 2: Ways to measure perceived wait time.]

3.3 Scheduling Policy

A scheduling policy is an online algorithm that is (re)invoked each time the executor becomes idle. Upon invocation, the policy leaves the executor idle for some period of time (possibly zero time), and then removes a nonempty subset of jobs from the input queue, packages them into an execution batch, and submits the batch to the executor.

In this paper, to simplify our analysis we impose two very reasonable restrictions on our scheduling policies:

• No idle. If the input queue is nonempty, do not leave the executor idle. Given the stochastic nature of job arrivals, this policy seems appropriate.

• Always share. Whenever a job family Fi is scheduled for execution, all enqueued jobs from family Fi are included in the execution batch. While it is true that if t^n > t^s, one achieves lower average absolute PWT by scheduling jobs sequentially instead of in a batch, in this paper we assume t^s > t^n, as stated above. If t^s > t^n it is always beneficial to form large batches, in terms of average absolute PWT of jobs in the batch. In all cases, large batches reduce the wait time of jobs outside the batch that are executed afterward.

4. BASIC SCHEDULING POLICIES

We derive scheduling policies aimed at minimizing each of average absolute PWT (Section 4.1) and maximum absolute PWT (Section 4.2). [Footnote 1: We tried deriving policies that directly aim to minimize relative PWT, but the resulting policies did not perform well, perhaps due to breakdowns in the approximation schemes used to derive the policies.]

The notation we use in this section is summarized in Table 1.

    symbol   meaning
    Fi       ith job family
    t^s_i    sharable execution time for Fi jobs
    t^n_i    nonsharable execution time for Fi jobs
    λi       arrival rate of Fi jobs
    Bi       theoretical batch size for Fi
    ti       theoretical time to execute one Fi batch
    Ti       theoretical scheduling period for Fi
    fi       theoretical processing fraction for Fi
    ωi       perceived wait time for Fi jobs
    Pi       scheduling priority of Fi
    B̂i       queue length for Fi
    T̂i       waiting time of oldest enqueued Fi job

    Table 1: Notation.

4.1 Average Absolute PWT

If there is no sharing, low average absolute PWT is achieved via shortest-job-first (SJF) scheduling and its variants. (In a stochastic setting, the generalization of SJF is asymptotically optimal [3].) We generalize SJF to the case of sharable jobs as follows.

Let Pi denote the scheduling priority of family Fi. If there is no sharing, SJF sets Pi equal to the time to complete one job. If there is sharing, then we let Pi equal the average per-job execution time of a job batch. Suppose B̂i is the number of enqueued jobs in family Fi, in other words, the current batch size for Fi. Then the total time to execute a batch is t^s_i + B̂i · t^n_i. The average per-job execution time is (t^s_i + B̂i · t^n_i)/B̂i, which gives us the SJF scheduling priority:

    SJF Policy: Pi = −(t^s_i / B̂i + t^n_i)

Unfortunately, as we demonstrate empirically in Section 6, SJF does not work well in the presence of sharing. To understand why, consider a simple example with two job families:

    F1: t^s_1 = 1, t^n_1 = 0, λ1 = a
    F2: t^s_2 = a, t^n_2 = 0, λ2 = 1

for some constant a > 1.

In this scenario, F2 jobs have long execution time (t^s_2 = a) so SJF schedules F2 infrequently: once every a² time units, on expectation. The average perceived wait time under this schedule is O(a) due to holding back F2 jobs a long time between batches. A policy that is aware of the fact that F2 jobs are relatively rare (λ2 = 1) would elect to schedule F2 more often, and schedule F1 less often but in much larger batches. In fact, a policy that schedules F2 every a^(3/2) time units achieves an average PWT of only O(a^(1/2)). For large a, SJF performs very poorly in comparison.
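As an illustration, the following sketch (ours; the numbers are made up) computes the generalized SJF priority and shows how, even with a longer queue, F2's large scan keeps its priority low, so SJF keeps postponing it:

    # Sketch (ours): generalized SJF priority; the highest-priority family
    # runs next, taking its whole queue as one batch.
    def sjf_priority(ts, tn, queue_len):
        # Negated average per-job time of the batch, so "shortest" = highest.
        return -(ts / queue_len + tn)

    # The two-family example with a = 4 (ts1=1, ts2=a, tn=0 for both).
    a = 4.0
    print(sjf_priority(ts=1.0, tn=0.0, queue_len=8))  # F1: -0.125
    print(sjf_priority(ts=a, tn=0.0, queue_len=2))    # F2: -2.0, held back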
Since SJF does not always produce good schedules in the presence of sharing, we begin from first principles. Unfortunately, as discussed in Section 2.2, solving even the nonshared scheduling problem exactly is NP-hard. Hence, to make our problem tractable we consider a relaxed version of the problem, find an optimal solution to the relaxed problem, and apply this solution to the original problem.

4.1.1 Relaxation 1

In our initial, simple relaxation, each job family (each queue in Figure 1) has a dedicated executor. The total work done by all executors, in steady state, is constrained to be less than or equal to the total work performed by the one executor in the original problem. Furthermore, rather than discrete jobs, in our relaxation we treat jobs as continuously arriving, infinitely divisible units of work.

In steady state, an optimal schedule will exhibit periodic behavior: for each job family Fi, wait until Bi jobs have arrived on the queue and execute those Bi jobs as a batch.
Given the arrival rate λi, on expectation a new batch is executed every Ti = Bi/λi time units. A batch takes time ti = t^s_i + Bi · t^n_i to complete. The fraction of time Fi's executor is in use (rather than idle) is fi = ti/Ti.

We arrive at the following optimization problem:

    minimize Σ_i λi · ωi^AA    subject to    Σ_i fi ≤ 1

where ωi^AA is the average absolute PWT for jobs in Fi.

There are two factors that contribute to the PWT of a newly-arrived job: (1) the delay until the next batch is formed, and (2) the fact that a batch of size Bi takes longer to finish than a singleton batch. The expected value of Factor 1 is Ti/2. Factor 2 equals (Bi − 1) · t^n_i. Overall,

    ωi^AA = Ti/2 + (Bi − 1) · t^n_i

We solve the above optimization problem using the method of Lagrange Multipliers. In the optimal solution the following quantity is constant across all job families Fi:

    (Bi² / (λi · t^s_i)) · (1 + 2 · λi · t^n_i)

Given the λ, t^s and t^n values, one can select batch sizes (B values) accordingly.
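Concretely, solving the invariant for Bi gives Bi = sqrt(C · λi · t^s_i / (1 + 2 · λi · t^n_i)) for a workload-wide constant C. The sketch below (ours, with illustrative numbers) computes batch sizes this way and checks the resulting utilization Σ fi:

    # Sketch (ours): batch sizes from the Relaxation 1 invariant.  C would
    # be tuned (e.g., by bisection) until total utilization reaches 1.
    from math import sqrt

    def batch_size(lam, ts, tn, C):
        return sqrt(C * lam * ts / (1.0 + 2.0 * lam * tn))

    def utilization(lam, ts, tn, C):
        B = batch_size(lam, ts, tn, C)
        T = B / lam          # scheduling period T_i = B_i / lambda_i
        t = ts + B * tn      # batch execution time t_i
        return t / T         # f_i = t_i / T_i

    families = [(0.2, 10.0, 1.0), (0.05, 100.0, 2.0)]  # (lambda, ts, tn)
    C = 50.0                                           # illustrative constant
    print([round(batch_size(*f, C), 2) for f in families])      # [8.45, 14.43]
    print(round(sum(utilization(*f, C) for f in families), 2))  # 0.88 <= 1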
                                                                                                                P          n
                                                                    Recall the workload feasibility condition      i λi · ti < 1
4.1.2 Relaxation 2

Unfortunately, the optimal solution to Relaxation 1 can differ substantially from the optimal solution to the original problem. Consider the simple two-family example we presented earlier in Section 4.1. The optimal policy under Relaxation 1 schedules job families in a round-robin fashion, yielding an average PWT of O(a). Once again this result is much worse than the achievable O(a^(1/2)) value we discussed earlier.

Whereas SJF errs by scheduling F2 too infrequently, the optimal Relaxation 1 policy errs in the other direction: it schedules F2 too frequently. Doing so causes F1 jobs to wait behind F2 batches too often, hurting average wait time.

The problem is that Relaxation 1 reduces the original scheduling problem to a resource allocation problem. Under Relaxation 1, the only interaction among job families is the fact that they must share the overall processing time (Σ_i fi ≤ 1). In reality, resource allocation is not the only important consideration. We must also take into account the fact that the execution batches must be serialized into a single sequential schedule and executed on a single executor. When a long-running batch is executed, other batches must wait for a long time.

Consider a job family Fi, for which a batch of size Bi is executed once every Ti time units. Whenever an Fi batch is executed, the following contributions to PWT occur:

• In-batch jobs. The Bi Fi jobs in the current batch are delayed by (Bi − 1) · t^n_i time units each, for a total of D1 = Bi · (Bi − 1) · t^n_i time units.

• New jobs. Jobs that arrive while the Fi batch is being executed are delayed. The expected number of such jobs is ti · Σ_j λj. The delay incurred to each one is ti/2 on average, making the overall delay incurred to other new jobs equal to

    D2 = (ti²/2) · Σ_j λj

• Old jobs. Jobs that are already in the queue when the Fi batch is executed are also delayed. Under Relaxation 1, the expected number of such jobs is Σ_{j≠i} (Tj · λj)/2. The delay incurred to each one is ti, making the overall delay incurred to other in-queue jobs equal to

    D3 = (ti/2) · Σ_{j≠i} (Tj · λj)

The total delay imposed on other jobs per unit time is proportional to (1/Ti) · (D1 + D2 + D3). If we minimize the sum of this quantity across all families Fi, again subject to the resource utilization constraint Σ_i fi ≤ 1 using the Lagrange method, we obtain the following invariant across job families:

    Bi²/(λi · t^s_i) − t^s_i · Σ_j λj + (Bi²/(λi · t^s_i)) · (λi · t^n_i) · (t^n_i · Σ_j λj + 1)

The scheduling policy resulting from this invariant does achieve the hoped-for O(a^(1/2)) average PWT in our example two-family scenario.

4.1.3 Implementation and Intuition

Recall the workload feasibility condition Σ_i λi · t^n_i < 1 from Section 3.1. If the executor's load is spread across a large number of job families, then for each Fi, λi · t^n_i is small. Hence, it is reasonable to drop the terms involving λi · t^n_i from our above formulae, yielding the following simplified invariants [Footnote 2: There are also practically-motivated reasons to drop terms involving t^n, as we discuss in Section 5. In Section 6 we give empirical justification for dropping the t^n terms.]:

• Relaxation 1 result: For all job families Fi, the following quantity is equal:

    Bi² / (λi · t^s_i)

• Relaxation 2 result: For all job families Fi, the following quantity is equal:

    Bi² / (λi · t^s_i) − t^s_i · Σ_j λj

A simple way to translate these statements into implementable policies is as follows: Assign a numeric priority Pi to each job family Fi. Every time the executor becomes idle, schedule the family with the highest priority, as a single batch of B̂i jobs, where B̂i denotes the queue length for family Fi. If we are in steady state, then B̂i should roughly equal Bi. This observation suggests the following priority values for the scheduling policies implied by Relaxations 1 and 2, respectively:

    AA Policy 1: Pi = B̂i² / (λi · t^s_i)

    AA Policy 2: Pi = B̂i² / (λi · t^s_i) − t^s_i · Σ_j λj
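A sketch (ours, with illustrative numbers) of the two priority formulas as code; the winning family's whole queue becomes the next execution batch:

    # Sketch (ours): AA Policy 1 and AA Policy 2.  Bhat is the current
    # queue length; lam is the (estimated) arrival rate.

    def aa_policy_1(Bhat, lam, ts):
        return Bhat ** 2 / (lam * ts)

    def aa_policy_2(Bhat, lam, ts, total_lam):
        # The subtractive term penalizes long scans when overall traffic
        # is high, letting short batches go first (an SJF-like effect).
        return Bhat ** 2 / (lam * ts) - ts * total_lam

    families = {"F1": dict(Bhat=6, lam=2.0, ts=1.0),
                "F2": dict(Bhat=2, lam=0.5, ts=4.0)}
    total = sum(f["lam"] for f in families.values())
    p1 = {k: aa_policy_1(**v) for k, v in families.items()}
    p2 = {k: aa_policy_2(**v, total_lam=total) for k, v in families.items()}
    print(p1, p2)
    print(max(p2, key=p2.get))  # "F1": schedule its whole queue next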
These formulae have a fairly simple intuitive explanation. First, if many new jobs with a high degree of sharing are expected to arrive in the future (λi · t^s_i in the denominator, which we refer to as the sharability of family Fi), we should postpone execution of Fi and allow additional jobs to accumulate into the same batch, so as to achieve greater sharing with little extra waiting. On the other hand, as the number of enqueued jobs becomes large (B̂i² in the numerator), the execution priority increases quadratically, which eventually forces the execution of a batch from family Fi to avoid imposing excessive delay on the enqueued jobs.

Policy 2 has an extra subtractive term, which penalizes long batches (i.e., ones with large t^s) if the overall rate of arrival of jobs is high (i.e., high Σ_j λj). Doing so allows short batches to execute ahead of long batches, in the spirit of shortest-job-first.

For singleton job families (families with just one job), t^s_i = 0 and the priority value Pi goes to infinity. Hence nonsharable jobs are to be scheduled ahead of sharable ones. The intuition is that nonsharable jobs cannot be beneficially coexecuted with future jobs, so we might as well execute them right away. If there are multiple nonsharable jobs, ties can be broken according to shortest-job-first.

4.2 Maximum Absolute PWT

Here, instead of optimizing for average absolute PWT, we optimize for the maximum. We again adopt a relaxation of the original problem that assumes parallel executors and infinitely divisible work. Under the relaxation, the objective function is:

    minimize max_i ωi^MA

where ωi^MA is the maximum absolute PWT for Fi jobs.

As stated in Section 4.1.1 there are two factors that contribute to the PWT of a newly-arrived job: (1) the delay until the next batch is formed, and (2) the fact that a batch of size Bi takes longer to finish than a singleton batch. The maximum values of these factors are Ti and (Bi − 1) · t^n_i, respectively. Overall,

    ωi^MA = Ti + (Bi − 1) · t^n_i

or, written differently:

    ωi^MA = Ti · (1 + λi · t^n_i) − t^n_i

In the optimal solution ωi^MA is constant across all job families Fi. The intuition behind this result is that if one of the ωi^MA values is larger than the others, we can decrease it somewhat by increasing the other ωi^MA values, thereby reducing the maximum PWT. Hence in the optimal solution all ωi^MA values are equal.

4.2.1 Implementation and Intuition

As justified in Section 4.1.3, we drop terms involving λi · t^n_i from our ω^MA formula and obtain ω^MA ≈ Ti − t^n_i. As stated in Section 3, we assume the t^n values to be a small component of the overall job execution times, so we also drop the −t^n term and arrive at the approximation ω^MA ≈ Ti.

Let T̂i denote the waiting time of the oldest enqueued Fi job, which should roughly equal Ti in steady state. We use T̂i as the basis for our priority-based scheduling policy:

    MA Policy (FIFO): Pi = T̂i

This policy can be thought of as FIFO applied to job family batches, since it schedules the family of the job that has been waiting the longest.
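A minimal sketch (ours; queue contents and times are invented) of the dispatch loop of Section 3.3 under the MA (FIFO) priority Pi = T̂i:

    # Sketch (ours): pick the family whose oldest job has waited longest,
    # and batch every enqueued job of that family ("always share").
    # queues maps family -> arrival times of enqueued jobs, oldest first.

    def next_batch(queues, now):
        nonempty = {f: jobs for f, jobs in queues.items() if jobs}
        if not nonempty:
            return None  # "no idle" only constrains us when work is pending
        fam = max(nonempty, key=lambda f: now - nonempty[f][0])
        batch = queues[fam]
        queues[fam] = []     # take the whole queue as one batch
        return fam, batch

    queues = {"F1": [3.0, 7.0], "F2": [1.0]}
    print(next_batch(queues, now=10.0))  # ('F2', [1.0]): waited 9 time units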
5. PRACTICAL CONSIDERATIONS

The scheduling policies we derived in Section 4 rely on several parameters related to job execution cost and job arrival rates. In this section we explain how these parameters can be obtained in practice.

Robust cost estimation: The fact that we were able to drop the nonsharable execution time t^n from our scheduling priority formulae not only keeps them simple, it also means that the scheduler does not need to estimate this quantity. In practice, estimating the full execution time of a job accurately can be difficult, especially in the Map-Reduce context in which processing is specified via opaque user-defined functions. (In Section 6 we verify empirically that the performance of our policies is not sensitive to whether the factors involving t^n are included.)

Our formulae do require estimates of the sharable execution time t^s, i.e., the IO cost of scanning the input file. For large files, this cost is nearly linearly proportional to the size of the input file, a quantity that is easy to obtain from system metadata. (The proportionality constant can be dropped, as linear scaling of the t^s values does not affect our priority-based scheduling policies.)

Dynamic estimation of arrival rates: Some of our priority formulae contain λ values, which denote job arrival rates. Under the Poisson model of arrival, one can estimate the λ values dynamically, by keeping a time-decayed count of arrivals. In this way the arrival rate estimates (λ values) automatically adjust as the workload shifts over time. (See Section 6.1 for details.)

6. BASIC EXPERIMENTS

In this section we present experiments that:

• Justify ignoring the nonsharable execution time component t^n in our scheduling policies (Section 6.2).

• Compare our scheduling policy variants empirically (Section 6.3).

(We compare our policies against baseline policies in Section 8.)

6.1 Experimental Setup

We built a simulator and a workload generator. Our workload consists of 100 job families. For each job family, the sharable cost t^s is generated from the heavy-tailed distribution 1 + |X|, where X is a Cauchy random variable. For greater realism, the nonsharable cost t^n is assigned on a per-job basis, rather than a per-family basis as in our model in Section 3. In our default workload, each time a job arrives, we select a nonshared cost randomly as follows: with probability 0.6, t^n = 0.1 · t^s; with probability 0.2, t^n = 0.2 · t^s; and with probability 0.2, t^n = 0.3 · t^s. (The scenario we focus on in this paper is one in which the shared cost dominates, because it represents IO and jobs tend to be IO-bound, as discussed in Section 3.) In some of our experiments we deviate from this default workload and study what happens when t^n tends to be larger than t^s.
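A generator sketch in this spirit (ours; it uses the standard inverse-CDF Cauchy sampler and treats the mixture weights above as given):

    # Sketch (ours): workload generation in the spirit of Section 6.1.
    import math, random

    def shared_cost():
        # 1 + |X| with X standard Cauchy, via the inverse-CDF sampler.
        return 1.0 + abs(math.tan(math.pi * (random.random() - 0.5)))

    def nonshared_cost(ts):
        # Per-job: 0.1*ts w.p. 0.6, 0.2*ts w.p. 0.2, 0.3*ts w.p. 0.2.
        return ts * random.choices([0.1, 0.2, 0.3], weights=[0.6, 0.2, 0.2])[0]

    def arrival_times(lam, horizon):
        # Poisson process: exponential inter-arrival gaps with mean 1/lam.
        t, times = 0.0, []
        while True:
            t += random.expovariate(lam)
            if t >= horizon:
                return times
            times.append(t)

    ts = shared_cost()
    print(ts, nonshared_cost(ts), arrival_times(0.01, 500.0)[:3])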
Job arrival events are generated using the standard homogeneous Poisson point process [2]. Each job family Fi has an arrival parameter λi which represents the expected number of jobs that arrive in one unit of time. There are 500,000 units of time in each run of the experiments. The λi values are initially chosen from a Pareto distribution with parameter α = 1.9 and then are rescaled so that Σ_i λi · E[t^n_i] = load. The total asymptotic system load (Σ_i λi · t^n_i) is 0.5 by default.

Some of our scheduling policies require estimation of the job arrival rate λi. To do this, we maintain an estimate Ii of the difference in the arrival times of the next two jobs in family Fi. We adjust Ii as new job arrivals occur, by taking a weighted average of our previous estimate Ii and Ai, the difference in arrival times of the two most recent jobs from Fi. Formally, the update step is Ii ← 0.05 · Ai + 0.95 · Ii. Given Ii and the time t since the last arrival of a job in Fi, we estimate λi as 1/Ii if t < Ii, and as 1/(0.05 · t + 0.95 · Ii) otherwise.
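The estimator, written out as a sketch (ours; the 0.05/0.95 constants are as given above):

    # Sketch (ours): time-decayed arrival-rate estimation per job family.
    class RateEstimator:
        def __init__(self, initial_gap):
            self.I = initial_gap   # decayed estimate of the inter-arrival gap
            self.last = None       # time of the most recent arrival

        def observe(self, arrival_time):
            if self.last is not None:
                A = arrival_time - self.last       # newest observed gap
                self.I = 0.05 * A + 0.95 * self.I  # I <- 0.05*A + 0.95*I
            self.last = arrival_time

        def estimate(self, now):
            t = now - self.last
            if t < self.I:
                return 1.0 / self.I
            # A long silence is itself evidence that the rate has dropped.
            return 1.0 / (0.05 * t + 0.95 * self.I)

    est = RateEstimator(initial_gap=10.0)
    for at in (0.0, 8.0, 21.0):
        est.observe(at)
    print(round(est.estimate(now=25.0), 4))  # ~0.0995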
6.2 Influence of Nonshared Execution Time

In our first set of experiments, we measure how knowledge of t^n affects our scheduling policies. Recall that in Sections 4.1.3 and 4.2.1 we dropped t^n from the priority formulae, on the grounds that the factors involving t^n are small relative to other factors. To validate ignoring t^n in our scheduling policies, we compare t^n-aware variants (which use the full formulae with t^n values) against the t^n-ignorant variants presented in Sections 4.1.3 and 4.2.1. (The t^n-aware variants are given knowledge of the precise t^n value of each job instance in the queue.)

[Figure 3: t^n-awareness versus t^n-ignorance for AA Policy 2. (AA PWT versus shared cost divisor, for the t^n-ignorant and t^n-aware variants.)]

[Figure 4: t^n-awareness versus t^n-ignorance for MA Policy. (MA PWT versus shared cost divisor, for the t^n-ignorant and t^n-aware variants.)]

Figures 3 and 4 plot the performance of the t^n-aware and t^n-ignorant variants of our policies (AA Policy 2 and MA Policy, respectively) as we vary the magnitude of the shared cost (keeping the t^n distribution and λ values fixed). In both graphs, the y-axis plots the metric the policy is tuned to optimize (AA PWT and MA PWT, respectively). The x-axes plot the shared cost divisor, which is the factor by which we divided all shared costs. When the shared cost divisor is large (e.g., 100), the t^s values become quite small relative to the t^n values, on average.

Even when nonshared costs are large relative to shared costs (right-hand side of Figures 3 and 4), t^n-awareness has little impact on performance. Hence from this point forward we only consider the simpler, t^n-ignorant variants of our policies.

6.3 Comparison of Policy Variants

6.3.1 Relaxation 1 versus Relaxation 2

We now turn to a comparison of AA Policy 1 versus AA Policy 2 (recall that these are based on Relaxation 1 (Section 4.1.1) and Relaxation 2 (Section 4.1.2) of the original AA PWT minimization problem, respectively). Figure 5 shows that the two variants exhibit nearly identical performance, even as we vary the skew in the shared cost (t^s) distribution among job families (here there are five job families Fi with shared cost t^s_i = i^α, where α is the skew parameter).

[Figure 5: AA Policy 1 versus AA Policy 2, varying shared cost skew. (AA PWT versus shared cost skew, for each policy with known and with estimated λ.)]
   However, if we introduce the invariant that λi · tsi (which represents the “sharability” of jobs in family Fi; see Section 4.1.3) remain constant across all job families Fi, a different picture emerges. Figure 6 shows the result of varying the shared cost skew, as we hold λi · tsi constant across job families. (Here there are two job families: ts1 = λ1 = 1, and ts2 = 1/λ2 = skew parameter (x-axis).) In this case, we see a clear difference in performance between the policies based on the two relaxations, with the one based on Relaxation 2 (AA Policy 2) performing much better.
   Overall, it appears that AA Policy 2 dominates AA Policy 1, as expected. As to whether the case in which AA Policy 2 performs significantly better than AA Policy 1 is likely to occur in practice, we do not know. Clearly, using AA Policy 2 is the safest option, and besides it is not much more complex to implement than AA Policy 1.

Figure 7: Relative effectiveness of different priority formula variants.
6.3.2    Use of Different Estimators
   Recall that our AA Policies 1 and 2 (Section 4.1.3) have a Bi^2/λi term. In the model assumed by Relaxation 1, using the equivalence Bi = Ti · λi, we can rewrite this term in four different ways: Bi^2/λi (using batch size), Ti^2 · λi (using waiting time), Bi · Ti (the geometric mean of the two previous options), and max[Bi^2/λi, Ti^2 · λi].
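The four forms are algebraically interchangeable under Relaxation 1's assumption Bi = Ti · λi, as this small sketch (ours, with illustrative names) makes explicit:

```python
def priority_term_variants(B, T, lam):
    """Four rewritings of the Bi^2/lambda_i priority term.

    When Relaxation 1's model Bi = Ti * lambda_i holds exactly, all
    four coincide; with estimated lambda and real workloads they
    diverge, which is what Figure 7 measures.
    """
    return {
        "batch size":     B * B / lam,   # Bi^2 / lambda_i
        "waiting time":   T * T * lam,   # Ti^2 * lambda_i
        "geometric mean": B * T,         # sqrt((Bi^2/lam) * (Ti^2 * lam))
        "max":            max(B * B / lam, T * T * lam),
    }
```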
   In Figure 7 we compare these variants, and also compare using the true λ values versus using an online estimator for λ as described in Section 6.1. We used a more skewed nonshared cost (tn) distribution than in our other experiments, to get a clear separation of the variants. In particular we used: with probability 0.6, tn = 0.1 · ts; with probability 0.2, tn = 0.2 · ts; with probability 0.1, tn = 0.5 · ts; with probability 0.1, tn = 1.0 · ts. We generated 20 sample workloads, and for each workload we computed the best AA PWT among the policy variants. For each policy variant, Figure 7 plots the fraction of times the policy variant had an AA PWT that was more than 3% worse than the best AA PWT for each workload. The result is that the variant that uses Bi^2/λi (the form given in Section 4.1.3) clearly outperforms the rest. Furthermore, estimating the arrival rates (λ values) works fine, compared to knowing them in advance via an oracle.

6.4    Summary of Findings
   The findings from our basic experiments are:
   • Estimating the arrival rates (λ values) online, as opposed to knowing them from an oracle, does not hurt performance.
   • It is not necessary to incorporate tn estimates into the priority functions.
   • AA Policy 2 (which is based on Relaxation 2) dominates AA Policy 1 (based on Relaxation 1).
   From this point forward, we use tn-ignorant AA Policy 2 with online λ estimation.

7.    HYBRID SCHEDULING POLICIES
   The quality of a scheduling policy is generally evaluated using several criteria [9], and so optimizing for either the average or maximum perceived wait time, as in Section 4, may be too extreme. If we optimize solely for the average, there may be certain jobs with very high PWT. Conversely, if we optimize solely for the maximum, we end up punishing the majority of jobs in order to help a few outlier jobs. In practice it may make more sense to optimize for a combination of average and maximum PWT. A simple approach is to optimize for a linear combination of the two:

      min Σi [ α · ωi^AA + (1 − α) · ωi^MA ]

where ω^AA denotes average absolute PWT and ω^MA denotes maximum absolute PWT. The parameter α ∈ [0, 1] denotes the relative importance of having low average PWT versus low maximum PWT.
   We apply the methods used in Section 4 to the hybrid optimization objective, resulting in the following policy:

      Hybrid Policy:  Pi = α · (1 / (2 · Σj λj)) · [ Bi^2 / (λi · tsi) − tsi · Σj λj ] + xi · (1 − α) · Ti^2 / tsi

where xi = 1 if Ti = maxj Tj, and xi = 0 otherwise.
   The hybrid policy degenerates to the nonhybrid policies of Section 4 if we set α = 0 or α = 1. For intermediate values of α, job families receive the same relative priority as they would under the average PWT regime, except the family that has been waiting the longest (i.e., the one with xi = 1), which gets an extra boost in priority. This “extra boost” reduces the maximum wait time, while raising the average wait time a bit.
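As an illustration of how this priority might be computed at each scheduling decision, consider the following sketch (our code, not the authors' implementation; the field names B, T, lam, and ts are illustrative, and we assume, as with the policies of Section 4, that the family with the highest priority is scheduled next):

```python
def hybrid_priorities(families, alpha):
    """Hybrid Policy priorities, one per job family.

    Each family is a dict with keys: B (batch size), T (waiting time
    of its oldest job), lam (estimated arrival rate), ts (shared cost).
    """
    total_rate = sum(f["lam"] for f in families)   # sum_j lambda_j
    longest_wait = max(f["T"] for f in families)
    priorities = []
    for f in families:
        # Average-PWT term from AA Policy 2.
        avg_term = (f["B"] ** 2 / (f["lam"] * f["ts"])
                    - f["ts"] * total_rate) / (2.0 * total_rate)
        # x_i = 1 only for the family that has waited the longest.
        x = 1.0 if f["T"] == longest_wait else 0.0
        priorities.append(alpha * avg_term
                          + x * (1.0 - alpha) * f["T"] ** 2 / f["ts"])
    return priorities
```

With α near 1, the average-PWT term dominates except for the single longest-waiting family, which receives the maximum-PWT boost described above.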
8.    FURTHER EXPERIMENTS
   We are now ready for further experiments. In particular we study:
   • The behavior of our hybrid policy (Section 8.1).
   • The performance of our policies compared to baseline policies (Section 8.2.1).
   • The ability to cope with large bursts of job arrivals (Section 8.2.2).
Figure 8: Hybrid Policy performance on average and maximum absolute PWT, as we vary the hybrid parameter α.

Figure 10: Policy performance on AA PWT metric, as job arrival rates increase (both SJF variants shown).

Figure 9: Hybrid Policy performance on average and maximum relative PWT, as we vary α.
8.1    Hybrid Policy
   Figure 8 shows the performance of our Hybrid Policy (Section 7), in terms of both average and maximum absolute PWT. Figure 9 shows the same thing, but for relative PWT. In both graphs the x-axis plots the hybrid parameter α (this axis is not on a linear scale, for the purpose of presentation). The decreasing curve plots average PWT, whose scale is on the left-hand y-axis; the increasing curve plots maximum PWT, whose scale is on the right-hand y-axis.
   With α = 0, the hybrid policy behaves like the MA Policy (FIFO), which achieves low maximum PWT at the expense of very high average PWT. On the other extreme, with α = 1 it behaves like the AA Policy, which achieves low average PWT but very high maximum PWT. Using intermediate values of α trades off the two objectives. In both the absolute and relative cases, a good balance is achieved at approximately α = 0.99: maximum PWT is only slightly higher than with α = 0, and average PWT is only slightly higher than with α = 1.
   Basically, when configured with α = 0.99, the Hybrid Policy mimics the AA Policy most of the time, but makes an exception if it notices that one job has been waiting for a very long time.
8.2    Comparison Against Baselines
   In the following experiments, we compare the policies AA Policy 2, MA Policy (FIFO), and the Hybrid Policy with α = 0.99 against two generalizations of shortest-job-first (SJF). The policy “Aware SJF” is the one given in Section 4.1: it knows the nonshared cost of jobs in its queue, and chooses the job family for which it can execute the greatest number of jobs per unit of time (i.e., the family that minimizes (batch execution cost)/B). By a simple interchange argument it can be shown that this policy is optimal for the case when jobs have stopped arriving. The policy “Oblivious SJF” does not know the nonshared cost of jobs, and so it chooses the family for which ts/B is minimized. This policy is optimal for the case when jobs have stopped arriving and the nonshared costs are small. A sketch of both baselines appears below.
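The two baselines can be sketched as follows (our illustration; the dict fields are hypothetical, and the assumption that a family's batch execution cost is its one shared scan cost plus the queued jobs' nonshared costs is ours):

```python
def aware_sjf(families):
    """'Aware SJF': pick the family minimizing (batch execution cost)/B.

    Assumption (ours): executing a family's batch costs its shared
    scan cost ts plus the nonshared costs of all queued jobs.
    """
    def cost_per_job(f):
        batch_cost = f["ts"] + sum(f["tn_values"])
        return batch_cost / len(f["tn_values"])   # B = number of queued jobs
    return min(families, key=cost_per_job)


def oblivious_sjf(families):
    """'Oblivious SJF': nonshared costs unknown, so minimize ts/B."""
    return min(families, key=lambda f: f["ts"] / len(f["tn_values"]))
```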
   In these experiments we tested how these policies are affected by the total load placed on the system. (Recall from Section 3.1 that asymptotic load = Σi λi · tni.) To vary load, we started with workloads with asymptotic load = 0.1, and then caused load to increase by various increments, in one of two ways: (1) increase the nonshared costs (tn values), or (2) increase the job arrival rates (λ values). In both cases, all other workload parameters are held constant.
   In Section 8.2.1 we report results for the case where job arrivals are generated by a homogeneous Poisson point process. In Section 8.2.2 we report results under bursty arrivals.

8.2.1    Stationary Workloads
   In Figure 10 we plot AA PWT as the job arrival rate, and thus total system load, increases. It is clear that Aware SJF has terrible performance. The reason is as follows: In our workload generator, expected nonshared costs are proportional to shared costs (e.g., the cost of a CPU scan of the file is roughly proportional to its size on disk). Hence, Aware SJF has a very strong preference for job families with small shared cost (essentially ignoring the batch size), which leads to starvation of ones with large shared cost.
   In the rest of our experiments we drop Aware SJF, so we can focus on the performance differences among the other policies. Figure 11 is the same as Figure 10, with Aware SJF removed and the y-axis re-scaled. Here we see that AA Policy 2 and the Hybrid Policy outperform both FIFO and SJF, especially at higher loads.
   In Figure 12 we show the corresponding graph with MA PWT on the y-axis. Here, as expected, FIFO and the Hybrid Policy perform very well.
   Figures 13 and 14 show the corresponding plots for the case where load increases due to a rise in nonshared cost.

Figure 11: Policy performance on AA PWT metric, as job arrival rates increase.

Figure 13: Policy performance on AA PWT metric, as nonshared costs increase.

Figure 12: Policy performance on MA PWT metric, as job arrival rates increase.

Figure 14: Policy performance on MA PWT metric, as nonshared costs increase.
These graphs are qualitatively similar to Figures 11 and 12, but the differences among the scheduling policies are less pronounced.
   Figures 15, 16, 17 and 18 are the same as Figures 11, 12, 13 and 14, respectively, but with the y-axis measuring relative PWT. If we are interested in minimizing relative PWT, our policies, which aim to minimize absolute PWT, do not necessarily do as well as SJF. Devising policies that specifically optimize for relative PWT is an important topic of future work.
8.2.2    Bursty Workloads
   To model bursty job arrival behavior we use two different Poisson processes for each job family. One Poisson process corresponds to a low arrival rate and the other corresponds to an arrival rate that is ten times as fast. We switch between these processes using a Markov process: after a job arrives, we switch states (from high arrival rate to low arrival rate, or vice versa) with probability 0.05, and stay in the same state with probability 0.95. The initial probability of either state is the stationary distribution of this process (i.e., with probability 0.5 we start with a high arrival rate). The expected number of jobs coming from bursts is the same as the expected number of jobs not coming from bursts. If λi is the arrival rate for the non-burst process, then the effective arrival rate (number of jobs per second) asymptotically equals 20λi/11: half of all arrivals occur in each state, so the expected time per arrival is (1/2)(1/λi) + (1/2)(1/(10λi)) = 11/(20λi), whose reciprocal is 20λi/11. Thus the load is Σi E[λi] · E[tni].
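For concreteness, a generator for this arrival model might look like the following sketch (ours; the function name and seeding are illustrative):

```python
import random

def bursty_arrivals(lam, n_jobs, seed=None):
    """Generate n_jobs arrival times from the burst model above:
    two Poisson processes at rates lam and 10*lam, switching state
    with probability 0.05 after each arrival.
    """
    rng = random.Random(seed)
    high = rng.random() < 0.5           # start in either state w.p. 0.5
    t, times = 0.0, []
    for _ in range(n_jobs):
        rate = 10 * lam if high else lam
        t += rng.expovariate(rate)      # exponential inter-arrival gap
        times.append(t)
        if rng.random() < 0.05:         # switch burst state
            high = not high
    return times
```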
   In Figures 19 and 20 we show the average and maximum absolute PWT, respectively, for bursty job arrivals as load increases via increasing nonshared costs. Here, SJF slightly outperforms our policies on AA PWT, but our Hybrid Policy performs well on both average and maximum PWT.
   Figure 21 shows average absolute PWT as the job arrival rate increases, while keeping the nonshared cost distribution constant. Here AA Policy 2 and Hybrid slightly outperform SJF.
   To visualize the temporal behavior in the presence of bursts, Figure 22 shows a moving average of absolute PWT on the y-axis, with time plotted on the x-axis. This time series is a sample realization of the experiment that produced Figure 19, with load = 0.7.
   Since our policies focus on exploiting job arrival rate (λ) estimates, it is not surprising that under extremely bursty workloads, where there is no semblance of a steady-state λ, they do not perform as well relative to the baselines as under stationary workloads (Section 8.2.1). However, it is reassuring that our Hybrid Policy does not perform noticeably worse than shortest-job-first, even under these extreme conditions.
Figure 15: Policy performance on AR PWT metric, as job arrival rates increase.

Figure 17: Policy performance on AR PWT metric, as nonshared costs increase.

Figure 16: Policy performance on MR PWT metric, as job arrival rates increase.

Figure 18: Policy performance on MR PWT metric, as nonshared costs increase.

8.3    Summary of Findings
   The findings from our experiments on the absolute PWT metric, which our policies are designed to optimize, are:
   • Our MA Policy (a generalization of FIFO to shared workloads) is the best policy on maximum PWT, but performs poorly on average PWT, as expected.
   • Our Hybrid Policy, if properly tuned, achieves a “sweet spot” in balancing average and maximum PWT, and is able to perform quite well on both.
   • With stationary workloads, our Hybrid Policy substantially outperforms the better of two generalizations of shortest-job-first to shared workloads.
   • With extremely bursty workloads, our Hybrid Policy performs on par with shortest-job-first.
9.    SUMMARY
   In this paper we studied how to schedule jobs that can share scans over a common set of input files. The goal is to amortize expensive file scans across many jobs, but without unduly hurting individual job response times.
   Our approach builds a simple stochastic model of job arrivals for each input file, and takes into account anticipated future jobs while scheduling jobs that are currently enqueued. The main idea is as follows: If an enqueued job J requires scanning a large file F, and we anticipate the near-term arrival of additional jobs that also scan F, then it may make sense to delay J if it has not already waited too long and other, less sharable, jobs are available to run.
   We formalized the problem and derived a simple and effective scheduling policy, under the objective of minimizing perceived wait time (PWT) for completion of user jobs. Our policy can be tuned for average PWT, maximum PWT, or a combination of the two objectives. Compared with the baseline shortest-job-first and FIFO policies, which do not account for future sharing opportunities, our policies achieve significantly lower perceived wait time. This means that users' jobs will generally complete earlier under our scheduling policies.

10.    REFERENCES
[1] R. H. Arpaci-Dusseau. Run-time adaptation in River. ACM Trans. on Computing Systems, 21(1):36–86, Feb. 2003.
[2] P. Billingsley. Probability and Measure. John Wiley & Sons, Inc., New York, 3rd edition, 1995.
[3] M. C. Chou, H. Liu, M. Queyranne, and D. Simchi-Levi. On the asymptotic optimality of a simple on-line algorithm for the stochastic single-machine weighted completion time problem and its extensions. Operations Research, 54(3):464–474, 2006.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.
[5] S. Divakaran and M. Saks. Online scheduling with release times and set-ups. Technical Report 2001-50, DIMACS, 2001.
[6] P. M. Fernandez. Red brick warehouse: A read-mostly RDBMS for open SMP platforms. In Proc. ACM SIGMOD, 1994.
Figure 19: Policy performance on AA PWT metric, as nonshared costs increase, with bursty job arrivals.

Figure 21: Policy performance on AA PWT metric, as arrival rates increase, with bursty job arrivals.


Figure 20: Policy performance on MA PWT metric, as nonshared costs increase, with bursty job arrivals.

Figure 22: Performance over time, with bursty job arrivals.
[7] A. Gupta, S. Sudarshan, and S. Vishwanathan. Query scheduling in multiquery optimization. In International Symposium on Database Engineering and Applications (IDEAS), 2001.
[8] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A simultaneously pipelined relational query engine. In Proc. ACM SIGMOD, 2005.
[9] H. Hoogeveen. Multicriteria scheduling. European Journal of Operational Research, 167(3):592–623, 2005.
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. European Conference on Computer Systems (EuroSys), 2007.
[11] D. Karger, C. Stein, and J. Wein. Scheduling algorithms. In M. J. Atallah, editor, Handbook of Algorithms and Theory of Computation. CRC Press, 1997.
[12] E. L. Lawler. Optimal sequencing of a single machine subject to precedence constraints. Management Science, 19(5):544–546, 1973.
[13] J. Lenstra, A. R. Kan, and P. Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1:343–362, 1977.
[14] N. Megow, M. Uetz, and T. Vredeveld. Models and algorithms for stochastic online scheduling. Mathematics of Operations Research, 31(3), 2006.
[15] R. H. Möhring, F. J. Radermacher, and G. Weiss. Stochastic scheduling problems I – general strategies. Mathematical Methods of Operations Research, 28(7):193–260, 1984.
[16] R. Motwani, S. Phillips, and E. Torng. Non-clairvoyant scheduling. In Proc. SODA Conference, pages 422–431, 1993.
[17] S. Muthukrishnan, R. Rajaraman, A. Shaheen, and J. E. Gehrke. Online scheduling to minimize average stretch. In Proc. FOCS Conference, 1999.
[18] K. Pruhs, J. Sgall, and E. Torng. Online scheduling, chapter 15. Handbook of Scheduling: Algorithms, Models, and Performance Analysis. Chapman & Hall/CRC, 2004.
[19] A. S. Schulz. New old algorithms for stochastic scheduling. In Algorithms for Optimization with Incomplete Information, Dagstuhl Seminar Proceedings, 2005.
[20] J. Sgall. Online scheduling – a survey. In On-Line Algorithms, Lecture Notes in Computer Science. Springer-Verlag, 1997.
[21] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative scans: Dynamic bandwidth sharing in a DBMS. In Proc. VLDB Conference, 2007.

Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Recently uploaded (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

vldb08a

tend to run for a long time, and users do not expect quick turnaround. It is acceptable to reorder pending jobs, within a reasonable limit on delaying individual jobs, if doing so can improve the total amount of useful work performed by the system.

1. INTRODUCTION

As disk seeks become increasingly expensive relative to sequential access, data processing systems are being architected to favor bulk sequential scans of large files. Database, warehouse and mining systems have incorporated scan-centric access methods for a long time, but at the moment the most prominent example of scan-centric architectures is Map-Reduce [4]. Map-Reduce systems execute UDF-enhanced group-by programs over extremely large, distributed files. Other architectures in this space include Dryad [10] and River [1].

Large Map-Reduce installations handle tens of thousands of jobs daily, where a job consists of a scan of a large file accompanied by some processing and perhaps communication work. In many cases the processing is relatively light (e.g., count the number of times Britney Spears is mentioned on the web).

In this paper we study how to schedule jobs that can benefit from shared scans over a common set of files. To our knowledge this scheduling problem has not been posed before.
Existing scheduling techniques such as shortest-job-first do not necessarily work well in the presence of sharable jobs, and it is not obvious how to design ones that do work well. We illustrate these points via a series of informal examples (rigorous formal analysis follows).

(Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB '08, August 24-30, 2008, Auckland, New Zealand. Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.)

1.1 Motivating Examples

Example 1

Suppose the system's work queue contains two pending jobs, $J_1$ and $J_2$, which are unrelated (i.e., they scan different files), and hence there is no benefit in executing them jointly. Therefore we execute them sequentially, and we must decide which one to execute first. We might consider executing them in order of arrival (FIFO), or perhaps in order of expected running time (a policy known as shortest-job-first scheduling, which aims for low average response time in nonsharable workloads). If $J_1$ arrived slightly earlier and has a slightly shorter execution time than $J_2$, then both FIFO and shortest-job-first would schedule $J_1$ first. This decision, which is made without taking sharing into account, seems reasonable because $J_1$ and $J_2$ are unrelated.
However, one might want to consider the fact that additional jobs may arrive in the queue while $J_1$ and $J_2$ are being executed. Since future jobs may be sharable with $J_1$ or $J_2$, they can influence the optimal execution order of $J_1$ and $J_2$. Even if one does not anticipate the exact arrival schedule of future jobs, a simple stochastic model of future job arrivals can influence the decision of which of $J_1$ or $J_2$ to execute first.

Suppose $J_1$ scans file $F_1$, and $J_2$ scans file $F_2$. Let $\lambda_i$ denote the frequency with which jobs that scan $F_i$ are submitted. In our example, if $\lambda_1 > \lambda_2$, then all else being equal it might make sense to schedule $J_2$ first. While $J_2$ is executing, new jobs that are sharable with $J_1$ may arrive, permitting us to amortize $J_1$'s work across multiple jobs. This amortization of work, in turn, can lead to lower average job response times going forward. The schedule we produced by considering future job arrival rates differs from the one produced by FIFO and shortest-job-first.

Example 2

In a more subtle scenario, suppose instead that $\lambda_1 = \lambda_2$. Suppose $F_1$ is 1 TB in size, and $F_2$ is 10 TB. Assume each job's execution time is dominated by scanning the file. Hence, $J_2$ takes about ten times as long to execute as $J_1$. Now, which one of $J_1$ and $J_2$ should we execute first?

Perhaps $J_1$ should be executed first because $J_2$ can benefit more from sharing, and postponing $J_2$'s execution permits additional, sharable $F_2$ jobs to accumulate in the queue. On the other hand, perhaps $J_2$ ought to be executed first since it takes roughly ten times as long as $J_1$, thereby allowing ten times as many $F_1$ jobs to accumulate for future joint execution with $J_1$.

Which of these opposing factors dominates in this case? How can we reason about these issues in general, in order to maximize system productivity or minimize average job response time?

1.2 Contributions and Outline

In this paper we formalize and study the problem of scheduling sharable jobs, using a combination of analytical and empirical techniques. We demonstrate that scheduling policies that work well in the traditional context of nonsharable jobs can yield poor schedules in the presence of sharing. We identify simple policies that do work well in the presence of sharing, and are robust to fluctuations in the workload such as bursts of job arrivals.

The remainder of this paper is structured as follows. We discuss related work in Section 2, and give our formal model of scheduling jobs with shared scans in Section 3. Then in Section 4 we derive a family of scheduling policies, which have some convenient properties that make them practical, as we discuss in Section 5. We perform some initial empirical analysis of our policies in Section 6. Then in Section 7 we extend our family of policies to include hybrid ones that balance multiple scheduling objectives. We present our final empirical evaluation in Section 8.
2. RELATED WORK

We are not aware of any prior work that addresses the problem studied in this paper. That said, there is a tremendous amount of work, in both the database and scheduling theory communities, that is peripherally related. We survey this work below.

2.1 Database Literature

Prior work on cooperative scans [6, 8, 21] focused on mechanisms for sharing scans across jobs or queries that get executed at the same time. Our work is complementary: we consider how to schedule a queue of pending jobs to ensure that sharable jobs get executed together and can benefit from cooperative scan techniques.

Gupta et al. [7] study how to select an execution order for enqueued jobs, to maximize the chance that data cached on behalf of one job can be reused for a subsequent job. That work only takes into account jobs that are already in the queue, whereas our work focuses on scheduling in view of anticipated future jobs.

2.2 Scheduling Literature

Scheduling theory is a vast field with countless variations on the scheduling problem, including various performance metrics, machine environments (such as single machine, parallel machines, and shop), and constraints (such as release times, deadlines, precedence constraints, and preemption) [11]. Some of the earliest complexity results for scheduling problems are given in [13]. In particular, the problem of minimizing the sum of completion times on a single processor in the presence of release dates (i.e., job arrival times) is NP-hard. On the other hand, minimizing the maximum absolute or relative wait time can be done in polynomial time using the algorithm proposed in [12]. Both of these problems are special cases of the problem considered in this paper when all of the shared costs are zero.

In practice, the quality of a schedule depends on several factors (such as maximum completion time, average completion time, maximum earliness, maximum lateness). Optimizing schedules with respect to several performance metrics is known as multicriteria scheduling [9].

Online scheduling algorithms [18, 20] make scheduling decisions without knowledge of future jobs. In non-clairvoyant scheduling [16], the characteristics of the jobs (such as running time) are not known until the job finishes. Online algorithms are typically evaluated using competitive analysis [18, 20]: if $C(I)$ is the cost of an online schedule on instance $I$ and $C_{opt}(I)$ is the cost of the optimal schedule, then the online algorithm is $c$-competitive if $C(I) \le c \cdot C_{opt}(I) + b$ for all instances $I$ and for some constant $b$.

Divikaran and Saks [5] studied the online scheduling problem with setup times. In this scenario, jobs belong to job families and a setup cost is incurred whenever the processor switches between jobs of different families. For example, jobs in the same family can perform independent scans of the same file, in which case the setup cost is the time it takes to load a file into memory. The problem considered in this paper differs in two ways: all jobs executed in one batch have the same completion time, since the scans occur concurrently instead of serially; also, once a batch has been processed, the next batch still has a shared cost even if it is from the same job family (for example, if the entire file does not fit into memory).

Stochastic scheduling [15] considers another variation on the scheduling problem: the processing time of a job is a random variable, usually with finite mean and variance, and typically only the distribution or some of its moments are known. Online versions of these problems for minimizing expected weighted completion time have also been considered [3, 14, 19] in cases where there is no sharing of work among jobs.
3. MODEL

Map-Reduce and related systems execute jobs on large clusters, over data files that are spread across many nodes (each node serves a dual storage and computation role). Large files (e.g., a web crawl, or a multi-day search query and result log) are spread across essentially all nodes, whereas smaller files may only occupy a subset of nodes. Correspondingly, jobs that access large files are spread onto the entire cluster, and jobs over small files generally only use a subset of nodes.

In this paper we focus on the issue of ordering jobs to maximize shared scans, rather than the issue of how to allocate data and jobs onto individual cluster nodes. Hence for the purpose of this paper we abstract away the per-node details and model the cluster as a single unit of storage and execution. For workloads dominated by large data sets and jobs that get spread across the full cluster, this abstraction is appropriate.

Our model of a data processing engine has two parts: an executor module that processes jobs, and an input queue that holds pending jobs. Each job $J_i$ requires a scan over a (large) input file $F_i$, and performs some custom processing over the content of the file. Jobs can be categorized based on their input file into job families, where all jobs that access file $F_i$ belong to family $F_i$. It is useful to think of the input queue as being divided into a set of smaller queues, one per job family, as shown in Figure 1.

[Figure 1: Model: input queues and job executor.]

The executor is capable of executing a batch of multiple jobs from the same family, in which case the input file is scanned once and each job's custom processing is applied over the stream of data generated by scanning the file. For simplicity we assume that one batch is executed at a time, although our techniques can easily be extended to the case of $k$ simultaneous batches.

The time to execute a batch consisting of $n$ jobs from family $F_i$ equals $t_i^s + n \cdot t_i^n$, where $t_i^s$ represents the cost of scanning the input file $F_i$ (i.e., the sharable execution cost), and $t_i^n$ represents the custom processing cost incurred by each job (i.e., the nonsharable cost). We assume that $t_i^s$ is large relative to $t_i^n$, i.e., the jobs are IO-bound as discussed in Section 1. Given that $t_i^s$ is the dominant cost, for simplicity we treat the nonshared execution cost $t_i^n$ as being the same for all jobs in a batch, even though in reality each job may incur a different cost in its custom processing. We verify empirically in Section 6 that nonuniform within-batch processing costs do not throw off our results.

3.1 System Workload

For the purpose of our analysis we model job arrival as a stationary process (in Section 8.2.2 we study the effect of bursty job arrivals empirically). In our model, for each job family $F_i$, jobs arrive according to a Poisson process with rate parameter $\lambda_i$.

Obviously, a high enough aggregate job arrival rate can overwhelm a given system, regardless of the scheduling policy. To reason about what job workload a system is capable of handling, it is instructive to consider what happens if jobs are executed in extremely large batches. In the asymptote, as batch sizes approach infinity, the $t^n$ values dominate and the $t^s$ values become insignificant, so system load converges to $\sum_i \lambda_i \cdot t_i^n$. If this quantity exceeds the system's intrinsic processing capacity, then it is impossible to keep queue lengths from growing without bound, and the system can never "catch up" with pending work under any scheduling regime. Hence we impose a workload feasibility condition:

$$\text{asymptotic load} = \sum_i \lambda_i \cdot t_i^n < 1$$
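The condition is straightforward to check for a given workload. Here is a minimal Python sketch, with made-up family parameters (not taken from the paper), that computes the asymptotic load and tests feasibility:

```python
# Sketch: checking the workload feasibility condition (hypothetical numbers).
# Each family is (lambda_i, t_n_i): arrival rate and nonsharable per-job cost.
families = [
    (2.0, 0.10),  # F1: 2 jobs per time unit, 0.10 time units of custom work
    (0.5, 0.40),  # F2
    (1.0, 0.25),  # F3
]

asymptotic_load = sum(lam * t_n for lam, t_n in families)
print(f"asymptotic load = {asymptotic_load:.2f}")

# Only workloads with asymptotic load < 1 can ever be kept up with.
assert asymptotic_load < 1, "infeasible: queues grow without bound"
```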
3.2 Scheduling Objectives

The performance metric we use in this paper is average perceived wait time. The perceived wait time (PWT) of job $J$ is the difference between the system's response time in handling $J$ and the minimum possible response time $t(J)$. (Response time is the total delay between submission and completion of a job.)

As stated in Section 1, the class of systems we consider is geared toward maximizing overall system productivity, rather than committing to response time targets for individual jobs. This stance would seem to suggest optimizing for system throughput. However, in our context maximizing throughput means maximizing batch sizes, which leads to indefinite job wait times. While these systems may find it acceptable to delay some jobs in order to improve overall throughput, it does not make sense to delay all jobs.

Optimizing for average PWT still gives an incentive to batch multiple jobs together when the sharing opportunity is large (thereby improving throughput), but not so much that the queues grow indefinitely. Furthermore, PWT seems like an appropriate metric because it corresponds to users' end-to-end view of system performance. Informally, average PWT can be thought of as an indicator of how unhappy users are, on average, due to job processing delays. Another consideration is the maximum PWT across all jobs, which indicates how unhappy the least happy user is.

Our aim is to minimize average PWT, while keeping maximum PWT from being excessively high. We focus on steady-state behavior, rather than a fixed time period such as one day, to avoid knapsack-style tactics that "squeeze" short jobs in at the end of the period. Knapsack-style behavior only makes sense in the context of real-time scheduling, which is not a concern in the class of systems we study.

For a given job $J$, PWT can either be measured on an absolute scale, as the difference between the system's response time and the minimum possible response time (e.g., 10 minutes), or on a relative scale, as the ratio of the system's response time to the minimum possible response time (e.g., $1.5 \times t(J)$). (Relative PWT is also known as stretch [17].) The space of PWT metric variants is shown in Figure 2. For convenience we adopt the abbreviations AA, MA, AR and MR to refer to the four variants.

[Figure 2: Ways to measure perceived wait time.]
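As a concrete illustration of the four variants, the sketch below computes each metric for a handful of hypothetical completed jobs (the records are invented for illustration; $t(J)$ is written t_min):

```python
# Sketch: computing the four PWT metric variants (AA, MA, AR, MR).
# Each record is (submit_time, finish_time, t_min), where t_min is the
# minimum possible response time t(J). Values are hypothetical.
jobs = [
    (0.0, 12.0, 10.0),
    (1.0, 20.0,  4.0),
    (3.0, 25.0, 20.0),
]

absolute = [(finish - submit) - t_min for submit, finish, t_min in jobs]
relative = [(finish - submit) / t_min for submit, finish, t_min in jobs]

print("AA:", sum(absolute) / len(absolute))  # average absolute PWT
print("MA:", max(absolute))                  # maximum absolute PWT
print("AR:", sum(relative) / len(relative))  # average relative PWT
print("MR:", max(relative))                  # maximum relative PWT
```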
3.3 Scheduling Policy

A scheduling policy is an online algorithm that is (re)invoked each time the executor becomes idle. Upon invocation, the policy leaves the executor idle for some period of time (possibly zero time), and then removes a nonempty subset of jobs from the input queue, packages them into an execution batch, and submits the batch to the executor.

In this paper, to simplify our analysis we impose two very reasonable restrictions on our scheduling policies:

• No idle. If the input queue is nonempty, do not leave the executor idle. Given the stochastic nature of job arrivals, this policy seems appropriate.

• Always share. Whenever a job family $F_i$ is scheduled for execution, all enqueued jobs from family $F_i$ are included in the execution batch. While it is true that if $t^n > t^s$ one achieves lower average absolute PWT by scheduling jobs sequentially instead of in a batch, in this paper we assume $t^s > t^n$, as stated above. If $t^s > t^n$ it is always beneficial to form large batches, in terms of average absolute PWT of jobs in the batch. In all cases, large batches reduce the wait time of jobs outside the batch that are executed afterward.
4. BASIC SCHEDULING POLICIES

We derive scheduling policies aimed at minimizing each of average absolute PWT (Section 4.1) and maximum absolute PWT (Section 4.2). (We also tried deriving policies that directly aim to minimize relative PWT, but the resulting policies did not perform well, perhaps due to breakdowns in the approximation schemes used to derive the policies.) The notation we use in this section is summarized in Table 1; hatted symbols denote observed queue state, as opposed to their theoretical counterparts.

Table 1: Notation.
  $F_i$ : $i$th job family
  $t_i^s$ : sharable execution time for $F_i$ jobs
  $t_i^n$ : nonsharable execution time for $F_i$ jobs
  $\lambda_i$ : arrival rate of $F_i$ jobs
  $B_i$ : theoretical batch size for $F_i$
  $t_i$ : theoretical time to execute one $F_i$ batch
  $T_i$ : theoretical scheduling period for $F_i$
  $f_i$ : theoretical processing fraction for $F_i$
  $\omega_i$ : perceived wait time for $F_i$ jobs
  $P_i$ : scheduling priority of $F_i$
  $\hat{B}_i$ : queue length for $F_i$
  $\hat{T}_i$ : waiting time of oldest enqueued $F_i$ job

4.1 Average Absolute PWT

If there is no sharing, low average absolute PWT is achieved via shortest-job-first (SJF) scheduling and its variants. (In a stochastic setting, the generalization of SJF is asymptotically optimal [3].) We generalize SJF to the case of sharable jobs as follows.

Let $P_i$ denote the scheduling priority of family $F_i$. If there is no sharing, SJF sets $P_i$ equal to the time to complete one job. If there is sharing, then we let $P_i$ equal the average per-job execution time of a job batch. Suppose $\hat{B}_i$ is the number of enqueued jobs in family $F_i$, in other words the current batch size for $F_i$. Then the total time to execute a batch is $t_i^s + \hat{B}_i \cdot t_i^n$. The average per-job execution time is $(t_i^s + \hat{B}_i \cdot t_i^n)/\hat{B}_i$, which gives us the SJF scheduling priority:

$$\text{SJF Policy:} \quad P_i = -\left(\frac{t_i^s}{\hat{B}_i} + t_i^n\right)$$

Unfortunately, as we demonstrate empirically in Section 6, SJF does not work well in the presence of sharing. To understand why, consider a simple example with two job families:

$F_1$: $t_1^s = 1$, $t_1^n = 0$, $\lambda_1 = a$
$F_2$: $t_2^s = a$, $t_2^n = 0$, $\lambda_2 = 1$

for some constant $a > 1$.

In this scenario, $F_2$ jobs have long execution time ($t_2^s = a$), so SJF schedules $F_2$ infrequently: once every $a^2$ time units, on expectation. The average perceived wait time under this schedule is $O(a)$, due to holding back $F_2$ jobs a long time between batches. A policy that is aware of the fact that $F_2$ jobs are relatively rare ($\lambda_2 = 1$) would elect to schedule $F_2$ more often, and schedule $F_1$ less often but in much larger batches. In fact, a policy that schedules $F_2$ every $a^{3/2}$ time units achieves an average PWT of only $O(a^{1/2})$. For large $a$, SJF performs very poorly in comparison.
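The generalized SJF priority is simple to implement. The following sketch replays the two-family example with illustrative (invented) queue lengths, showing how SJF keeps favoring the cheap, frequent family:

```python
# Sketch: generalized SJF priority (negated average per-job batch time).
def sjf_priority(t_s, t_n, queue_len):
    if queue_len == 0:
        return float("-inf")  # nothing to run
    return -(t_s / queue_len + t_n)

a = 100.0
# (t_s, t_n, queue length); queue lengths are illustrative, not derived.
f1 = (1.0, 0.0, 50)  # F1: cheap scan, frequent arrivals, many enqueued jobs
f2 = (a,   0.0, 1)   # F2: expensive scan, rare arrivals

print("P(F1) =", sjf_priority(*f1))  # about -0.02: chosen again and again
print("P(F2) =", sjf_priority(*f2))  # about -100: F2 is starved
```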
Since SJF does not always produce good schedules in the presence of sharing, we begin from first principles. Unfortunately, as discussed in Section 2.2, solving even the nonshared scheduling problem exactly is NP-hard. Hence, to make our problem tractable we consider a relaxed version of the problem, find an optimal solution to the relaxed problem, and apply this solution to the original problem.

4.1.1 Relaxation 1

In our initial, simple relaxation, each job family (each queue in Figure 1) has a dedicated executor. The total work done by all executors, in steady state, is constrained to be less than or equal to the total work performed by the one executor in the original problem. Furthermore, rather than discrete jobs, in our relaxation we treat jobs as continuously arriving, infinitely divisible units of work.

In steady state, an optimal schedule will exhibit periodic behavior: for each job family $F_i$, wait until $B_i$ jobs have arrived on the queue and execute those $B_i$ jobs as a batch. Given the arrival rate $\lambda_i$, on expectation a new batch is executed every $T_i = B_i/\lambda_i$ time units. A batch takes time $t_i = t_i^s + B_i \cdot t_i^n$ to complete. The fraction of time $F_i$'s executor is in use (rather than idle) is $f_i = t_i/T_i$. We arrive at the following optimization problem:

$$\min \sum_i \lambda_i \cdot \omega_i^{AA} \quad \text{subject to} \quad \sum_i f_i \le 1$$

where $\omega_i^{AA}$ is the average absolute PWT for jobs in $F_i$.

There are two factors that contribute to the PWT of a newly-arrived job: (1) the delay until the next batch is formed, and (2) the fact that a batch of size $B_i$ takes longer to finish than a singleton batch. The expected value of Factor 1 is $T_i/2$. Factor 2 equals $(B_i - 1) \cdot t_i^n$. Overall,

$$\omega_i^{AA} = \frac{T_i}{2} + (B_i - 1) \cdot t_i^n$$

We solve the above optimization problem using the method of Lagrange Multipliers. In the optimal solution the following quantity is constant across all job families $F_i$:

$$\frac{B_i^2 \cdot (1 + 2 \cdot \lambda_i \cdot t_i^n)}{\lambda_i \cdot t_i^s}$$

Given the $\lambda$, $t^s$ and $t^n$ values, one can select batch sizes ($B$ values) accordingly.
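One way to select batch sizes from the invariant is sketched below: fix the constant $c$ shared by all families, solve the invariant for each $B_i$, and grow $c$ until the utilization constraint is satisfied. The family parameters are hypothetical, and the doubling loop merely finds a feasible $c$ (a bisection would find the smallest one):

```python
# Sketch: deriving Relaxation-1 batch sizes from the invariant
# B_i^2 (1 + 2 lambda_i t_i^n) / (lambda_i t_i^s) = c.
from math import sqrt

families = [(2.0, 5.0, 0.10), (0.5, 9.0, 0.40)]  # (lambda, t_s, t_n)

def batch_sizes(c):
    return [sqrt(c * lam * ts / (1 + 2 * lam * tn)) for lam, ts, tn in families]

def utilization(c):
    # sum of f_i = t_i / T_i, with t_i = t_s + B_i t_n and T_i = B_i / lambda_i
    return sum((ts + B * tn) * lam / B
               for (lam, ts, tn), B in zip(families, batch_sizes(c)))

c = 1.0
while utilization(c) > 1.0:  # bigger c -> bigger batches -> lower utilization
    c *= 2.0
print("c =", c, " B =", [round(B, 2) for B in batch_sizes(c)])
```

The loop terminates for any workload that satisfies the feasibility condition of Section 3.1, since utilization decreases toward $\sum_i \lambda_i \cdot t_i^n$ as the batches grow.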
4.1.2 Relaxation 2

Unfortunately, the optimal solution to Relaxation 1 can differ substantially from the optimal solution to the original problem. Consider the simple two-family example we presented earlier in Section 4.1. The optimal policy under Relaxation 1 schedules job families in a round-robin fashion, yielding an average PWT of $O(a)$. Once again this result is much worse than the achievable $O(a^{1/2})$ value we discussed earlier.

Whereas SJF errs by scheduling $F_2$ too infrequently, the optimal Relaxation 1 policy errs in the other direction: it schedules $F_2$ too frequently. Doing so causes $F_1$ jobs to wait behind $F_2$ batches too often, hurting average wait time.

The problem is that Relaxation 1 reduces the original scheduling problem to a resource allocation problem. Under Relaxation 1, the only interaction among job families is the fact that they must share the overall processing time ($\sum_i f_i \le 1$). In reality, resource allocation is not the only important consideration. We must also take into account the fact that the execution batches must be serialized into a single sequential schedule and executed on a single executor. When a long-running batch is executed, other batches must wait for a long time.

Consider a job family $F_i$, for which a batch of size $B_i$ is executed once every $T_i$ time units. Whenever an $F_i$ batch is executed, the following contributions to PWT occur:

• In-batch jobs. The $B_i$ $F_i$ jobs in the current batch are delayed by $(B_i - 1) \cdot t_i^n$ time units each, for a total of $D_1 = B_i \cdot (B_i - 1) \cdot t_i^n$ time units.

• New jobs. Jobs that arrive while the $F_i$ batch is being executed are delayed. The expected number of such jobs is $t_i \cdot \sum_j \lambda_j$. The delay incurred to each one is $t_i/2$ on average, making the overall delay incurred to other new jobs equal to $D_2 = \frac{t_i^2}{2} \cdot \sum_j \lambda_j$.

• Old jobs. Jobs that are already in the queue when the $F_i$ batch is executed are also delayed. Under Relaxation 1, the expected number of such jobs is $\sum_{j \neq i} (T_j \cdot \lambda_j)/2$. The delay incurred to each one is $t_i$, making the overall delay incurred to other in-queue jobs equal to $D_3 = \frac{t_i}{2} \cdot \sum_{j \neq i} (T_j \cdot \lambda_j)$.

The total delay imposed on other jobs per unit time is proportional to $(1/T_i) \cdot (D_1 + D_2 + D_3)$. If we minimize the sum of this quantity across all families $F_i$, again subject to the resource utilization constraint $\sum_i f_i \le 1$ using the Lagrange method, we obtain the following invariant across job families:

$$\frac{B_i^2}{\lambda_i \cdot t_i^s} - t_i^s \cdot \sum_j \lambda_j + \frac{B_i^2}{\lambda_i \cdot t_i^s} \cdot (\lambda_i \cdot t_i^n) \cdot \left( t_i^n \cdot \sum_j \lambda_j + 1 \right)$$

The scheduling policy resulting from this invariant does achieve the hoped-for $O(a^{1/2})$ average PWT in our example two-family scenario.

4.1.3 Implementation and Intuition

Recall the workload feasibility condition $\sum_i \lambda_i \cdot t_i^n < 1$ from Section 3.1. If the executor's load is spread across a large number of job families, then for each $F_i$, $\lambda_i \cdot t_i^n$ is small. Hence, it is reasonable to drop the terms involving $\lambda_i \cdot t_i^n$ from our above formulae, yielding the following simplified invariants. (There are also practically-motivated reasons to drop terms involving $t^n$, as we discuss in Section 5. In Section 6 we give empirical justification for dropping the $t^n$ terms.)

• Relaxation 1 result: for all job families $F_i$, the following quantity is equal: $\dfrac{B_i^2}{\lambda_i \cdot t_i^s}$

• Relaxation 2 result: for all job families $F_i$, the following quantity is equal: $\dfrac{B_i^2}{\lambda_i \cdot t_i^s} - t_i^s \cdot \sum_j \lambda_j$

A simple way to translate these statements into implementable policies is as follows: assign a numeric priority $P_i$ to each job family $F_i$, and every time the executor becomes idle, schedule the family with the highest priority, as a single batch of $\hat{B}_i$ jobs, where $\hat{B}_i$ denotes the queue length for family $F_i$. If we are in steady state, then $\hat{B}_i$ should roughly equal $B_i$. This observation suggests the following priority values for the scheduling policies implied by Relaxations 1 and 2, respectively:

$$\text{AA Policy 1:} \quad P_i = \frac{\hat{B}_i^2}{\lambda_i \cdot t_i^s}$$

$$\text{AA Policy 2:} \quad P_i = \frac{\hat{B}_i^2}{\lambda_i \cdot t_i^s} - t_i^s \cdot \sum_j \lambda_j$$
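A minimal sketch of how these priorities translate into a scheduler step follows; the queue state is invented for illustration, and this is not the paper's simulator code:

```python
# Sketch: AA Policy priorities and the scheduler's pick when it goes idle.
def aa_policy_1(B_hat, lam, t_s):
    if t_s == 0:  # nonsharable (singleton) family: run right away
        return float("inf")
    return B_hat**2 / (lam * t_s)

def aa_policy_2(B_hat, lam, t_s, lam_sum):
    if t_s == 0:
        return float("inf")
    return B_hat**2 / (lam * t_s) - t_s * lam_sum

queues = {  # family -> (queue length B_hat, lambda_i, t_s_i); hypothetical
    "F1": (50, 2.0, 1.0),
    "F2": (2, 0.5, 9.0),
}
lam_sum = sum(lam for _, lam, _ in queues.values())

# Run the entire queue of the highest-priority family as one batch.
pick = max(queues, key=lambda f: aa_policy_2(*queues[f], lam_sum))
print("next batch:", pick, "with", queues[pick][0], "jobs")
```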
These priority formulae have a fairly simple intuitive explanation. First, if many new jobs with a high degree of sharing are expected to arrive in the future ($\lambda_i \cdot t_i^s$ in the denominator, which we refer to as the sharability of family $F_i$), we should postpone execution of $F_i$ and allow additional jobs to accumulate into the same batch, so as to achieve greater sharing with little extra waiting. On the other hand, as the number of enqueued jobs becomes large ($\hat{B}_i^2$ in the numerator), the execution priority increases quadratically, which eventually forces the execution of a batch from family $F_i$ to avoid imposing excessive delay on the enqueued jobs.

Policy 2 has an extra subtractive term, which penalizes long batches (i.e., ones with large $t_i^s$) if the overall rate of arrival of jobs is high (i.e., high $\sum_j \lambda_j$). Doing so allows short batches to execute ahead of long batches, in the spirit of shortest-job-first.

For singleton job families (families with just one job), $t_i^s = 0$ and the priority value $P_i$ goes to infinity. Hence nonsharable jobs are to be scheduled ahead of sharable ones. The intuition is that nonsharable jobs cannot be beneficially coexecuted with future jobs, so we might as well execute them right away. If there are multiple nonsharable jobs, ties can be broken according to shortest-job-first.

4.2 Maximum Absolute PWT

Here, instead of optimizing for average absolute PWT, we optimize for the maximum. We again adopt a relaxation of the original problem that assumes parallel executors and infinitely divisible work. Under the relaxation, the objective function is:

$$\min \max_i \omega_i^{MA}$$

where $\omega_i^{MA}$ is the maximum absolute PWT for $F_i$ jobs.

As stated in Section 4.1.1, there are two factors that contribute to the PWT of a newly-arrived job: (1) the delay until the next batch is formed, and (2) the fact that a batch of size $B_i$ takes longer to finish than a singleton batch. The maximum values of these factors are $T_i$ and $(B_i - 1) \cdot t_i^n$, respectively. Overall,

$$\omega_i^{MA} = T_i + (B_i - 1) \cdot t_i^n$$

or, written differently:

$$\omega_i^{MA} = T_i \cdot (1 + \lambda_i \cdot t_i^n) - t_i^n$$

In the optimal solution $\omega_i^{MA}$ is constant across all job families $F_i$. The intuition behind this result is that if one of the $\omega_i^{MA}$ values is larger than the others, we can decrease it somewhat by increasing the other $\omega_i^{MA}$ values, thereby reducing the maximum PWT. Hence in the optimal solution all $\omega_i^{MA}$ values are equal.

4.2.1 Implementation and Intuition

As justified in Section 4.1.3, we drop terms involving $\lambda_i \cdot t_i^n$ from our $\omega^{MA}$ formula and obtain $\omega_i^{MA} \approx T_i - t_i^n$. As stated in Section 3, we assume the $t^n$ values to be a small component of the overall job execution times, so we also drop the $-t_i^n$ term and arrive at the approximation $\omega_i^{MA} \approx T_i$.

Let $\hat{T}_i$ denote the waiting time of the oldest enqueued $F_i$ job, which should roughly equal $T_i$ in steady state. We use $\hat{T}_i$ as the basis for our priority-based scheduling policy:

$$\text{MA Policy (FIFO):} \quad P_i = \hat{T}_i$$

This policy can be thought of as FIFO applied to job family batches, since it schedules the family of the job that has been waiting the longest.
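In code, the MA Policy amounts to a one-line selection; the arrival times below are hypothetical:

```python
# Sketch: the MA Policy is FIFO over family batches. Given the arrival time
# of each family's oldest enqueued job, pick the family whose oldest job
# has waited the longest (P_i = T_hat_i).
now = 1000.0
oldest_arrival = {"F1": 970.0, "F2": 905.0, "F3": 996.0}

pick = max(oldest_arrival, key=lambda f: now - oldest_arrival[f])
print("next batch:", pick)  # F2: its oldest job has waited 95 time units
```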
5. PRACTICAL CONSIDERATIONS

The scheduling policies we derived in Section 4 rely on several parameters related to job execution cost and job arrival rates. In this section we explain how these parameters can be obtained in practice.

Robust cost estimation: The fact that we were able to drop the nonsharable execution time $t^n$ from our scheduling priority formulae not only keeps them simple, it also means that the scheduler does not need to estimate this quantity. In practice, estimating the full execution time of a job accurately can be difficult, especially in the Map-Reduce context, in which processing is specified via opaque user-defined functions. (In Section 6 we verify empirically that the performance of our policies is not sensitive to whether the factors involving $t^n$ are included.)

Our formulae do require estimates of the sharable execution time $t^s$, i.e., the IO cost of scanning the input file. For large files, this cost is nearly linearly proportional to the size of the input file, a quantity that is easy to obtain from system metadata. (The proportionality constant can be dropped, as linear scaling of the $t^s$ values does not affect our priority-based scheduling policies.)

Dynamic estimation of arrival rates: Some of our priority formulae contain $\lambda$ values, which denote job arrival rates. Under the Poisson model of arrival, one can estimate the $\lambda$ values dynamically, by keeping a time-decayed count of arrivals. In this way the arrival rate estimates ($\lambda$ values) automatically adjust as the workload shifts over time. (See Section 6.1 for details.)

6. BASIC EXPERIMENTS

In this section we present experiments that:

• Justify ignoring the nonsharable execution time component $t^n$ in our scheduling policies (Section 6.2).

• Compare our scheduling policy variants empirically (Section 6.3).

(We compare our policies against baseline policies in Section 8.)

6.1 Experimental Setup

We built a simulator and a workload generator. Our workload consists of 100 job families. For each job family, the sharable cost $t^s$ is generated from the heavy-tailed distribution $1 + |X|$, where $X$ is a Cauchy random variable. For greater realism, the nonsharable cost $t^n$ is on a per-job basis, rather than a per-family basis as in our model in Section 3. In our default workload, each time a job arrives, we select a nonshared cost randomly as follows: with probability 0.6, $t^n = 0.1 \cdot t^s$; with probability 0.2, $t^n = 0.2 \cdot t^s$; with probability 0.2, $t^n = 0.3 \cdot t^s$. (The scenario we focus on in this paper is one in which the shared cost dominates, because it represents IO and jobs tend to be IO-bound, as discussed in Section 3.) In some of our experiments we deviate from this default workload and study what happens when $t^n$ tends to be larger than $t^s$.

Job arrival events are generated using the standard homogeneous Poisson point process [2]. Each job family $F_i$ has an arrival parameter $\lambda_i$, which represents the expected number of jobs that arrive in one unit of time. There are 500,000 units of time in each run of the experiments. The $\lambda_i$ values are initially chosen from a Pareto distribution with parameter $\alpha = 1.9$ and then are rescaled so that $\sum_i \lambda_i \cdot E[t_i^n] = \text{load}$. The total asymptotic system load ($\sum_i \lambda_i \cdot t_i^n$) is 0.5 by default.

Some of our scheduling policies require estimation of the job arrival rate $\lambda_i$. To do this, we maintain an estimate $I_i$ of the difference in the arrival times of the next two jobs in family $F_i$. We adjust $I_i$ as new job arrivals occur, by taking a weighted average of our previous estimate $I_i$ and $A_i$, the difference in arrival times of the two most recent jobs from $F_i$. Formally, the update step is $I_i \leftarrow 0.05 \cdot A_i + 0.95 \cdot I_i$. Given $I_i$ and the time $t$ since the last arrival of a job in $F_i$, we estimate $\lambda_i$ as $1/I_i$ if $t < I_i$, and as $1/(0.05 \cdot t + 0.95 \cdot I_i)$ otherwise.
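In code, the estimator described above looks roughly like the following (one instance per job family; the "otherwise" branch is as reconstructed above, blending a long quiet spell in as if a job had just arrived):

```python
# Sketch: online arrival-rate estimation via an EWMA of inter-arrival gaps.
class RateEstimator:
    def __init__(self, initial_gap=1.0):
        self.I = initial_gap  # I_i: estimated inter-arrival time
        self.last = 0.0       # time of the most recent arrival

    def on_arrival(self, t):
        gap = t - self.last   # A_i: latest observed inter-arrival gap
        self.I = 0.05 * gap + 0.95 * self.I
        self.last = t

    def rate(self, t):
        gap = t - self.last   # time since the last arrival
        if gap < self.I:
            return 1.0 / self.I
        # a long quiet spell is evidence of a lower rate: blend it in
        return 1.0 / (0.05 * gap + 0.95 * self.I)

est = RateEstimator()
for t in [1.0, 1.8, 3.1, 3.9]:
    est.on_arrival(t)
print("estimated lambda:", round(est.rate(5.0), 3))
```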
6.2 Influence of Nonshared Execution Time

In our first set of experiments, we measure how knowledge of $t^n$ affects our scheduling policies. Recall that in Sections 4.1.3 and 4.2.1 we dropped $t^n$ from the priority formulae, on the grounds that the factors involving $t^n$ are small relative to other factors. To validate ignoring $t^n$ in our scheduling policies, we compare $t^n$-aware variants (which use the full formulae with $t^n$ values) against the $t^n$-ignorant variants presented in Sections 4.1.3 and 4.2.1. (The $t^n$-aware variants are given knowledge of the precise $t^n$ value of each job instance in the queue.)

[Figure 3: tn-awareness versus tn-ignorance for AA Policy 2.]

[Figure 4: tn-awareness versus tn-ignorance for MA Policy.]

Figures 3 and 4 plot the performance of the $t^n$-aware and $t^n$-ignorant variants of our policies (AA Policy 2 and MA Policy, respectively) as we vary the magnitude of the shared cost (keeping the $t^n$ distribution and $\lambda$ values fixed). In both graphs, the y-axis plots the metric the policy is tuned to optimize (AA PWT and MA PWT, respectively). The x-axes plot the shared cost divisor, which is the factor by which we divided all shared costs. When the shared cost divisor is large (e.g., 100), the $t^s$ values become quite small relative to the $t^n$ values, on average.

Even when nonshared costs are large relative to shared costs (right-hand side of Figures 3 and 4), $t^n$-awareness has little impact on performance. Hence from this point forward we only consider the simpler, $t^n$-ignorant variants of our policies.
6.3 Comparison of Policy Variants

6.3.1 Relaxation 1 versus Relaxation 2

We now turn to a comparison of AA Policy 1 versus AA Policy 2 (recall that these are based on Relaxation 1 (Section 4.1.1) and Relaxation 2 (Section 4.1.2) of the original AA PWT minimization problem, respectively).

[Figure 5: AA Policy 1 versus AA Policy 2, varying shared cost skew.]

Figure 5 shows that the two variants exhibit nearly identical performance, even as we vary the skew in the shared cost ($t^s$) distribution among job families (here there are five job families $F_i$ with shared cost $t_i^s = i^\alpha$, where $\alpha$ is the skew parameter).
[Figure 6: AA Policy 1 versus AA Policy 2, varying shared cost skew, with λi·tsi held constant.]

However, if we introduce the invariant that $\lambda_i \cdot t_i^s$ (which represents the "sharability" of jobs in family $F_i$; see Section 4.1.3) remain constant across all job families $F_i$, a different picture emerges. Figure 6 shows the result of varying the shared cost skew, as we hold $\lambda_i \cdot t_i^s$ constant across job families. (Here there are two job families, with $t_2^s = \lambda_1 = 1$ and $t_1^s = \lambda_2$ equal to the skew parameter on the x-axis, so that $\lambda_i \cdot t_i^s$ is the same for both.) In this case, we see a clear difference in performance between the policies based on the two relaxations, with the one based on Relaxation 2 (AA Policy 2) performing much better.

Overall, it appears that AA Policy 2 dominates AA Policy 1, as expected. As to whether the case in which AA Policy 2 performs significantly better than AA Policy 1 is likely to occur in practice, we do not know. Clearly, using AA Policy 2 is the safest option, and besides, it is not much more complex to implement than AA Policy 1.

6.3.2 Use of Different Estimators

Recall that our AA Policies 1 and 2 (Section 4.1.3) have a $\hat{B}_i^2/\lambda_i$ term. In the model assumed by Relaxation 1, using the equivalence $\hat{B}_i = \hat{T}_i \cdot \lambda_i$, we can rewrite this term in four different ways: $\hat{B}_i^2/\lambda_i$ (using batch size), $\hat{T}_i^2 \cdot \lambda_i$ (using waiting time), $\hat{B}_i \cdot \hat{T}_i$ (the geometric mean of the two previous options), and $\max\left[\hat{B}_i^2/\lambda_i,\; \hat{T}_i^2 \cdot \lambda_i\right]$.

[Figure 7: Relative effectiveness of different priority formula variants.]

In Figure 7 we compare these variants, and also compare using the true $\lambda$ values versus using an online estimator for $\lambda$ as described in Section 6.1. We used a more skewed nonshared cost ($t^n$) distribution than in our other experiments, to get a clear separation of the variants. In particular we used: with probability 0.6, $t^n = 0.1 \cdot t^s$; with probability 0.2, $t^n = 0.2 \cdot t^s$; with probability 0.1, $t^n = 0.5 \cdot t^s$; with probability 0.1, $t^n = 1.0 \cdot t^s$. We generated 20 sample workloads, and for each workload we computed the best AA PWT among the policy variants. For each policy variant, Figure 7 plots the fraction of times the policy variant had an AA PWT that was more than 3% worse than the best AA PWT for that workload. The result is that the variant that uses $\hat{B}_i^2/\lambda_i$ (the form given in Section 4.1.3) clearly outperforms the rest. Furthermore, estimating the arrival rates ($\lambda$ values) works fine, compared to knowing them in advance via an oracle.
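The four rewrites are easy to compare side by side; in the sketch below (hypothetical queue state), they coincide in steady state and diverge once the queue length stops matching the oldest job's wait, e.g., after a burst:

```python
# Sketch: the four equivalent-in-steady-state rewrites of the B^2/lambda term.
def term_variants(B_hat, T_hat, lam):
    return {
        "batch size   B^2/lam": B_hat**2 / lam,
        "waiting time T^2*lam": T_hat**2 * lam,
        "geometric mean B*T":   B_hat * T_hat,
        "max of the first two": max(B_hat**2 / lam, T_hat**2 * lam),
    }

# In steady state B_hat ~ T_hat * lam, so all four agree:
print(term_variants(B_hat=6.0, T_hat=3.0, lam=2.0))   # all equal 18
# After a burst the queue is long relative to the oldest wait:
print(term_variants(B_hat=12.0, T_hat=3.0, lam=2.0))  # they disagree
```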
6.4 Summary of Findings

The findings from our basic experiments are:

• Estimating the arrival rates ($\lambda$ values) online, as opposed to knowing them from an oracle, does not hurt performance.

• It is not necessary to incorporate $t^n$ estimates into the priority functions.

• AA Policy 2 (which is based on Relaxation 2) dominates AA Policy 1 (based on Relaxation 1).

From this point forward, we use $t^n$-ignorant AA Policy 2 with online $\lambda$ estimation.

7. HYBRID SCHEDULING POLICIES

The quality of a scheduling policy is generally evaluated using several criteria [9], and so optimizing for either the average or maximum perceived wait time, as in Section 4, may be too extreme. If we optimize solely for the average, there may be certain jobs with very high PWT. Conversely, if we optimize solely for the maximum, we end up punishing the majority of jobs in order to help a few outlier jobs. In practice it may make more sense to optimize for a combination of average and maximum PWT. A simple approach is to optimize for a linear combination of the two:

$$\min \sum_i \left( \alpha \cdot \omega_i^{AA} + (1 - \alpha) \cdot \omega_i^{MA} \right)$$

where $\omega^{AA}$ denotes average absolute PWT and $\omega^{MA}$ denotes maximum absolute PWT. The parameter $\alpha \in [0, 1]$ denotes the relative importance of having low average PWT versus low maximum PWT.

We apply the methods used in Section 4 to the hybrid optimization objective, resulting in the following policy:

$$\text{Hybrid Policy:} \quad P_i = \alpha \cdot \frac{1}{2 \cdot \sum_j \lambda_j} \cdot \left[ \frac{\hat{B}_i^2}{\lambda_i \cdot t_i^s} - t_i^s \cdot \sum_j \lambda_j \right] + x_i \cdot (1 - \alpha) \cdot \frac{\hat{T}_i^2}{2 \cdot t_i^s}$$

where $x_i = 1$ if $\hat{T}_i = \max_j \hat{T}_j$, and $x_i = 0$ otherwise.

The hybrid policy degenerates to the nonhybrid policies of Section 4 if we set $\alpha = 0$ or $\alpha = 1$. For intermediate values of $\alpha$, job families receive the same relative priority as they would under the average PWT regime, except the family that has been waiting the longest (i.e., the one with $x_i = 1$), which gets an extra boost in priority. This "extra boost" reduces the maximum wait time, while raising the average wait time a bit.
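A sketch of the hybrid priority as reconstructed above follows. The queue state is hypothetical, and $\alpha = 0.99$ is the balance point identified later in Section 8.1:

```python
# Sketch: Hybrid Policy priority = alpha-weighted AA Policy 2 term plus a
# boost for the family whose oldest job has waited the longest.
def hybrid_priority(B_hat, T_hat, lam, t_s, lam_sum, T_max, alpha):
    aa_term = B_hat**2 / (lam * t_s) - t_s * lam_sum
    boost = T_hat**2 / (2 * t_s) if T_hat >= T_max else 0.0  # the x_i term
    return alpha * aa_term / (2 * lam_sum) + (1 - alpha) * boost

state = {  # family -> (B_hat, T_hat, lambda_i, t_s_i); hypothetical
    "F1": (40, 12.0, 2.0, 1.0),
    "F2": (3, 55.0, 0.2, 8.0),
}
lam_sum = sum(v[2] for v in state.values())
T_max = max(v[1] for v in state.values())

pick = max(state, key=lambda f: hybrid_priority(*state[f], lam_sum, T_max, 0.99))
print("next batch:", pick)
```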
8. FURTHER EXPERIMENTS

We are now ready for further experiments. In particular we study:

• The behavior of our hybrid policy (Section 8.1).

• The performance of our policies compared to baseline policies (Section 8.2.1).

• The ability to cope with large bursts of job arrivals (Section 8.2.2).

8.1 Hybrid Policy

[Figure 8: Hybrid Policy performance on average and maximum absolute PWT, as we vary the hybrid parameter α.]

[Figure 9: Hybrid Policy performance on average and maximum relative PWT, as we vary α.]

Figure 8 shows the performance of our Hybrid Policy (Section 7), in terms of both average and maximum absolute PWT. Figure 9 shows the same thing, but for relative PWT. In both graphs the x-axis plots the hybrid parameter $\alpha$ (this axis is not on a linear scale, for the purpose of presentation). The decreasing curve plots average PWT, whose scale is on the left-hand y-axis; the increasing curve plots maximum PWT, whose scale is on the right-hand y-axis.

With $\alpha = 0$, the hybrid policy behaves like the MA Policy (FIFO), which achieves low maximum PWT at the expense of very high average PWT. On the other extreme, with $\alpha = 1$ it behaves like the AA Policy, which achieves low average PWT but very high maximum PWT. Using intermediate values of $\alpha$ trades off the two objectives.
8.2 Comparison Against Baselines

In the following experiments, we compare the policies AA Policy 2, MA Policy (FIFO), and the Hybrid Policy with α = 0.99 against two generalizations of shortest-job-first (SJF). The policy “Aware SJF” is the one given in Section 4.1: it knows the nonshared cost of the jobs in its queue, and chooses the job family for which it can execute the largest number of jobs per unit of time (i.e., the family that minimizes (batch execution cost)/B). By a simple interchange argument it can be shown that this policy is optimal for the case when jobs have stopped arriving. The policy “Oblivious SJF” does not know the nonshared costs of jobs, and so it chooses the family for which ts/B is minimized. This policy is optimal for the case when jobs have stopped arriving and the nonshared costs are small.
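A sketch of the two selection rules follows. The attribute names, and our reading of “batch execution cost” as one shared scan plus the nonshared work of every enqueued job, are assumptions for illustration.

```python
def batch_cost(family):
    # One shared scan of the family's file (ts) plus each enqueued
    # job's nonshared processing (tn) -- our assumed cost model.
    return family.ts + sum(job.tn for job in family.queue)

def aware_sjf_pick(families):
    """Aware SJF: maximize jobs completed per unit time, i.e. minimize
    (batch execution cost) / B, where B is the number of enqueued jobs."""
    candidates = [f for f in families if f.queue]
    return min(candidates, key=lambda f: batch_cost(f) / len(f.queue))

def oblivious_sjf_pick(families):
    """Oblivious SJF: nonshared costs unknown, so minimize ts / B."""
    candidates = [f for f in families if f.queue]
    return min(candidates, key=lambda f: f.ts / len(f.queue))
```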
In these experiments we tested how these policies are affected by the total load placed on the system. (Recall from Section 3.1 that asymptotic load = Σi λi·tn.) To vary load, we started with workloads with asymptotic load = 0.1, and then caused load to increase by various increments, in one of two ways: (1) increase the nonshared costs (tn values), or (2) increase the job arrival rates (λ values). In both cases, all other workload parameters are held constant.

In Section 8.2.1 we report results for the case where job arrivals are generated by a homogeneous Poisson point process. In Section 8.2.2 we report results under bursty arrivals.

8.2.1 Stationary Workloads

In Figure 10 we plot AA PWT as the job arrival rate, and thus total system load, increases. It is clear that Aware SJF has terrible performance. The reason is as follows: In our workload generator, expected nonshared costs are proportional to shared costs (e.g., the cost of a CPU scan of the file is roughly proportional to its size on disk). Hence, Aware SJF has a very strong preference for job families with small shared cost (essentially ignoring the batch size), which leads to starvation of ones with large shared cost.

[Figure 10: Policy performance on AA PWT metric, as job arrival rates increase (both SJF variants shown).]

In the rest of our experiments we drop Aware SJF, so we can focus on the performance differences among the other policies. Figure 11 is the same as Figure 10, with Aware SJF removed and the y-axis re-scaled. Here we see that AA Policy 2 and the Hybrid Policy outperform both FIFO and SJF, especially at higher loads.

In Figure 12 we show the corresponding graph with MA PWT on the y-axis. Here, as expected, FIFO and the Hybrid Policy perform very well.

[Figure 11: Policy performance on AA PWT metric, as job arrival rates increase.]

[Figure 12: Policy performance on MA PWT metric, as job arrival rates increase.]

Figures 13 and 14 show the corresponding plots for the case where load increases due to a rise in nonshared cost. These graphs are qualitatively similar to Figures 11 and 12, but the differences among the scheduling policies are less pronounced.

[Figure 13: Policy performance on AA PWT metric, as nonshared costs increase.]

[Figure 14: Policy performance on MA PWT metric, as nonshared costs increase.]

Figures 15, 16, 17 and 18 are the same as Figures 11, 12, 13 and 14, respectively, but with the y-axis measuring relative PWT. If we are interested in minimizing relative PWT, our policies, which aim to minimize absolute PWT, do not necessarily do as well as SJF. Devising policies that specifically optimize for relative PWT is an important topic for future work.

[Figure 15: Policy performance on AR PWT metric, as job arrival rates increase.]

[Figure 16: Policy performance on MR PWT metric, as job arrival rates increase.]

[Figure 17: Policy performance on AR PWT metric, as nonshared costs increase.]

[Figure 18: Policy performance on MR PWT metric, as nonshared costs increase.]

8.2.2 Bursty Workloads

To model bursty job arrival behavior we use two different Poisson processes for each job family. One Poisson process corresponds to a low arrival rate, and the other corresponds to an arrival rate that is ten times as fast. We switch between these processes using a Markov process: after a job arrives, we switch states (from high arrival rate to low arrival rate, or vice versa) with probability 0.05, and stay in the same state with probability 0.95. The initial probability of either state is the stationary distribution of this process (i.e., with probability 0.5 we start with a high arrival rate). The expected number of jobs coming from bursts is the same as the expected number of jobs not coming from bursts. If λi is the arrival rate of the non-burst process, then the expected arrival rate (number of jobs per second) asymptotically equals 20λi/11: each state persists for an expected 20 arrivals, so a low-rate visit lasts 20/λi seconds in expectation while a high-rate visit lasts only 2/λi, giving 40 jobs per expected 22/λi seconds. Thus the load is Σi E[λi]·E[tn].
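A minimal sketch of this arrival generator (Python; the function and parameter names are ours):

```python
import random

def bursty_arrival_times(lam_low, horizon, seed=0):
    """Arrival times for one job family under the two-state Markov-
    modulated Poisson process described above: rates lam_low and
    10*lam_low, a state flip with probability 0.05 after each arrival,
    and each starting state equally likely."""
    rng = random.Random(seed)
    lam = lam_low if rng.random() < 0.5 else 10.0 * lam_low
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam)   # exponential interarrival at current rate
        if t > horizon:
            return times            # long-run rate approaches 20*lam_low/11
        times.append(t)
        if rng.random() < 0.05:     # switch burst regime after an arrival
            lam = 10.0 * lam_low if lam == lam_low else lam_low
```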
In Figures 19 and 20 we show the average and maximum absolute PWT, respectively, for bursty job arrivals as load increases via increasing nonshared costs. Here, SJF slightly outperforms our policies on AA PWT, but our Hybrid Policy performs well on both average and maximum PWT.

[Figure 19: Policy performance on AA PWT metric, as nonshared costs increase, with bursty job arrivals.]

[Figure 20: Policy performance on MA PWT metric, as nonshared costs increase, with bursty job arrivals.]

Figure 21 shows average absolute PWT as the job arrival rate increases, while keeping the nonshared cost distribution constant. Here AA Policy 2 and Hybrid slightly outperform SJF.

[Figure 21: Policy performance on AA PWT metric, as arrival rates increase, with bursty job arrivals.]

To visualize the temporal behavior in the presence of bursts, Figure 22 shows a moving average of absolute PWT on the y-axis, with time plotted on the x-axis. This time series is a sample realization of the experiment that produced Figure 19, with load = 0.7.

[Figure 22: Performance over time, with bursty job arrivals.]
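A smoothed series of this kind can be computed with a simple sliding window over completed jobs; a minimal sketch follows (the window size is ours, as the paper does not state one).

```python
def moving_average_pwt(samples, window=100):
    """samples: (completion_time, absolute_pwt) pairs in completion
    order. Returns (time, mean PWT over the last `window` completions),
    the kind of smoothed series plotted in Figure 22."""
    series = []
    for i in range(window - 1, len(samples)):
        recent = samples[i - window + 1 : i + 1]
        mean_pwt = sum(pwt for _, pwt in recent) / window
        series.append((samples[i][0], mean_pwt))
    return series
```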
Since our policies focus on exploiting job arrival rate (λ) estimates, it is not surprising that under extremely bursty workloads, where there is no semblance of a steady-state λ, they do not perform as well relative to the baselines as under stationary workloads (Section 8.2.1). However, it is reassuring that our Hybrid Policy does not perform noticeably worse than shortest-job-first, even under these extreme conditions.

8.3 Summary of Findings

The findings from our experiments on the absolute PWT metric, which our policies are designed to optimize, are:

• Our MA Policy (a generalization of FIFO to shared workloads) is the best policy on maximum PWT, but performs poorly on average PWT, as expected.
• Our Hybrid Policy, if properly tuned, achieves a “sweet spot” in balancing average and maximum PWT, and is able to perform quite well on both.
• With stationary workloads, our Hybrid Policy substantially outperforms the better of two generalizations of shortest-job-first to shared workloads.
• With extremely bursty workloads, our Hybrid Policy performs on par with shortest-job-first.

9. SUMMARY

In this paper we studied how to schedule jobs that can share scans over a common set of input files. The goal is to amortize expensive file scans across many jobs, but without unduly hurting individual job response times.

Our approach builds a simple stochastic model of job arrivals for each input file, and takes into account anticipated future jobs while scheduling the jobs that are currently enqueued. The main idea is as follows: If an enqueued job J requires scanning a large file F, and we anticipate the near-term arrival of additional jobs that also scan F, then it may make sense to delay J, if it has not already waited too long and other, less sharable, jobs are available to run.

We formalized the problem and derived a simple and effective scheduling policy, under the objective of minimizing perceived wait time (PWT) for completion of user jobs. Our policy can be tuned for average PWT, maximum PWT, or a combination of the two objectives. Compared with the baseline shortest-job-first and FIFO policies, which do not account for future sharing opportunities, our policies achieve significantly lower perceived wait time. This means that users' jobs will generally complete earlier under our scheduling policies.

10. REFERENCES

[1] R. H. Arpaci-Dusseau. Run-time adaptation in River. ACM Trans. on Computing Systems, 21(1):36–86, Feb. 2003.
[2] P. Billingsley. Probability and Measure. John Wiley & Sons, Inc., New York, 3rd edition, 1995.
[3] M. C. Chou, H. Liu, M. Queyranne, and D. Simchi-Levi. On the asymptotic optimality of a simple on-line algorithm for the stochastic single-machine weighted completion time problem and its extensions. Operations Research, 54(3):464–474, 2006.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.
[5] S. Divakaran and M. Saks. Online scheduling with release times and set-ups. Technical Report 2001-50, DIMACS, 2001.
[6] P. M. Fernandez. Red Brick Warehouse: A read-mostly RDBMS for open SMP platforms. In Proc. ACM SIGMOD, 1994.
[7] A. Gupta, S. Sudarshan, and S. Vishwanathan. Query scheduling in multiquery optimization. In International Symposium on Database Engineering and Applications (IDEAS), 2001.
[8] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A simultaneously pipelined relational query engine. In Proc. ACM SIGMOD, 2005.
[9] H. Hoogeveen. Multicriteria scheduling. European Journal of Operational Research, 167(3):592–623, 2005.
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. European Conference on Computer Systems (EuroSys), 2007.
[11] D. Karger, C. Stein, and J. Wein. Scheduling algorithms. In M. J. Atallah, editor, Handbook of Algorithms and Theory of Computation. CRC Press, 1997.
[12] E. L. Lawler. Optimal sequencing of a single machine subject to precedence constraints. Management Science, 19(5):544–546, 1973.
[13] J. Lenstra, A. R. Kan, and P. Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1:343–362, 1977.
[14] N. Megow, M. Uetz, and T. Vredeveld. Models and algorithms for stochastic online scheduling. Mathematics of Operations Research, 31(3), 2006.
[15] R. H. Möhring, F. J. Radermacher, and G. Weiss. Stochastic scheduling problems I – general strategies. Mathematical Methods of Operations Research, 28(7):193–260, 1984.
[16] R. Motwani, S. Phillips, and E. Torng. Non-clairvoyant scheduling. In Proc. SODA Conference, pages 422–431, 1993.
[17] S. Muthukrishnan, R. Rajaraman, A. Shaheen, and J. E. Gehrke. Online scheduling to minimize average stretch. In Proc. FOCS Conference, 1999.
[18] K. Pruhs, J. Sgall, and E. Torng. Online scheduling, chapter 15. Handbook of Scheduling: Algorithms, Models, and Performance Analysis. Chapman & Hall/CRC, 2004.
[19] A. S. Schulz. New old algorithms for stochastic scheduling. In Algorithms for Optimization with Incomplete Information, Dagstuhl Seminar Proceedings, 2005.
[20] J. Sgall. Online scheduling – a survey. In On-Line Algorithms, Lecture Notes in Computer Science. Springer-Verlag, 1997.
[21] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative scans: Dynamic bandwidth sharing in a DBMS. In Proc. VLDB Conference, 2007.