Long-running, short-running, memory-intensive, cpu-bound…don’t know what kind of jobs to expect. So how can the scheduler put them where they should be if it doesn’t know these things? Transition: Wouldn’t it be nice if the scheduler could just “handle it” – without the user having specify characteristics of their jobs in advance?
Our approach to this problem is DIOS – an adaptive distributed scheduler. Describe diagram. Transition: So you must be thinking…wait, how are you going to just “gather application-specific info”?
The answer is – we’ll write a tool with Pin, a dynamic instrumentation framework. Describe diagram – how it’s a mini VM. Describe points for inserting instrumentation and the tradeoffs – routine level, instruction level.
So we’ve established that Pin is a tool for what we want to do – dynamically instrument applications. But what code do we want to insert? What are we looking to get from our pintool? Since we are trying to detect and avoid memory contention between processes, it makes senses to study the memory behavior of the applications. To this end, we chose three things to start with (describe them). The figure to side there shows how the pintool fits in to our overall plan – it would collect information for each application and report the results to Hare, the local scheduler. Then Hare, which is also monitoring the memory subsystem of the local machine, reports to Rhino, and Rhino decides what to do.
Considering our motivation, it was important to try to evaluate it on a somewhat realistic workload. Since it seems like most long-running jobs on clusters are scientific applications, we wanted to use real scientific benchmarks. Describe benchmarks. To evaluate the scheduler, we measured the total runtime from… Then, to evaluate our pintool, we measured the overhead from running each application with our pintool and also tracked the information we collected over time to see if we could correlate it to interesting behavior or differences between programs.
Potential for improvement – we saw this from our baseline, using a simple policy to react to the presence of memory contention. Might be able to get even better results on long-running jobs, with better information on the running processes (like we could get from dynamic instrumentation!)
But on the other hand, there’s the bad. Although our scheduler works perfectly well with the pintool, we discovered that the overhead introduced by Pin is just too much. Some of our overhead results are below – we show the time to run the application natively, with pin (no pintool), with a tool that only counts instructions, and with our three metrics. The way we hoped to solve the overhead problem originally was to basically only instrument when we needed to –like when the scheduler decided the machine was performing badly. Then, the relatively high overhead to run the analysis wouldn’t have to make much of an impact overall. However, we were unable to get the performance gains we hoped – Pin doesn’t offer the ability to completely attach and detach from a running program, only to attach, and we discovered when we tried to add and remove insturmentation dynamically that we lost the gains from code caching. So while this idea could work with another system or with a new Pin, we couldn’t manage to bring the overhead down.
But on the bright side, at least it collected some interesting information. Note how similar the patterns of LU and heatedplate are – talk about how that’s probably because they are tightly looped and very repetitive, whereas Ocean is obviously performing a more irregular and complex analysis with some possible distinct phases in it. Possibility of using the variation in a metric like this to “predict the predictability” to separate applications that are better left alone from those that are more likely to be safely handled by common heuristics, etc.