1. Hardware-aware thread scheduling: the case of asymmetric multicore processors
Achille Peternier*, Danilo Ansaloni, Daniele Bonetta,
Cesare Pautasso and Walter Binder
* achille.peternier@usi.ch
http://sosoa.inf.unisi.ch
3. Context
• Modern CPUs increase computational power by adding cores
• HW architectures are becoming increasingly complex
– Shared caches
– Non-Uniform Memory Access (NUMA)
– Single Instruction Multiple Data (SIMD) registers
– Simultaneous Multithreading (SMT) units
4. Context
• The Operating System (OS) kernel and scheduler try to automatically optimize application performance according to the available resources
– Based on the underlying HW
– Using a limited set of performance indicators (CPU time, memory usage, etc.)
5. “Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”
“Performance Anxiety: Performance Analysis in the New Millennium”
Joshua Bloch, Google Inc.
6. Contributions
1) An automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers
2) A hardware-aware optimized scheduler making decisions based on hardware resource usage and the output of the workload analysis
– to improve processing-unit occupancy on SMT/asymmetric processors
7. The big picture
[Diagram: a monitoring daemon samples the FPU and INT usage of the running OS threads and processes and feeds the workload characterization]
8. The big picture
[Diagram: the hardware-aware scheduler uses the workload characterization to place FPU- and INT-intensive threads on the processing units]
10. AMD Bulldozer
• AMD Bulldozer architecture (a topology-discovery sketch follows this slide)
– Each CPU is implemented as a series of modules (a.k.a. “cores”), each containing two processing units (a.k.a. “SMT units”)
– Dedicated Arithmetic-Logic Units (ALUs) are available in each SMT unit, whereas the floating point unit is shared within a module
– A module is therefore more similar to:
• a dual core when doing integer ops
• a single core with SMT=2 when doing floating point ops
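As a hedged illustration (not taken from the talk): the module/PU layout described above can be inspected programmatically with hwloc, the topology library the BulldOver server builds on (slide 20). Note that, depending on the kernel version, hwloc may report a Bulldozer module as one core with two PUs or as two separate cores.

/* Minimal sketch, assuming Linux with hwloc installed (not the
 * authors' code): enumerate cores and processing units (PUs). */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(&topo);   /* scan the underlying architecture */

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int npus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    printf("cores: %d, processing units (SMT): %d\n", ncores, npus);

    hwloc_topology_destroy(topo);
    return 0;
}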
15. Workload characterization
• Used to rank processes and threads by how floating point intensive they are
– among the X busiest running threads
• (where X = the number of available cores)
• Based on a real-time monitoring system using Hardware Performance Counters (HPCs)
16. …about HPCs…
• Registers embedded in processors to keep track of hardware-related events such as cache misses, CPU cycles, branch mispredictions, etc. (a read-out sketch follows this slide)
• Very low overhead (about 1%)
• Extremely accurate
• A limited resource: only a few of them can be used at the same time
– This has limited their wide adoption on a large scale (so far)
• HW-specific
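As a minimal, hedged sketch of how such counters are accessed on Linux (via the generic perf_event_open(2) system call; an illustration only, not the authors' libHpcOverseer code):

/* Minimal sketch, assuming the Linux perf_event interface (not the
 * authors' code): count the CPU cycles consumed by the calling thread. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* one of the HPCs listed on slide 17 */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this thread */, -1 /* any CPU */, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... workload to be measured ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));   /* read the counter value */
    printf("CPU cycles: %lld\n", cycles);
    close(fd);
    return 0;
}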
17. Workload characterization
• HPCs used (a scoring sketch follows this list):
– PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
– CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating point units are not used by a thread
– L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
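The slides do not spell out how the counters are combined; a natural score (a hypothetical sketch, not the authors' formula) is the fraction of a thread's cycles during which the FPU was busy:

/* Hypothetical FPU-intensity score (not taken from the talk):
 * fraction of a thread's cycles during which the FPU was in use.
 * 0.0 = purely integer workload, 1.0 = FPU busy every cycle. */
double fpu_intensity(unsigned long long cpu_cycles,
                     unsigned long long cycles_fpu_empty) {
    if (cpu_cycles == 0) return 0.0;
    return 1.0 - (double)cycles_fpu_empty / (double)cpu_cycles;
}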
20. BulldOver design
• Server
– Daemon
– Scans the underlying architecture
– Time-based HPC monitoring (once per second; see the loop sketch below)
• We target scientific workloads; short-lived threads are not well suited
– Applies scheduling policies
– Built on libHpcOverseer, hwloc, libpfm
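A minimal sketch of such a daemon loop, with stub functions standing in for the real work (the actual server builds on libHpcOverseer, hwloc, and libpfm; the function names are ours):

/* Minimal sketch of the server's once-per-second monitoring loop.
 * The three stubs are illustrative placeholders, not the real code. */
#include <unistd.h>

static void sample_hpcs(void)           { /* read HPCs of the busiest threads */ }
static void characterize_workload(void) { /* rank threads by FPU intensity    */ }
static void apply_policy(void)          { /* (re)pin threads to SMT units     */ }

int main(void) {
    for (;;) {
        sample_hpcs();
        characterize_workload();
        apply_policy();
        sleep(1);  /* 1 s sampling period, as stated above */
    }
}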
21. BulldOver design
• Client
– Command-line tool
• prompt> bulldover java myprogram
– Traces the creation/termination of threads/processes
– Shares information with the server through shared memory (see the sketch below)
– Built on libmonitor, boost
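A hedged sketch of what the client side of such a shared-memory channel can look like on Linux; the segment name "/bulldover" and the record layout are hypothetical, not taken from the talk:

/* Hypothetical sketch of a POSIX shared-memory channel, client side.
 * The name "/bulldover" and struct layout are invented for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

struct thread_record { pid_t tid; int alive; };  /* illustrative layout */

int main(void) {
    int fd = shm_open("/bulldover", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, sizeof(struct thread_record));

    struct thread_record *rec = mmap(NULL, sizeof(*rec),
                                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (rec == MAP_FAILED) { perror("mmap"); return 1; }

    rec->tid = getpid();  /* announce this process to the daemon */
    rec->alive = 1;

    munmap(rec, sizeof(*rec));
    close(fd);
    return 0;
}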
24. Testing environment
• Dell PowerEdge M915
– 4x AMD Opteron 6282 SE 2.6 GHz CPUs (16 cores/8 modules each)
• Limited to 1 CPU with 8 cores/4 modules
– Tests limited to a single NUMA node
• Avoiding latencies and other well-known NUMA-related effects
– Turbo mode and frequency scaling disabled
25. Benchmark suites
• SPEC CPU 2006
– A perfect match for evaluating integer vs. floating point behaviors
• SciMark 2.0
– Java-based
– Noisy environment (additional threads for garbage collection, JIT compilation, etc.)
– Mainly FPU-oriented, with different levels of stress
– Modified multi-threaded version running several random benchmarks over a thread pool
29. Results for SPEC CPU 2006
Running 4x integer and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores).
[Chart comparing three thread placements: an inefficient baseline, the default OS scheduling, and the improved scheduling]
30. Discussion
• BulldOver avoids the worst-case scenario
– The default OS scheduler is not aware of the workload characterization
• Benefits come both from improved cache usage AND from better FPU/integer unit occupancy (a sketch of the pairing idea follows below)
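The talk calls the policy “very simple” (slide 34) without spelling it out; the following is a plausible sketch of the pairing idea consistent with the discussion above, not the exact BulldOver algorithm, and the PU numbering is our assumption:

/* Plausible pairing policy (not the exact BulldOver algorithm):
 * co-locate the most FPU-intensive thread with the least FPU-intensive
 * one on the same module, so the two SMT units do not compete for the
 * shared FPU. Assumption: PUs 2m and 2m+1 belong to module m. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

static void pin(pid_t tid, int pu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(pu, &set);
    sched_setaffinity(tid, sizeof(set), &set);
}

/* tids[] holds the n busiest threads, sorted by descending FPU intensity. */
void pair_threads(pid_t *tids, int n) {
    for (int m = 0; m < n / 2; m++) {
        pin(tids[m],         2 * m);      /* FPU-heavy thread  */
        pin(tids[n - 1 - m], 2 * m + 1);  /* INT-heavy partner */
    }
}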
31. Results for SciMark 2.0
Running 8x benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores).
[Chart comparing the default OS scheduling with the improved scheduling]
32. Discussion
• All the threads are FPU-intensive
– but at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage intensity varies over time
– BulldOver reacts accordingly
33. Conclusions
- We showed how thread scheduling that is not aware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty
- We presented a monitoring system that is able to characterize the most active threads according to their FPU/integer usage
- Thanks to the real-time analysis, improved scheduling can be applied and performance improved
- Our system is minimally intrusive:
- Low overhead (below 2%)
- No kernel patching required
- No code instrumentation
- Works on any application
34. Conclusions
• Currently tuned for a specific HW architecture
• Well suited to scientific workloads
– A sampling period is required (1 sec in our case; it could be shorter, but cannot be 0…)
• Based on a very simple scheduling policy
– More sophisticated policies could be used
36. “Pow7Over”
• Work in progress on IBM POWER7 processors
– 1 CPU, 8 cores, up to 4 SMT units per core
– A completely different…
• …operating system: RHEL 6.3
• …architecture: PowerPC
• …set of HPCs: IBM-specific ones (more than 500 available…)
• …build toolchain: autotools 6.0
• Similar approach
• Slightly less significant speedup
– But this is a full SMT design
– Similar overall behavior both for the PUs and the L2 caches