3. Overview
• A Stop-the-World Collector performs garbage
collection while the application is completely
stopped
• A Parallel Collector uses multiple threads to
perform garbage collection
Example: Parallel Scavenge, available in
OpenJDK 7
4. Problem Statement
The stop-the-world (STW) algorithm degrades badly beyond
8 cores on a 48-core NUMA machine with OpenJDK 7:
– Does the stop-the-world design have intrinsic
limitations?
– If not, what are the limitations of the STW approach?
– How can we improve the current design?
7. Contended locks: GC monitor’s lock
[Figure: the end of the parallel phase, showing the GC monitor’s lock and a global counter]
Solution: remove redundant synchronization and
use timestamps to avoid race conditions
8. Contended locks: GC monitor’s lock
Idea: remove GC monitor’s lock
1. Task queue
Use lock-free task queue
2. Barrier at the end of parallel phase
Remove redundant synchronization
3. Condition variable of the GC monitor
Replace the condition variable with Linux’s
futex_wait calls.
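The three steps above can be illustrated with a minimal Java sketch. It is not the HotSpot implementation: `LockFreePhase` and `runPhase` are hypothetical names, the JDK's `ConcurrentLinkedQueue` stands in for the lock-free task queue, a single atomic counter stands in for the end-of-phase barrier, and `LockSupport.park`/`unpark` stands in for the futex-based wait. It also assumes that tasks do not create new tasks, which a real collector would have to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch: a parallel phase that uses no monitor lock.
final class LockFreePhase {

    /** Runs every task on `workers` threads; returns once all workers have finished. */
    static void runPhase(List<Runnable> work, int workers) {
        // 1. Task queue: ConcurrentLinkedQueue is a non-blocking queue.
        ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>(work);
        // 2. End-of-phase barrier: one atomic counter of workers still running.
        AtomicInteger active = new AtomicInteger(workers);
        Thread coordinator = Thread.currentThread();

        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    Runnable task;
                    // poll() hands each task to exactly one worker.
                    while ((task = queue.poll()) != null) {
                        task.run();
                    }
                } finally {
                    // The last worker to finish wakes the coordinator.
                    if (active.decrementAndGet() == 0) {
                        LockSupport.unpark(coordinator);
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }

        // 3. Wait without a condition variable: park until the counter is zero.
        // park() may return spuriously, so the counter is rechecked in a loop.
        while (active.get() != 0) {
            LockSupport.park();
        }
    }

    public static void main(String[] args) {
        AtomicInteger done = new AtomicInteger();
        List<Runnable> work = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            work.add(done::incrementAndGet);
        }
        runPhase(work, 4);
        System.out.println(done.get()); // prints 1000
    }
}
```

The counter's atomic decrement also publishes the workers' results to the coordinator, so no additional lock is needed to read them after `runPhase` returns.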
9. Lack of NUMA-awareness
[Figure: NUMA topology with memory banks attached to groups of CPUs]
NUMA – Non-Uniform Memory Access
• Memory access imbalance
• Memory locality
10. Lack of NUMA-awareness
• Interleaved spaces
– map pages from different nodes with a round-robin
policy
• Fragmented spaces
– thread allocates memory from the fragment
associated with the node where it is executing
• Segregated spaces
– fragmented space in which each fragment is accessed
only by GC threads running on the same node
Best performance: fragmented space for the young space, interleaved
spaces for the others
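The three placement policies can be modeled with a small Java sketch. This is a toy address-arithmetic model, not JVM code: `NumaSpaceModel` and its methods are hypothetical names, addresses are plain offsets into the space, and the space size is assumed to be a multiple of the node count.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical model of the interleaved, fragmented and segregated policies.
final class NumaSpaceModel {
    private final long pageSize;
    private final int nodes;
    private final long fragmentSize;
    private final AtomicLongArray used; // bytes already allocated in each fragment

    NumaSpaceModel(long spaceSize, long pageSize, int nodes) {
        this.pageSize = pageSize;
        this.nodes = nodes;
        this.fragmentSize = spaceSize / nodes; // assumes spaceSize % nodes == 0
        this.used = new AtomicLongArray(nodes);
    }

    /** Interleaved space: consecutive pages go to consecutive nodes, round robin. */
    int interleavedNodeOf(long address) {
        return (int) ((address / pageSize) % nodes);
    }

    /** Fragmented space: one contiguous fragment per node. */
    int fragmentNodeOf(long address) {
        return (int) (address / fragmentSize);
    }

    /**
     * Fragmented allocation: a thread running on `node` bumps the pointer of
     * that node's fragment. Returns the address, or -1 if the fragment is full.
     */
    long allocateOnNode(int node, long size) {
        while (true) {
            long offset = used.get(node);
            if (offset + size > fragmentSize) {
                return -1;
            }
            if (used.compareAndSet(node, offset, offset + size)) {
                return node * fragmentSize + offset;
            }
        }
    }

    /** Segregated space: a GC thread may only touch its own node's fragment. */
    boolean segregatedAccessAllowed(int gcThreadNode, long address) {
        return fragmentNodeOf(address) == gcThreadNode;
    }

    public static void main(String[] args) {
        // Illustrative sizes: 16 pages of 4 KiB spread over 4 nodes.
        NumaSpaceModel space = new NumaSpaceModel(65536L, 4096L, 4);
        System.out.println(space.interleavedNodeOf(5 * 4096L)); // prints 1
        System.out.println(space.allocateOnNode(2, 100));       // prints 32768
        System.out.println(space.segregatedAccessAllowed(1, 32768L)); // prints false
    }
}
```

In this model the interleaved mapping spreads accesses evenly over nodes, while the fragmented mapping keeps a thread's allocations on its own node.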
11. Results
Resulting GC: NAPS (NUMA-Aware Parallel Scavenge)
Effect of the optimizations on 3
benchmarks:
• SPECjbb2005
• SPECjvm2008
• DaCapo
8 memory nodes, 48 cores, 96 GB RAM, Linux 3.0 64-bit
12. Results
• NAPS improves performance and scalability over
Parallel Scavenge in almost all cases
• NAPS performance continues to increase up to 48
cores
• NAPS reduces pause time by a factor of up to 2.8 in the best
case
• NAPS improves responsiveness of applications
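As a rough way to observe collection time from inside an application, the standard `java.lang.management` API can be queried before and after a workload. This is only a sketch and not how the figures above were measured: `GcTimeProbe` is a hypothetical name, and the reported value is the accumulated collection time, which approximates pause time only for a stop-the-world collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe: sums the accumulated collection time of all collectors.
final class GcTimeProbe {

    /** Approximate total GC time in milliseconds since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // Illustrative workload: allocate short-lived arrays.
        long allocated = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            allocated += junk.length;
        }
        long after = totalGcMillis();
        System.out.println("bytes requested: " + allocated);
        System.out.println("approximate GC time during workload (ms): " + (after - before));
    }
}
```

Running the same workload with different collectors or thread counts and comparing the two readings gives a coarse view of how collection time changes.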