This talk discusses shortcomings of existing memory managers and proposes solutions. Current memory managers are inadequate for high-performance applications on modern multiprocessor architectures: they limit scalability and performance. The talk introduces the Heap Layers framework for building customizable memory managers, and describes Hoard, a provably scalable memory manager that bounds memory consumption by explicitly tracking utilization and moving free memory to a global heap. Finally, it presents Reap, an extended memory manager for server applications.
Memory Management for High-Performance Applications
1. Memory Management for High-Performance Applications
Emery Berger
University of Massachusetts Amherst
2. High-Performance Applications
Web servers, search engines, scientific codes (written in C or C++)
Run on one or a cluster of server boxes (multiple CPUs, RAM, RAID drives)
Needs support at every level:
  compiler
  runtime system
  operating system
  hardware
[Diagram: multi-CPU server boxes with RAM and RAID drives beneath a software stack]
3. New Applications,
Old Memory Managers
Applications and hardware have changed
Multiprocessors now commonplace
Object-oriented, multithreaded
Increased pressure on the memory manager (malloc, free)
But memory managers have not kept up
Inadequate support for modern applications
4. Current Memory Managers Limit Scalability
As we add processors, the program slows down
Caused by heap contention
[Chart: Runtime Performance; speedup vs. number of processors (1-14), Ideal vs. Actual]
Larson server benchmark on a 14-processor Sun
5. The Problem
Current memory managers are inadequate for high-performance applications on modern architectures
They limit scalability & application performance
6. This Talk
Building memory managers
Heap Layers framework
Problems with current memory managers
Contention, false sharing, space
Solution: provably scalable memory manager
Hoard
Extended memory manager for servers
Reap
7. Implementing Memory Managers
Memory managers must be
Space efficient
Very fast
Heavily-optimized C code
Hand-unrolled loops
Macros
Monolithic functions
Hard to write, reuse, or extend
8. Real Code: DLmalloc 2.7.2
#define chunksize(p)          ((p)->size & ~(SIZE_BITS))
#define next_chunk(p)         ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))
#define prev_chunk(p)         ((mchunkptr)(((char*)(p)) - ((p)->prev_size)))
#define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes) \
do { \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp); \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T); \
  long mcn; \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; } \
  switch (mctmp) { \
    case 0: for(;;) { *mzp++ = 0; \
    case 7:           *mzp++ = 0; \
    case 6:           *mzp++ = 0; \
    case 5:           *mzp++ = 0; \
    case 4:           *mzp++ = 0; \
    case 3:           *mzp++ = 0; \
    case 2:           *mzp++ = 0; \
    case 1:           *mzp++ = 0; if (mcn <= 0) break; mcn--; } \
  } \
} while (0)
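(For reference: MALLOC_ZERO above is a hand-unrolled, Duff's-device-style zeroing loop packed into a macro, exactly the kind of monolithic, heavily optimized C that the previous slide describes as hard to write, reuse, or extend.)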
9. Programming Language Support
Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy
Sounds great...
10. A Heap Layer
C++ mixin with malloc & free methods:

template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {…};

[Diagram: RedHeapLayer layered on top of GreenHeapLayer]
11. Example: Thread-Safe Heap Layer
LockedHeap: protect the superheap with a lock
LockedMallocHeap: a LockedHeap layered over mallocHeap
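A minimal sketch of what such a layer might look like in modern C++ (the SuperHeap mixin pattern follows the previous slide; the std::mutex and the MallocHeap bottom layer are assumptions for illustration, not the original Heap Layers code):

#include <cstddef>
#include <cstdlib>
#include <mutex>

// Bottom layer: forwards to the system allocator.
class MallocHeap {
public:
  void * malloc (std::size_t sz) { return std::malloc (sz); }
  void free (void * ptr)         { std::free (ptr); }
};

// Thread-safety layer: serialize every call to the superheap with a lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void * malloc (std::size_t sz) {
    std::lock_guard<std::mutex> guard (_lock);
    return SuperHeap::malloc (sz);
  }
  void free (void * ptr) {
    std::lock_guard<std::mutex> guard (_lock);
    SuperHeap::free (ptr);
  }
private:
  std::mutex _lock;
};

// Mixin composition: a thread-safe heap built from two reusable layers.
using LockedMallocHeap = LockedHeap<MallocHeap>;

Because the layering happens at compile time, the composed heap pays no virtual-call overhead, which is the "no overhead" property claimed for mixins on the previous slide.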
12. Empirical Results
Heap Layers vs. the originals:
  KingsleyHeap vs. the BSD (Kingsley) allocator
  LeaHeap vs. DLmalloc 2.7
Competitive runtime and memory efficiency
[Charts: runtime and space normalized to the Lea allocator, for Kingsley, KingsleyHeap, Lea, and LeaHeap across cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average]
13. Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
14. Problems with General-Purpose Memory Managers
Previous work for multiprocessors:
  Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
  Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems:
    P-fold or even unbounded increase in space
    Allocator-induced false sharing
15. Multiple Heap Allocator:
Pure Private Heaps
One heap per processor:
  malloc gets memory from its local heap
  free puts memory on its local heap
Used by STL, Cilk, and ad hoc allocators
[Diagram: processors 0 and 1 each allocate objects (x1 through x4) and free them entirely on their own heaps; key: in use vs. free, and which heap the memory sits on]
17. Multiple Heap Allocator:
Private Heaps with Ownership
free returns memory to the original (owning) heap
Bounded memory consumption: no crash!
Used by "Ptmalloc" (Linux) and LKmalloc
[Diagram: processor 0 runs x1 = malloc(1) and x2 = malloc(1); processor 1 frees x1 and x2, returning them to processor 0's heap]
18. Problem:
P-fold Memory Blowup
Occurs in practice: round-robin producer-consumer
  processor i mod P allocates; processor (i+1) mod P frees
[Diagram: processor 0 runs x1 = malloc(1); processor 1 runs free(x1) and x2 = malloc(1); processor 2 runs free(x2) and x3 = malloc(1); and so on]
Footprint = 1 (2GB), but space = 3 (6GB)
Exceeds the 32-bit address space: crash!
19. Problem:
Allocator-Induced False Sharing
False sharing: non-shared objects on the same cache line
The bane of parallel applications; extensively studied
All these allocators cause false sharing!
[Diagram: processor 0 and processor 1 each call malloc(1); the two objects land on one cache line, so CPU 0 and CPU 1 thrash the line back and forth across the bus]
20. So What Do We Do Now?
Where do we put free memory?
  On a central heap: heap contention
  On our own heap (pure private heaps): unbounded memory consumption
  On the original heap (private heaps with ownership): P-fold blowup
How do we avoid false sharing?
21. Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
22. Hoard: Key Insights
Bound local memory consumption
Explicitly track utilization
Move free memory to a global heap
Provably bounds memory consumption
Manage memory in large chunks
Avoids false sharing
Reduces heap contention
23. Overview of Hoard
Manage memory in page-sized heap blocks
  Avoids false sharing
Allocate from the local (per-processor) heap block
  Avoids heap contention
On low utilization, move the heap block to the global heap
  Avoids space blowup
[Diagram: per-processor heaps (processor 0 … processor P-1) exchanging heap blocks with a global heap]
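A rough sketch of the release rule behind "move free memory to a global heap" (the constants f and K, and all names here, are illustrative assumptions; the real Hoard applies this kind of check per size class to its page-sized heap blocks):

#include <cstddef>

// Per-processor heap statistics (illustrative).
struct LocalHeapStats {
  std::size_t inUse;       // u: bytes currently allocated to the application
  std::size_t allocated;   // a: bytes this heap holds in heap blocks
};

const double      kEmptyFraction = 0.25;  // f: allowed fraction of free memory (assumed value)
const std::size_t kBlockSize     = 4096;  // S: one page-sized heap block
const std::size_t kSlack         = 2;     // K: allowed number of spare free blocks (assumed value)

// After a free, decide whether this heap now holds too much free memory.
// If it does, a Hoard-style manager moves a mostly-empty heap block to the
// global heap, where other processors can reuse it.
bool shouldReleaseBlock (const LocalHeapStats & h) {
  bool tooEmpty    = h.inUse < (1.0 - kEmptyFraction) * h.allocated;
  bool enoughSlack = h.inUse + kSlack * kBlockSize < h.allocated;
  return tooEmpty && enoughSlack;
}

Because a block only leaves a local heap when that heap is demonstrably holding too much free memory, each processor's contribution to blowup stays bounded, which is what the space bound on the next slide formalizes.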
24. Summary of Analytical Results
Space consumption: near-optimal worst case
  Hoard: O(n log(M/m) + P), with P « n
  Optimal: O(n log(M/m))  [Robson 70]
  Private heaps with ownership: O(P n log(M/m))
  (n = memory required, M = biggest object size, m = smallest object size, P = processors)
Provably low synchronization
25. Empirical Results
Measure runtime on 14-processor Sun
Allocators
Solaris (system allocator)
Ptmalloc (GNU libc)
mtmalloc (Sun’s “MT-hot” allocator)
Micro-benchmarks
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes (tests for false sharing)
Real application experience similar
26. Runtime Performance:
threadtest
Many threads, no sharing
Hoard achieves linear speedup
speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)
27. Runtime Performance:
Larson
Many threads, with sharing (server-style)
Hoard achieves linear speedup
28. Runtime Performance:
false sharing
Many threads, mostly reads & writes of heap data
Hoard achieves linear speedup
29. Hoard in the “Real World”
Open source code
www.hoard.org
13,000 downloads
Solaris, Linux, Windows, IRIX, …
Widely used in industry
AOL, British Telecom, Novell, Philips
Reports: 2x-10x, “impressive” improvement in performance
Search server, telecom billing systems, scene rendering,
real-time messaging middleware, text-to-speech engine,
telephony, JVM
Scalable general-purpose memory manager
30. Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
31. Custom Memory Allocation
Replace new/delete, bypassing the general-purpose allocator
  Reduce runtime: often
  Expand functionality: sometimes
  Reduce space: rarely
Very common practice: Apache, gcc, lcc, STL, database servers…
Language-level support in C++ ("Use custom allocators")
32. The Reality
The Lea allocator is often as fast or faster
Custom allocation is ineffective, except for regions [OOPSLA 2002]
[Chart: Runtime, Custom Allocator Benchmarks; normalized runtime of Custom, Win32, and DLmalloc across the non-region benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc), the region benchmarks (apache, lcc, mudlle), and the averages]
33. Overview of Regions
Separate areas, deletion only en masse
  regioncreate(r)
  regionmalloc(r, sz)
  regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion
- Too much space
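A minimal sketch of the mechanism behind these properties (a single-chunk, fixed-capacity illustration; the class name, the alignment rule, and the fixed chunk size are assumptions, and real region libraries grow by chaining chunks):

#include <cstddef>
#include <cstdlib>
#include <new>

// A toy region: one chunk, pointer-bumping allocation, en masse deletion.
class Region {
public:
  explicit Region (std::size_t capacity)
    : _start (static_cast<char *> (std::malloc (capacity))),
      _bump (_start),
      _end (_start + capacity) {
    if (!_start) throw std::bad_alloc ();
  }
  ~Region () { std::free (_start); }          // one call frees all memory

  // Pointer-bumping allocation: no per-object headers, no free lists.
  void * allocate (std::size_t sz) {
    sz = (sz + 7) & ~std::size_t (7);         // keep 8-byte alignment
    if (sz > static_cast<std::size_t> (_end - _bump)) return nullptr;
    void * p = _bump;
    _bump += sz;
    return p;
  }
  // Note what is missing: there is no way to free an individual object,
  // which is exactly the "too much space" drawback listed above.
private:
  char * _start;
  char * _bump;
  char * _end;
};

Reap, introduced later in the talk, adds individual object deletion on top of exactly this region style.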
34. Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated
connections or transactions
Current approach (e.g., Apache): regions
35. Drawbacks of Regions
Can't reclaim memory within regions
  A problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs: unbounded memory consumption
Current situation for Apache:
  vulnerable to denial of service
  limits the runtime of connections
  limits module programming
36. Reap Hybrid Allocator
Reap = region + heap
Adds individual object deletion & a heap
  reapcreate(r)
  reapmalloc(r, sz)
  reapfree(r, p)
  reapdelete(r)
Can reduce memory consumption
+ Fast
+ Adapts to use (region or heap style)
+ Cheap deletion
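To make the "adapts to use" idea concrete, here is a self-contained toy hybrid in the same spirit (an invented design for illustration, not the actual Reap implementation): allocation bumps a pointer as in a region, freed objects go onto per-size free lists and are reused, and destroying the object releases everything at once.

#include <cstddef>
#include <map>
#include <vector>

class ToyReap {
public:
  void * allocate (std::size_t sz) {
    sz = (sz + 7) & ~std::size_t (7);
    std::vector<void*> & freed = _freeLists[sz];
    if (!freed.empty ()) {                     // heap-style reuse of freed objects
      void * p = freed.back ();
      freed.pop_back ();
      return p;
    }
    if (_bump + sz > sizeof (_chunk)) return nullptr;
    void * p = _chunk + _bump;                 // region-style pointer bumping
    _bump += sz;
    return p;
  }
  void deallocate (void * p, std::size_t sz) { // individual object deletion
    sz = (sz + 7) & ~std::size_t (7);
    _freeLists[sz].push_back (p);
  }
  // Destroying the ToyReap releases the whole chunk at once, like reapdelete.
private:
  char        _chunk[1 << 16] = {};            // fixed chunk for simplicity
  std::size_t _bump = 0;
  std::map<std::size_t, std::vector<void*>> _freeLists;
};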
37. Using Reap as Regions
[Chart: Runtime, Region-Based Benchmarks; normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on lcc and mudlle (one bar is clipped at 4.08)]
Reap performance nearly matches regions
38. Reap: Best of Both Worlds
Combining new/delete with regions is usually impossible:
  Incompatible APIs
  Hard to rewrite code
Using Reap: incorporated new/delete code into Apache
  "mod_bc" (an arbitrary-precision calculator)
  Changed 20 lines (out of 8000)
  Benchmark: compute the 1000th prime
    With Reap: 240K
    Without Reap: 7.4MB
39. Summary
Building memory managers
Heap Layers framework [PLDI 2001]
Problems with current memory managers
Contention, false sharing, space
Solution: provably scalable memory manager
Hoard [ASPLOS-IX]
Extended memory manager for servers
Reap [OOPSLA 2002]
40. Current Projects
CRAMM: Cooperative Robust Automatic Memory Management
  Garbage collection without paging
  Automatic heap sizing
SAVMM: Scheduler-Aware Virtual Memory Management
Markov: a programming language for building high-performance servers
COLA: Customizable Object Layout Algorithms
  Improving locality in Java
41. www.cs.umass.edu/~plasma
43. Looking Forward
“New” programming languages
Increasing use of Java = garbage collection
New architectures
NUMA, SMT/CMP (“hyperthreading”)
Technology trends
Memory hierarchy
44. The Ever-Steeper Memory Hierarchy
Higher = smaller, faster, closer to the CPU
A real desktop machine (mine):
  registers: 8 integer, 8 floating-point; 1-cycle latency
  L1 cache: 8K data & instructions; 2-cycle latency
  L2 cache: 512K; 7-cycle latency
  RAM: 1GB; 100-cycle latency
  Disk: 40GB; 38,000,000-cycle latency (!)
45. Swapping & Throughput
When the heap exceeds available memory, throughput plummets
46. Why Manage Memory At All?
Just buy more!
Simplifies memory management
Still have to collect garbage eventually…
Workload fits in RAM = no more swapping!
Sounds great…
47. Memory Prices Over Time
[Chart: RAM Prices Over Time, in 1977 dollars per GB, 1977-2005; successive generations of conventional DRAM (2K, 8K, 32K, 128K, 512K, 2M, 8M) drive prices down steadily from roughly $10,000/GB toward $0.01/GB]
"Soon it will be free…"
48. Memory Prices: Inflection Point!
[Chart: RAM Prices Over Time, in 1977 dollars per GB, 1977-2005; conventional DRAM (2K through 8M) keeps falling, but the newer SDRAM, RDRAM, DDR, and Chipkill parts (512M, 1G) show an inflection point: prices no longer drop as before]
49. Memory Is Actually Expensive
Desktops:
  Most ship with 256MB
  1GB = 50% more $$ (laptops: 70%, if possible at all; limited capacity)
Servers:
  Buy 4GB, get 1 CPU free!
  Sun Enterprise 10000: 8GB extra = $150,000! (8GB of Sun RAM = 1 Ferrari Modena)
Fast RAM: new technologies
Cosmic rays…
50. Key Problem: Paging
Garbage collectors: VM oblivious
GC disrupts LRU queue
Touches non-resident pages
Virtual memory managers: GC oblivious
Likely to evict pages needed by GC
Paging
Orders of magnitude more time than RAM
BIG hit in performance and LONG pauses
51. Cooperative Robust Automatic Memory Management (CRAMM)
[Diagram: the garbage collector and the virtual memory manager cooperate. The collector registers as a "cooperative application." The VM manager tracks per-process and overall memory utilization and sends coarse-grained (heap-level) notifications of changes in memory pressure, to which the collector responds by adjusting its heap size. The VM manager also sends fine-grained (page-level) page-eviction notifications when page replacement selects victim pages, and the collector evacuates those pages.]
Joint work with Eliot Moss (UMass) and Scott Kaplan (Amherst College)
52. Fine-Grained Cooperative GC
[Diagram: page replacement in the virtual memory manager selects victim pages and sends fine-grained page-eviction notifications; the garbage collector evacuates those pages]
Goal: GC triggers no additional paging
Key ideas:
  Adapt the collection strategy on the fly
  Page-oriented memory management
  Exploit detailed page information from the VM
53. Summary
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Future directions
54. If You Have to Spend $$...
more Ferraris: good
more memory: bad
56. This Page Intentionally Left Blank
57. Virtual Memory Manager Support
New VM required: detailed page-level information
“Segmented queue” for low overhead (unprotected and protected segments)
Local, per-process LRU ordering, not global LRU (gLRU, as in Linux)
Complementary to SAVM work:
“Scheduler-Aware Virtual Memory manager”
Under development – modified Linux kernel
58. Current Work: Robust Performance
Currently: no VM-GC communication
  BAD interactions under memory pressure
Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management
[Diagram: the virtual memory manager passes memory-pressure and LRU-queue information to the garbage collector / allocator, which hands back empty pages, reducing paging impact]
59. Current Work: Predictable VMM
Recent work on scheduling for QoS
E.g., proportional-share
Under memory pressure, the VMM is effectively the scheduler
  Paged-out processes may never recover
  Intermittent processes may wait a long time
Scheduler-faithful virtual memory
(with Scott Kaplan, Prashant Shenoy)
Based on page value rather than order
60. Conclusion
Memory management for high-performance applications
Heap Layers framework [PLDI 2001]
Reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]
High-performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]
Provides speed & robustness for server applications
Current work: robust memory management for
multiprogramming
61. The Obligatory URL Slide
http://www.cs.umass.edu/~emery
62. If You Can Read This,
I Went Too Far
63. Hoard: Under the Hood
[Diagram: Hoard's heap-layer composition. A SelectSizeHeap selects a heap based on object size: large objects (> 4K) go to a MallocOrFreeHeap, while smaller requests go to a PerProcessorHeap that mallocs from a local heap block and frees to the owning heap block (FreeToHeapBlock). LockedHeap-protected HeapBlockManagers get empty heap blocks from, and return them to, the SuperblockHeap, which gets or returns memory to the global System Heap.]
64. Custom Memory Allocation
Replace new/delete, bypassing the general-purpose allocator
  Reduce runtime: often
  Expand functionality: sometimes
  Reduce space: rarely
Very common practice: Apache, gcc, lcc, STL, database servers…
Language-level support in C++ ("Use custom allocators")
65. Drawbacks of Custom Allocators
Avoiding the memory manager means:
  More code to maintain & debug
  Can't use memory debuggers
  Not modular or robust: mixing memory from custom and general-purpose allocators → crash!
  Increased burden on programmers
66. Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
67. (1) Per-Class Allocators
Recycle freed objects from a per-class free list
  a = new Class1;
  b = new Class1;
  c = new Class1;
  delete a;
  delete b;
  delete c;
  a = new Class1;
  b = new Class1;
  c = new Class1;
[Diagram: Class1's free list collects the deleted objects a, b, and c; the later news pop them back off the list]
+ Fast: linked-list operations
+ Simple
+ Identical semantics
+ C++ language support
- Possibly space-inefficient
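A minimal sketch of this pattern (assuming a single-threaded program and that sizeof(Class1) >= sizeof(void*); the class name and payload are placeholders, while the class-level operator new/delete is the C++ language support referred to above):

#include <cstddef>
#include <cstdlib>

class Class1 {
public:
  // Class-specific allocation: recycle a previously deleted object if possible.
  static void * operator new (std::size_t sz) {
    if (freeList != nullptr) {
      void * p = freeList;
      freeList = *static_cast<void **> (p);   // next pointer lives in the dead object
      return p;
    }
    return std::malloc (sz);                  // otherwise fall back to the general heap
  }
  // Class-specific deallocation: push the object onto the free list.
  // Objects are never returned to the system, which is the possible
  // space inefficiency noted above.
  static void operator delete (void * p) {
    *static_cast<void **> (p) = freeList;
    freeList = p;
  }
private:
  static void * freeList;
  double payload[4];                          // stand-in for the class's real fields
};

void * Class1::freeList = nullptr;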
68. (II) Custom Patterns
Tailor-made to fit allocation patterns
Example: 197.parser (natural-language parser)
  A single char[MEMORY_LIMIT] array with an end_of_array pointer
  a = xalloc(8);
  b = xalloc(16);
  c = xalloc(8);
  xfree(b);
  xfree(c);
  d = xalloc(8);
+ Fast
+ Pointer-bumping allocation
- Brittle
- Fixed memory size
- Requires stack-like lifetimes
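A sketch in the same spirit (the array size and the exact behavior of xfree are assumptions about 197.parser's allocator, used only to illustrate why lifetimes must nest like a stack):

#include <cassert>
#include <cstddef>

// One fixed array plus a bump index; frees simply roll the index back.
const std::size_t MEMORY_LIMIT = 1 << 20;     // illustrative size
static char        memory[MEMORY_LIMIT];
static std::size_t end_of_array = 0;

static void * xalloc (std::size_t sz) {
  assert (end_of_array + sz <= MEMORY_LIMIT); // fixed memory size: brittle
  void * p = memory + end_of_array;           // pointer-bumping allocation
  end_of_array += sz;
  return p;
}

static void xfree (void * p) {
  // Roll the bump index back to p, releasing p and everything allocated after it:
  // correct only for stack-like (LIFO) lifetimes.
  end_of_array = static_cast<std::size_t> (static_cast<char *> (p) - memory);
}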
69. (III) Regions
Separate areas, deletion only en masse
  regioncreate(r)
  regionmalloc(r, sz)
  regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion
- Too much space
70. Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
71. Custom Allocators Are Faster…
[Chart: Runtime, Custom Allocator Benchmarks; normalized runtime of Custom and Win32 across the non-region benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc), the region benchmarks (apache, lcc, mudlle), and the averages; the custom allocators appear faster]
72. Not So Fast…
[Chart: the same Custom Allocator Benchmarks with DLmalloc added; the Lea allocator matches or beats the custom allocators everywhere except the region-based benchmarks]
73. The Lea Allocator (DLmalloc 2.7.0)
Optimized for common allocation patterns
Per-size quicklists ≈ per-class allocation
Deferred coalescing (combining adjacent free objects)
Highly optimized fast path
Space-efficient
74. Space Consumption Results
[Chart: Space, Custom Allocator Benchmarks; normalized space of the Original (custom) allocators vs. DLmalloc across the non-region benchmarks, the region benchmarks, and the averages]
75. Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
76. Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated
connections or transactions
Current approach (e.g., Apache): regions
77. Drawbacks of Regions
Can't reclaim memory within regions
  A problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs: unbounded memory consumption
Current situation for Apache:
  vulnerable to denial of service
  limits the runtime of connections
  limits module programming
78. Reap Hybrid Allocator
Reap = region + heap
Adds individual object deletion & a heap
  reapcreate(r)
  reapmalloc(r, sz)
  reapfree(r, p)
  reapdelete(r)
Can reduce memory consumption
+ Fast
+ Adapts to use (region or heap style)
+ Cheap deletion
79. Using Reap as Regions
[Chart: Runtime, Region-Based Benchmarks; normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on lcc and mudlle (one bar is clipped at 4.08)]
Reap performance nearly matches regions
80. Reap: Best of Both Worlds
Combining new/delete with regions is usually impossible:
  Incompatible APIs
  Hard to rewrite code
Using Reap: incorporated new/delete code into Apache
  "mod_bc" (an arbitrary-precision calculator)
  Changed 20 lines (out of 8000)
  Benchmark: compute the 1000th prime
    With Reap: 240K
    Without Reap: 7.4MB
81. Conclusion
Empirical study of custom allocators
Lea allocator often as fast or faster
Custom allocation ineffective,
except for regions
Reaps:
Nearly matches region performance
without other drawbacks
Take-home message:
Stop using custom memory allocators!
83. Experimental Methodology
Comparing to general-purpose allocators
  Same semantics: no problem (e.g., disable per-class allocators)
  Different semantics: use an emulator
    Uses the general-purpose allocator but adds bookkeeping
    regionfree: free all associated objects
    Other functionality (nesting, obstacks)
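A minimal sketch of such an emulator (the structure and names are assumptions for illustration): every region allocation goes through the general-purpose allocator, with just enough bookkeeping that freeing the region releases all associated objects.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Emulated region: remember every pointer handed out so that a single call
// can free them all, mimicking regiondelete on top of plain malloc/free.
struct EmulatedRegion {
  std::vector<void*> objects;
};

static void * region_malloc (EmulatedRegion & r, std::size_t sz) {
  void * p = std::malloc (sz);      // general-purpose allocation
  if (p) r.objects.push_back (p);   // bookkeeping
  return p;
}

static void region_free_all (EmulatedRegion & r) {
  for (void * p : r.objects)        // free all associated objects
    std::free (p);
  r.objects.clear ();
}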
84. Use Custom Allocators?
Strongly recommended by practitioners
Little hard data on performance/space improvements
Only one previous study [Zorn 1992]:
  Focused on just one type of allocator
  Custom allocators: a waste of time (small gains, bad allocators)
Are different allocators better? What are the trade-offs?
85. Kinds of Custom Allocators
Three basic types of custom allocators
Per-class
Fast
Custom patterns
Fast, but very special-purpose
Regions
Fast, possibly more space-efficient
Convenient
Variants: nested, obstacks
86. Optimization Opportunity
[Chart: Time Spent in Memory Operations, as % of runtime, split into memory operations vs. other, across the custom-allocation benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle) and their average]
88. Custom Memory Allocation
Programmers often replace malloc/free
Attempt to increase performance
Provide extra functionality (e.g., for servers)
Reduce space (rarely)
Empirical study of custom allocators
Lea allocator often as fast or faster
Custom allocation ineffective,
except for regions. [OOPSLA 2002]
89. Overview of Regions
Separate areas, deletion only en masse
  regioncreate(r)
  regionmalloc(r, sz)
  regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion
- Too much space
90. Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated
connections or transactions
Current approach (e.g., Apache): regions
91. Drawbacks of Regions
Can't reclaim memory within regions
  A problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs: unbounded memory consumption
Current situation for Apache:
  vulnerable to denial of service
  limits the runtime of connections
  limits module programming
92. Reap Hybrid Allocator
Reap = region + heap
Adds individual object deletion & a heap
  reapcreate(r)
  reapmalloc(r, sz)
  reapfree(r, p)
  reapdelete(r)
Can reduce memory consumption
+ Fast
+ Adapts to use (region or heap style)
+ Cheap deletion
93. Using Reap as Regions
[Chart: Runtime, Region-Based Benchmarks; normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on lcc and mudlle (one bar is clipped at 4.08)]
Reap performance nearly matches regions
94. Reap: Best of Both Worlds
Combining new/delete with regions is usually impossible:
  Incompatible APIs
  Hard to rewrite code
Using Reap: incorporated new/delete code into Apache
  "mod_bc" (an arbitrary-precision calculator)
  Changed 20 lines (out of 8000)
  Benchmark: compute the 1000th prime
    With Reap: 240K
    Without Reap: 7.4MB