Memory Management
for High-Performance Applications
Emery Berger
University of Massachusetts Amherst

High-Performance Applications

- Web servers, search engines, scientific codes
- C or C++
- Run on one or a cluster of server boxes
- Need support at every level:
  software, compiler, runtime system, operating system, hardware

[Figure: server boxes, each with multiple CPUs, RAM, and RAID drives.]

New Applications, Old Memory Managers

- Applications and hardware have changed
  - Multiprocessors now commonplace
  - Object-oriented, multithreaded
  - Increased pressure on the memory manager (malloc, free)
- But memory managers have not kept up
  - Inadequate support for modern applications

Current Memory Managers Limit Scalability

- As we add processors, the program slows down
- Caused by heap contention

[Figure: speedup vs. number of processors (1-14) for the Larson server
benchmark on a 14-processor Sun; ideal speedup is linear, actual speedup
collapses.]

The Problem

- Current memory managers are inadequate for high-performance applications
  on modern architectures
  - They limit scalability and application performance

This Talk

- Building memory managers
  - Heap Layers framework
- Problems with current memory managers
  - Contention, false sharing, space
- Solution: a provably scalable memory manager
  - Hoard
- Extended memory manager for servers
  - Reap

Implementing Memory Managers

- Memory managers must be
  - Space efficient
  - Very fast
- Heavily optimized C code
  - Hand-unrolled loops
  - Macros
  - Monolithic functions
- Hard to write, reuse, or extend

Real Code: DLmalloc 2.7.2

#define chunksize(p)         ((p)->size & ~(SIZE_BITS))
#define next_chunk(p) ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))
#define prev_chunk(p) ((mchunkptr)(((char*)(p)) - ((p)->prev_size)))
#define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes)                                        \
do {                                                                      \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp);                       \
  CHUNK_SIZE_T mctmp = (nbytes) / sizeof(INTERNAL_SIZE_T);                \
  long mcn;                                                               \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp - 1) / 8; mctmp %= 8; }     \
  switch (mctmp) {                                                        \
    case 0: for (;;) { *mzp++ = 0;                                        \
    case 7:            *mzp++ = 0;                                        \
    case 6:            *mzp++ = 0;                                        \
    case 5:            *mzp++ = 0;                                        \
    case 4:            *mzp++ = 0;                                        \
    case 3:            *mzp++ = 0;                                        \
    case 2:            *mzp++ = 0;                                        \
    case 1:            *mzp++ = 0; if (mcn <= 0) break; mcn--; }          \
  }                                                                       \
} while (0)

Programming Language Support

- Classes
  - Overhead
  - Rigid hierarchy
- Mixins
  - No overhead
  - Flexible hierarchy
  - Sounds great...

A Heap Layer

- A C++ mixin with malloc & free methods
  (an illustrative layer is sketched below this slide)

    template <class SuperHeap>
    class GreenHeapLayer :
      public SuperHeap {…};

[Figure: a RedHeapLayer stacked on top of a GreenHeapLayer.]

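For concreteness, here is an illustrative layer in the same style (not from the talk): a mixin that counts outstanding allocations and delegates the real work to its superheap. Layers compose by nesting templates.

    #include <cstddef>

    template <class SuperHeap>
    class CountingHeapLayer : public SuperHeap {
    public:
      void* malloc (std::size_t sz) {
        ++_allocated;                     // bookkeeping added by this layer
        return SuperHeap::malloc (sz);    // real work done by the superheap
      }
      void free (void* ptr) {
        --_allocated;
        SuperHeap::free (ptr);
      }
      long outstanding () const { return _allocated; }
    private:
      long _allocated = 0;
    };

    // Composition by nesting, e.g. a Red layer over a Green layer over a base heap:
    // typedef RedHeapLayer<GreenHeapLayer<BaseHeap>> CustomHeap;
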
Example: Thread-Safe Heap Layer

- LockedHeap: protects its superheap with a lock
- Composing LockedHeap with mallocHeap yields LockedMallocHeap
  (a minimal sketch follows this slide)

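A minimal sketch of what such a layer can look like, using std::mutex (the Heap Layers library's own lock classes differ); MallocHeap here is an assumed base layer that simply forwards to the system allocator.

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>

    template <class SuperHeap>
    class LockedHeap : public SuperHeap {
    public:
      void* malloc (std::size_t sz) {
        std::lock_guard<std::mutex> guard (_lock);   // protect the superheap...
        return SuperHeap::malloc (sz);
      }
      void free (void* ptr) {
        std::lock_guard<std::mutex> guard (_lock);   // ...on both paths
        SuperHeap::free (ptr);
      }
    private:
      std::mutex _lock;
    };

    class MallocHeap {                               // assumed base layer
    public:
      void* malloc (std::size_t sz) { return std::malloc (sz); }
      void free (void* ptr) { std::free (ptr); }
    };

    typedef LockedHeap<MallocHeap> LockedMallocHeap; // the slide's composition
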
Empirical Results

- Heap Layers vs. the originals:
  - KingsleyHeap vs. the BSD allocator
  - LeaHeap vs. DLmalloc 2.7
- Competitive runtime and memory efficiency

[Figure: runtime and space, normalized to the Lea allocator, for Kingsley,
KingsleyHeap, Lea, and LeaHeap on cfrac, espresso, lindsay, LRUsim, perl,
and roboop.]

Overview

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap

Problems with General-Purpose Memory Managers

- Previous work for multiprocessors
  - Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]
    - Impractical
  - Multiple heaps [Larson 98, Gloger 99]
- These reduce contention but, as we show, cause other problems:
  - P-fold or even unbounded increase in space
  - Allocator-induced false sharing

Multiple Heap Allocator: Pure Private Heaps

- One heap per processor:
  - malloc gets memory from its local heap
  - free puts memory on its local heap
- Used by STL, Cilk, and ad hoc allocators

  Example (memory freed by processor 1 ends up on heap 1):

    processor 0          processor 1
    x1 = malloc(1)
    x2 = malloc(1)
    free(x1)             free(x2)
    x3 = malloc(1)       x4 = malloc(1)
    free(x3)             free(x4)

Problem: Unbounded Memory Consumption

- Producer-consumer:
  - Processor 0 allocates
  - Processor 1 frees
- Unbounded memory consumption
  - Crash!

    processor 0          processor 1
    x1 = malloc(1)
                         free(x1)
    x2 = malloc(1)
                         free(x2)
    x3 = malloc(1)
                         free(x3)

Multiple Heap Allocator: Private Heaps with Ownership

- free returns memory to the original heap
  (a rough sketch follows this slide)
- Bounded memory consumption
  - No crash!
- Used by "Ptmalloc" (Linux) and LKmalloc

    processor 0          processor 1
    x1 = malloc(1)
                         free(x1)
    x2 = malloc(1)
                         free(x2)

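A rough sketch of the ownership idea (the data structures and header layout are assumptions for illustration, not the Ptmalloc or LKmalloc implementations): each object records its owning heap, and free returns it there even when another processor performs the free.

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>
    #include <vector>

    struct PrivateHeap {
      std::mutex lock;
      std::vector<void*> freeObjects;   // simplistic: a single size class
    };

    struct Header { PrivateHeap* owner; };

    void* heapMalloc (PrivateHeap& myHeap, std::size_t sz) {
      std::lock_guard<std::mutex> g (myHeap.lock);
      void* raw;
      if (!myHeap.freeObjects.empty()) {          // reuse memory freed to this heap
        raw = myHeap.freeObjects.back();
        myHeap.freeObjects.pop_back();
      } else {
        raw = std::malloc (sizeof(Header) + sz);  // otherwise grow from the system
      }
      static_cast<Header*>(raw)->owner = &myHeap;
      return static_cast<Header*>(raw) + 1;
    }

    void heapFree (void* ptr) {
      Header* h = static_cast<Header*>(ptr) - 1;
      PrivateHeap* owner = h->owner;               // return memory to the original
      std::lock_guard<std::mutex> g (owner->lock); // heap, not the freeing heap
      owner->freeObjects.push_back (h);
    }
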
Problem: P-fold Memory Blowup

- Occurs in practice
- Round-robin producer-consumer
  - Processor i mod P allocates
  - Processor (i+1) mod P frees
- Footprint = 1 (2GB), but space = 3 (6GB)
  - Exceeds the 32-bit address space: Crash!

    processor 0        processor 1        processor 2
    x1 = malloc(1)
                       free(x1)
                       x2 = malloc(1)
                                          free(x2)
                                          x3 = malloc(1)
    free(x3)

Problem: Allocator-Induced False Sharing

- False sharing: non-shared objects on the same cache line
  - The bane of parallel applications
  - Extensively studied
- All of these allocators cause false sharing!
  (a small program reproducing the pattern follows this slide)

    processor 0          processor 1
    x1 = malloc(1)       x2 = malloc(1)
    thrash…              thrash…

[Figure: two CPUs with private caches on a shared bus; x1 and x2 sit on the
same cache line, which ping-pongs between the caches.]

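A small program that reproduces the pattern in the figure (illustrative; whether false sharing actually occurs depends on the allocator's placement decisions): each thread allocates a one-byte object and updates it in a tight loop; if the allocator places both objects on one cache line, that line ping-pongs between the two caches.

    #include <cstdlib>
    #include <thread>

    void worker () {
      char* p = static_cast<char*>(std::malloc (1));  // tiny heap object
      for (long i = 0; i < 100000000L; ++i) {
        *p += 1;        // each thread writes only its own object ("thrash…")
      }
      std::free (p);
    }

    int main () {
      std::thread t0 (worker);   // processor 0: x1 = malloc(1)
      std::thread t1 (worker);   // processor 1: x2 = malloc(1)
      t0.join();
      t1.join();
      return 0;
    }
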
So What Do We Do Now?

- Where do we put free memory?
  - On a central heap: heap contention
  - On our own heap (pure private heaps): unbounded memory consumption
  - On the original heap (private heaps with ownership): P-fold blowup
- How do we avoid false sharing?

Overview

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap

Hoard: Key Insights

- Bound local memory consumption
  - Explicitly track utilization
  - Move free memory to a global heap
  - Provably bounds memory consumption
- Manage memory in large chunks
  - Avoids false sharing
  - Reduces heap contention

Overview of Hoard

- Manage memory in heap blocks
  - Page-sized
  - Avoids false sharing
- Allocate from the local heap block
  - Avoids heap contention
- On low utilization, move a heap block to the global heap
  - Avoids space blowup
  (a simplified sketch of this mechanism follows the slide)

[Figure: a global heap above per-processor heaps 0 … P-1.]

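A simplified sketch of the key mechanism (the threshold and data structures here are illustrative assumptions, not Hoard's actual ones): on each free, the owning per-processor heap updates its utilization, and if it is mostly empty it returns a nearly empty heap block to the global heap.

    #include <cstddef>
    #include <vector>

    struct HeapBlock {             // page-sized chunk owned by one heap at a time
      std::size_t inUse = 0;
      std::size_t capacity = 4096;
    };

    struct GlobalHeap {
      std::vector<HeapBlock*> blocks;
    };

    struct PerProcessorHeap {
      std::vector<HeapBlock*> blocks;
      std::size_t inUse = 0;
      std::size_t capacity = 0;

      void noteFree (HeapBlock* b, std::size_t sz, GlobalHeap& global) {
        b->inUse -= sz;
        inUse    -= sz;
        const double EMPTY_FRACTION = 0.25;   // illustrative threshold
        // If this heap is mostly empty, give a nearly empty block back to the
        // global heap so other processors can reuse it (bounds space blowup).
        if (inUse < EMPTY_FRACTION * capacity &&
            b->inUse < EMPTY_FRACTION * b->capacity) {
          for (std::size_t i = 0; i < blocks.size(); ++i) {
            if (blocks[i] == b) { blocks.erase (blocks.begin() + i); break; }
          }
          capacity -= b->capacity;
          global.blocks.push_back (b);
        }
      }
    };
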
Summary of Analytical Results

- Space consumption: near-optimal worst case (restated below)
  - Hoard:    O(n log M/m + P)    (P « n)
  - Optimal:  O(n log M/m)    [Robson 70]
  - Private heaps with ownership:  O(P n log M/m)
  where n = memory required, M = biggest object size,
  m = smallest object size, P = number of processors
- Provably low synchronization

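The same bounds, written out (notation as on the slide):

    % n = memory required, M = biggest object size,
    % m = smallest object size, P = number of processors.
    \begin{align*}
      \text{Hoard:}                        &\quad O\!\left(n \log \frac{M}{m} + P\right), \quad P \ll n \\
      \text{Optimal [Robson 70]:}          &\quad O\!\left(n \log \frac{M}{m}\right) \\
      \text{Private heaps with ownership:} &\quad O\!\left(P \, n \log \frac{M}{m}\right)
    \end{align*}
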
Empirical Results

- Measure runtime on a 14-processor Sun
  - Allocators
    - Solaris (the system allocator)
    - Ptmalloc (GNU libc)
    - mtmalloc (Sun's "MT-hot" allocator)
  - Micro-benchmarks
    - Threadtest: no sharing
    - Larson: sharing (server-style)
    - Cache-scratch: mostly reads & writes (tests for false sharing)
- Real application experience is similar

Runtime Performance: threadtest

- Many threads, no sharing
- Hoard achieves linear speedup

  speedup(x, P) = runtime(Solaris allocator, one processor)
                  / runtime(x on P processors)

Runtime Performance: Larson

- Many threads, sharing (server-style)
- Hoard achieves linear speedup

Runtime Performance: false sharing

- Many threads, mostly reads & writes of heap data
- Hoard achieves linear speedup

Hoard in the “Real World”

- Open source code
  - www.hoard.org
  - 13,000 downloads
  - Solaris, Linux, Windows, IRIX, …
- Widely used in industry
  - AOL, British Telecom, Novell, Philips
  - Reports: 2x-10x, “impressive” improvements in performance
  - Search server, telecom billing systems, scene rendering, real-time
    messaging middleware, text-to-speech engine, telephony, JVM
- A scalable general-purpose memory manager

Overview

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Extended memory manager for servers
  - Reap

Custom Memory Allocation

- Replace new/delete, bypassing the general-purpose allocator
  - Reduce runtime – often
  - Expand functionality – sometimes
  - Reduce space – rarely
- Very common practice
  - Apache, gcc, lcc, STL, database servers…
  - Language-level support in C++
  - The folk advice: “Use custom allocators”

The Reality

- The Lea allocator is often as fast or faster
- Custom allocation is ineffective, except for regions.  [OOPSLA 2002]

[Figure: runtime on the custom-allocator benchmarks, normalized, for Custom,
Win32, and DLmalloc, grouped into non-regions, regions, and averages.]

Overview of Regions

- Separate areas, deletion only en masse
  (a minimal sketch follows this slide)

    regioncreate(r)
    regionmalloc(r, sz)
    regiondelete(r)

+ Fast
  + Pointer-bumping allocation
  + Deletion of chunks
+ Convenient
  + One call frees all memory
- Risky
  - Accidental deletion
  - Too much space

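A minimal sketch matching the API on the slide (the chunk size, alignment handling, and layout are assumptions):

    #include <cstddef>
    #include <cstdlib>

    struct Chunk { Chunk* next; };           // chunk header; data follows it

    struct Region {
      Chunk* chunks = nullptr;               // all chunks, freed en masse
      char*  bump   = nullptr;               // next free byte in the current chunk
      char*  limit  = nullptr;
      static const std::size_t CHUNK_SIZE = 64 * 1024;
    };

    Region* regioncreate () { return new Region; }

    void* regionmalloc (Region* r, std::size_t sz) {
      if (r->chunks == nullptr || r->bump + sz > r->limit) {
        // Current chunk is full (or there is none yet): grab another.
        std::size_t n = (sz > Region::CHUNK_SIZE) ? sz : Region::CHUNK_SIZE;
        Chunk* c = static_cast<Chunk*>(std::malloc (sizeof(Chunk) + n));
        c->next = r->chunks;
        r->chunks = c;
        r->bump  = reinterpret_cast<char*>(c + 1);
        r->limit = r->bump + n;
      }
      void* p = r->bump;                     // pointer-bumping allocation
      r->bump += sz;
      return p;
    }

    void regiondelete (Region* r) {          // one call frees all memory
      for (Chunk* c = r->chunks; c != nullptr; ) {
        Chunk* next = c->next;
        std::free (c);
        c = next;
      }
      delete r;
    }
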
Why Regions?

- Apparently faster and more space-efficient
- Servers need memory management support:
  - Avoid resource leaks
    - Tear down memory associated with terminated connections or transactions
  - Current approach (e.g., Apache): regions

Drawbacks of Regions

- Can’t reclaim memory within regions
  - A problem for long-running computations, producer-consumer patterns,
    and off-the-shelf “malloc/free” programs
  - Unbounded memory consumption
- Current situation for Apache:
  - Vulnerable to denial of service
  - Limits the runtime of connections
  - Limits module programming

Reap Hybrid Allocator

- Reap = region + heap
  - Adds individual object deletion & a heap
    (a rough sketch follows this slide)

    reapcreate(r)
    reapmalloc(r, sz)
    reapfree(r, p)
    reapdelete(r)

- Can reduce memory consumption
- Fast
  - Adapts to its use (region or heap style)
  - Cheap deletion

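A rough sketch of the idea, not Reap's actual implementation: allocation reuses individually freed objects when any fit (simplified here to a single first-fit free list rather than size classes), and otherwise falls back to region-style bump allocation; deleting the whole reap stays cheap. The region functions are the ones sketched under "Overview of Regions" above.

    #include <cstddef>

    // Declared in the region sketch above.
    struct Region;
    Region* regioncreate ();
    void*   regionmalloc (Region* r, std::size_t sz);
    void    regiondelete (Region* r);

    struct Reap {
      struct Header { Header* next; std::size_t size; };  // per-object header
      Region* region;                // backing region for bump allocation
      Header* freeList = nullptr;    // individually freed objects, reused first
    };

    Reap* reapcreate () { return new Reap { regioncreate() }; }

    void* reapmalloc (Reap* r, std::size_t sz) {
      // Heap-style reuse: first fit from the free list.
      Reap::Header** link = &r->freeList;
      while (*link != nullptr) {
        if ((*link)->size >= sz) {
          Reap::Header* h = *link;
          *link = h->next;
          return h + 1;
        }
        link = &(*link)->next;
      }
      // Otherwise, region-style bump allocation with a size-recording header.
      Reap::Header* h = static_cast<Reap::Header*>(
          regionmalloc (r->region, sizeof(Reap::Header) + sz));
      h->size = sz;
      return h + 1;
    }

    void reapfree (Reap* r, void* p) {           // individual object deletion
      Reap::Header* h = static_cast<Reap::Header*>(p) - 1;
      h->next = r->freeList;
      r->freeList = h;
    }

    void reapdelete (Reap* r) {                  // cheap en-masse deletion
      regiondelete (r->region);
      delete r;
    }
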
Using Reap as Regions

[Figure: runtime on the region-based benchmarks lcc and mudlle, normalized,
for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; the tallest bar is
clipped at 4.08.]

- Reap performance nearly matches regions

Reap: Best of Both Worlds

- Combining new/delete with regions is usually impossible:
  - Incompatible APIs
  - Hard to rewrite code
- Using Reap: incorporate new/delete code into Apache
  - “mod_bc” (an arbitrary-precision calculator)
    - Changed 20 lines (out of 8000)
  - Benchmark: compute the 1000th prime
    - With Reap: 240K
    - Without Reap: 7.4MB

Summary

- Building memory managers
  - Heap Layers framework [PLDI 2001]
- Problems with current memory managers
  - Contention, false sharing, space
- Solution: a provably scalable memory manager
  - Hoard [ASPLOS-IX]
- Extended memory manager for servers
  - Reap [OOPSLA 2002]

Current Projects

- CRAMM: Cooperative Robust Automatic Memory Management
  - Garbage collection without paging
  - Automatic heap sizing
- SAVMM: Scheduler-Aware Virtual Memory Management
- Markov: a programming language for building high-performance servers
- COLA: Customizable Object Layout Algorithms
  - Improving locality in Java

www.cs.umass.edu/~plasma




Looking Forward

- “New” programming languages
  - Increasing use of Java = garbage collection
- New architectures
  - NUMA; SMT/CMP (“hyperthreading”)
- Technology trends
  - The memory hierarchy

The Ever-Steeper Memory Hierarchy

- Higher = smaller, faster, closer to the CPU
- A real desktop machine (mine):

    registers   8 integer, 8 floating-point; 1-cycle latency
    L1 cache    8K data & instructions; 2-cycle latency
    L2 cache    512K; 7-cycle latency
    RAM         1GB; 100-cycle latency
    Disk        40GB; 38,000,000-cycle latency (!)

Swapping & Throughput

- Heap > available memory: throughput plummets

Why Manage Memory At All?

- Just buy more!
  - Simplifies memory management
    - Still have to collect garbage eventually…
- Workload fits in RAM = no more swapping!
- Sounds great…

Memory Prices Over Time

[Figure: RAM prices over time (in 1977 dollars), dollars per GB on a log
scale from $0.01 to $10,000, for conventional DRAM generations (2K, 8K, 32K,
128K, 512K, 2M, 8M) from 1977 to 2005.]

  “Soon it will be free…”

Memory Prices: Inflection Point!

[Figure: the same RAM price chart extended with newer technologies (SDRAM,
RDRAM, DDR, Chipkill; 512M and 1G parts), showing an inflection point in the
price trend.]

Memory Is Actually Expensive

- Desktops:
  - Most ship with 256MB
  - 1GB = 50% more $$
  - Laptops = 70% more, if possible
    - Limited capacity
- Servers:
  - Buy 4GB, get 1 CPU free!
  - Sun Enterprise 10000: 8GB extra = $150,000!
    (8GB of Sun RAM = 1 Ferrari Modena)
  - Fast RAM – new technologies
  - Cosmic rays…

Key Problem: Paging

- Garbage collectors: VM-oblivious
  - GC disrupts the LRU queue
  - Touches non-resident pages
- Virtual memory managers: GC-oblivious
  - Likely to evict pages needed by the GC
- Paging
  - Orders of magnitude more time than RAM
  - BIG hit in performance and LONG pauses

Cooperative Robust Automatic Memory Management (CRAMM)

- The application announces itself to the virtual memory manager as
  cooperative; the VM tracks per-process and overall memory utilization
- Coarse-grained (heap-level): the VM reports changes in memory pressure
  and a new heap size; the garbage collector adjusts its heap size
- Fine-grained (page-level): the VM sends page-eviction notifications; the
  collector evacuates pages and selects victim pages for page replacement
- Joint work: Eliot Moss (UMass), Scott Kaplan (Amherst College)

Fine-Grained Cooperative GC

- Fine-grained cooperation: the VM sends page-eviction notifications and
  victim pages; the collector evacuates pages and selects victim pages for
  page replacement
- Goal: GC triggers no additional paging
- Key ideas:
  - Adapt the collection strategy on the fly
  - Page-oriented memory management
  - Exploit detailed page information from the VM

Summary

- Building memory managers
  - Heap Layers framework
- Problems with memory managers
  - Contention, space, false sharing
- Solution: a provably scalable allocator
  - Hoard
- Future directions

If You Have to Spend $$...

- More memory: bad. More Ferraris: good.

www.cs.umass.edu/~emery/plasma
This Page Intentionally Left Blank




Virtual Memory Manager Support

- New VM required: detailed page-level information
  - A “segmented queue” (an unprotected segment plus a protected segment)
    keeps the overhead low
- Local LRU order per process, not global LRU (gLRU, as in Linux)
  - Complementary to the SAVM work:
    the “Scheduler-Aware Virtual Memory manager”
- Under development – a modified Linux kernel

Current Work: Robust Performance

- Currently: no VM-GC communication
  - BAD interactions under memory pressure
- Our approach (with Eliot Moss, Scott Kaplan):
  Cooperative Robust Automatic Memory Management
  - The virtual memory manager and the garbage collector / allocator exchange
    LRU-queue and memory-pressure information and empty pages, reducing the
    collector’s impact on paging

Current Work: Predictable VMM

- Recent work on scheduling for QoS
  - E.g., proportional-share
  - Under memory pressure, the VMM is the scheduler
    - Paged-out processes may never recover
    - Intermittent processes may wait a long time
- Scheduler-faithful virtual memory
  (with Scott Kaplan, Prashant Shenoy)
  - Based on page value rather than order

Conclusion

Memory management for high-performance applications:
- Heap Layers framework [PLDI 2001]
  - Reusable components, no runtime cost
- Hoard scalable memory manager [ASPLOS-IX]
  - High-performance, provably scalable & space-efficient
- Reap hybrid memory manager [OOPSLA 2002]
  - Provides speed & robustness for server applications
- Current work: robust memory management for multiprogramming

The Obligatory URL Slide

http://www.cs.umass.edu/~emery




If You Can Read This,
I Went Too Far




Hoard: Under the Hood

[Diagram: Hoard composed from heap layers. A SelectSizeHeap selects a heap
based on object size; large objects (> 4K) go to a MallocOrFreeHeap. Small
objects go to a PerProcessorHeap of LockedHeap / HeapBlockManager layers,
which malloc from the local heap block and free to the owning heap block
(FreeToHeapBlock); empty heap blocks are obtained from or returned to a
global heap (a LockedHeap over a HeapBlockManager over the SuperblockHeap /
system heap).]

(An illustrative Heap Layers composition sketch follows this slide.)

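As a rough illustration of how a design like this is expressed with Heap Layers, the composition below nests mixins in roughly the order of the diagram; the layer bodies are stubs and the ordering is an assumption, not Hoard's actual code.

    // Layer names follow the diagram; bodies are stubs for illustration only.
    template <class SuperHeap> class LockedHeap       : public SuperHeap { /* adds a lock */ };
    template <class SuperHeap> class HeapBlockManager : public SuperHeap { /* manages heap blocks */ };
    template <class SuperHeap> class PerProcessorHeap : public SuperHeap { /* one heap per processor */ };
    template <class SuperHeap> class SelectSizeHeap   : public SuperHeap { /* dispatches on object size */ };
    class SystemHeap { /* obtains memory from the operating system */ };

    // A global heap: heap blocks, protected by a lock, backed by the system heap.
    typedef HeapBlockManager<LockedHeap<SystemHeap> > GlobalHeap;

    // Per-processor heaps layered over the global heap, with size-based dispatch on top.
    typedef SelectSizeHeap<PerProcessorHeap<HeapBlockManager<LockedHeap<GlobalHeap> > > >
        HoardLikeHeap;
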
Custom Memory Allocation

- Replace new/delete, bypassing the general-purpose allocator
  - Reduce runtime – often
  - Expand functionality – sometimes
  - Reduce space – rarely
- Very common practice
  - Apache, gcc, lcc, STL, database servers…
  - Language-level support in C++
  - The folk advice: “Use custom allocators”

Drawbacks of Custom Allocators

- Avoiding the memory manager means:
  - More code to maintain & debug
  - Can’t use memory debuggers
  - Not modular or robust:
    - Mix memory from custom and general-purpose allocators → crash!
- Increased burden on programmers

Overview

- Introduction
- Perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps – a generalization of regions & heaps

(I) Per-Class Allocators

- Recycle freed objects of a class from a free list
  (a minimal sketch follows this slide)

    a = new Class1;
    b = new Class1;
    c = new Class1;
    delete a;
    delete b;
    delete c;
    a = new Class1;
    b = new Class1;
    c = new Class1;

[Figure: the Class1 free list links the freed objects a, b, and c.]

+ Fast
  + Linked-list operations
+ Simple
  + Identical semantics
  + C++ language support
- Possibly space-inefficient

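A minimal sketch of the technique (the class body and payload are made up for illustration, and derived classes are ignored): class-specific operator new and operator delete recycle freed objects through a free list.

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    class Class1 {
    public:
      void* operator new (std::size_t sz) {
        if (freeList != nullptr) {          // recycle a freed object if possible
          void* p = freeList;
          freeList = freeList->next;
          return p;
        }
        return std::malloc (sz);            // otherwise go to the general heap
      }
      void operator delete (void* p) noexcept {
        Node* n = static_cast<Node*>(p);    // push the freed object on the list
        n->next = freeList;
        freeList = n;
      }
    private:
      struct Node { Node* next; };
      static Node* freeList;
      double payload[4];                    // illustrative payload (>= sizeof(Node))
    };

    Class1::Node* Class1::freeList = nullptr;
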
(II) Custom Patterns

- Tailor-made to fit allocation patterns
  - Example: 197.parser (a natural language parser)
    (a sketch follows this slide)

    a = xalloc(8);
    b = xalloc(16);
    c = xalloc(8);
    xfree(b);
    xfree(c);
    d = xalloc(8);

[Figure: a fixed char[MEMORY_LIMIT] array holding a, b, and c;
end_of_array marks the bump pointer.]

+ Fast
  + Pointer-bumping allocation
- Brittle
  - Fixed memory size
  - Requires stack-like lifetimes

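A sketch of what this kind of custom pattern looks like (the details here are assumptions based on the slide, not 197.parser's actual code): a fixed arena, a bump pointer, and frees that only roll the pointer back.

    #include <cassert>
    #include <cstddef>

    static const std::size_t MEMORY_LIMIT = 1 << 20;   // fixed memory size (brittle)
    static char  arena[MEMORY_LIMIT];
    static char* end_of_array = arena;                  // bump pointer

    void* xalloc (std::size_t sz) {
      assert (end_of_array + sz <= arena + MEMORY_LIMIT);
      void* p = end_of_array;
      end_of_array += sz;                               // pointer-bumping allocation
      return p;
    }

    // Frees must happen in reverse (stack-like) order: freeing an object just
    // rolls the bump pointer back to where that object begins.
    void xfree (void* p) {
      end_of_array = static_cast<char*>(p);
    }
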
(III) Regions

- Separate areas, deletion only en masse

    regioncreate(r)
    regionmalloc(r, sz)
    regiondelete(r)

+ Fast
  + Pointer-bumping allocation
  + Deletion of chunks
+ Convenient
  + One call frees all memory
- Risky
  - Accidental deletion
  - Too much space

Overview

- Introduction
- Perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps – a generalization of regions & heaps

Custom Allocators Are Faster…

[Figure: runtime on the custom-allocator benchmarks, normalized, comparing
Custom against Win32, grouped into non-regions, regions, and averages; the
custom allocators appear faster.]

Not So Fast…
                             [Bar chart: Runtime - Custom Allocator Benchmarks, now adding
                             DLmalloc alongside the Custom and Win32 bars. Normalized runtime
                             for the non-region benchmarks (197.parser, boxed-sim, c-breeze,
                             175.vpr, 176.gcc), the region benchmarks (apache, lcc, mudlle),
                             and the non-region, region, and overall averages.]

 UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science                                  72
                             AMHERST
The Lea Allocator (DLmalloc 2.7.0)
    Optimized for common allocation patterns


         Per-size quicklists ≈ per-class allocation
     


    Deferred coalescing (combining adjacent free objects)
    Highly-optimized fastpath


    Space-efficient





    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   73
                                AMHERST
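
To make the per-size quicklist idea on the slide above concrete, here is a minimal C++ sketch of size-segregated free lists with deferred coalescing. It is illustrative only, in the spirit of but not copied from DLmalloc: the class name, the 8-byte size classes, and the size-passing deallocate are simplifications (DLmalloc records object sizes in chunk headers instead).

#include <cstddef>
#include <cstdlib>

// Illustrative sketch only: per-size quicklists with deferred coalescing.
class QuicklistHeap {
  static const std::size_t NumClasses = 32;          // classes of 8, 16, ..., 256 bytes
  struct FreeObject { FreeObject * next; };
  FreeObject * quicklist[NumClasses];

  static std::size_t sizeClass (std::size_t sz) { return (sz + 7) / 8 - 1; }

public:
  QuicklistHeap () {
    for (std::size_t i = 0; i < NumClasses; i++) quicklist[i] = 0;
  }

  void * allocate (std::size_t sz) {
    if (sz == 0) sz = 1;
    if (sz <= NumClasses * 8) {
      std::size_t c = sizeClass (sz);
      if (quicklist[c]) {                             // fast path: reuse a same-sized object
        FreeObject * obj = quicklist[c];
        quicklist[c] = obj->next;
        return obj;
      }
      sz = (c + 1) * 8;                               // round up to the class size
    }
    return std::malloc (sz);                          // slow path: the general heap
  }

  void deallocate (void * ptr, std::size_t sz) {
    if (!ptr) return;
    if (sz <= NumClasses * 8) {                       // deferred coalescing: just push it
      std::size_t c = sizeClass (sz);
      FreeObject * obj = static_cast<FreeObject *> (ptr);
      obj->next = quicklist[c];
      quicklist[c] = obj;
    } else {
      std::free (ptr);
    }
  }
};
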
Space Consumption Results
                             [Bar chart: Space - Custom Allocator Benchmarks. Normalized space
                             for the original custom allocators vs. DLmalloc on the non-region
                             benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc),
                             the region benchmarks (apache, lcc, mudlle), and the non-region,
                             region, and overall averages.]
                   UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science                          74
                                               AMHERST
Overview
    Introduction


    Perceived benefits and drawbacks


    Three main kinds of custom allocators


    Comparison with general-purpose allocators


    Advantages and drawbacks of regions


    Reaps – generalization of regions & heaps





    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   75
                                AMHERST
Why Regions?
    Apparently faster, more space-efficient


    Servers need memory management support:


         Avoid resource leaks
     

              Tear down memory associated with terminated
          

              connections or transactions
         Current approach (e.g., Apache): regions
     




    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   76
                                AMHERST
Drawbacks of Regions
    Can’t reclaim memory within regions


         Problem for long-running computations,
         producer-consumer patterns, and
         off-the-shelf “malloc/free” programs
              → unbounded memory consumption




    Current situation for Apache:


         vulnerable to denial-of-service
     


         limits runtime of connections
     


         limits module programming
     


    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   77
                                AMHERST
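
A minimal sketch of the producer-consumer drawback from the slide above, written against the region interface named on the earlier slides. The handle-returning declarations and consume() are assumptions for illustration; real region libraries have their own signatures.

#include <cstddef>

// Assumed C-style region interface (actual signatures vary by library):
extern "C" {
  void * regioncreate ();
  void * regionmalloc (void * region, std::size_t sz);
  void   regiondelete (void * region);
}

void consume (char * msg);   // placeholder for the consumer's work

// Why regions leak here: each message is dead after consume(), but a region
// offers no per-object free, so nothing is reclaimed until regiondelete.
void produce_and_consume (int iterations) {
  void * r = regioncreate ();
  for (int i = 0; i < iterations; i++) {
    char * msg = (char *) regionmalloc (r, 4096);   // producer allocates
    consume (msg);                                  // consumer finishes with it...
    // ...but the 4 KB stays live: memory grows with the number of iterations.
  }
  regiondelete (r);   // every iteration's message is freed only here
}
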
Reap Hybrid Allocator
          Reap = region + heap
      

               Adds individual object deletion & heap
           

reapcreate(r)
                             r
reapmalloc(r, sz)
reapfree(r,p)
reapdelete(r)

          Can reduce memory consumption
      


          Fast
      +

               Adapts to use (region or heap style)
           +


               Cheap deletion
           +

          UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   78
                                      AMHERST
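
Below is a usage sketch of the Reap interface named on the slide above. The handle-returning C-style declarations are assumptions for illustration and may differ from the actual header in the released Heap Layers code.

#include <cstddef>

// Assumed declarations matching the slide's reap API (illustrative only):
extern "C" {
  void * reapcreate ();
  void * reapmalloc (void * reap, std::size_t sz);
  void   reapfree   (void * reap, void * ptr);
  void   reapdelete (void * reap);
}

// One reap per connection: individual frees are possible (heap-style),
// and teardown is still a single call (region-style).
void handle_connection () {
  void * r = reapcreate ();

  char * header = (char *) reapmalloc (r, 8192);
  char * body   = (char *) reapmalloc (r, 65536);

  reapfree (r, header);                             // unlike a region, this space...
  char * reply = (char *) reapmalloc (r, 4096);     // ...can be reused by later requests

  // ... build and send the reply ...
  (void) body; (void) reply;

  reapdelete (r);                                   // frees everything remaining in one call
}
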
Using Reap as Regions
                                     [Bar chart: Runtime - Region-Based Benchmarks. Normalized
                                     runtime of Original (regions), Win32, DLmalloc, WinHeap,
                                     Vmalloc, and Reap on lcc and mudlle; one bar is clipped at
                                     the 2.5 axis limit and labeled 4.08.]
Reap performance nearly matches regions
                      UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   79
                                                  AMHERST
Reap: Best of Both Worlds
    Combining new/delete with regions


    usually impossible:
         Incompatible API’s
     

         Hard to rewrite code
     




    Use Reap: Incorporate new/delete code into Apache


     “mod_bc” (arbitrary-precision calculator)

              Changed 20 lines (out of 8000)
          

         Benchmark: compute 1000th prime
     


              With Reap: 240K
          


              Without Reap: 7.4MB
          


    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   80
                                AMHERST
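
The actual mod_bc change is not reproduced here; the following is only a generic C++ sketch of how new/delete-style code can be pointed at a reap via class-level operator new/delete. The reap declarations repeat the assumed interface from the earlier sketch, and the class and field names are hypothetical.

#include <cstddef>

// Assumed reap interface, as in the earlier sketch (illustrative only):
extern "C" {
  void * reapmalloc (void * reap, std::size_t sz);
  void   reapfree   (void * reap, void * ptr);
}

// Generic sketch (not the mod_bc patch): route one class's new/delete through
// whichever reap is current for this request or connection.
struct BigNumber {
  static void * currentReap;        // set by the server before handling a request

  void * operator new (std::size_t sz) {
    return reapmalloc (currentReap, sz);   // a production version would check for failure
  }
  void operator delete (void * p) {
    reapfree (currentReap, p);             // individual deletion still works (heap-style)
  }

  // ... original digits, sign, etc. left unchanged ...
};

void * BigNumber::currentReap = 0;
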
Conclusion
    Empirical study of custom allocators


         Lea allocator often as fast or faster
     


         Custom allocation ineffective,
     

         except for regions
    Reaps:


         Nearly match region performance,
         without regions’ drawbacks
    Take-home message:


         Stop using custom memory allocators!
     




    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   81
                                AMHERST
Software



http://www.cs.umass.edu/~emery

(part of the Heap Layers distribution)




  UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   82
                              AMHERST
Experimental Methodology
    Comparing to general-purpose allocators


         Same semantics: no problem
     

              E.g., disable per-class allocators
          


         Different semantics: use an emulator
     

            Uses general-purpose allocator
          

            but adds bookkeeping
          regionfree: Free all associated objects
           Other functionality (nesting, obstacks)




    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   83
                                AMHERST
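
A minimal sketch of the kind of emulator described on the slide above (all names are illustrative): region calls are forwarded to the general-purpose allocator, and a per-region list of live objects supplies the bookkeeping that lets regionfree release everything at once.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative emulator: region semantics on top of the general-purpose allocator.
struct EmulatedRegion {
  std::vector<void *> objects;     // bookkeeping: everything allocated in this region

  void * emumalloc (std::size_t sz) {
    void * p = std::malloc (sz);   // the general-purpose allocator does the real work
    if (p) objects.push_back (p);
    return p;
  }

  void emufree_all () {            // emulates regionfree / regiondelete
    for (std::size_t i = 0; i < objects.size (); i++) {
      std::free (objects[i]);
    }
    objects.clear ();
  }
};
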
Use Custom Allocators?
    Strongly recommended by practitioners


    Little hard data on performance/space


    improvements
         Only one previous study [Zorn 1992]
     

         Focused on just one type of allocator
     

         Custom allocators: waste of time
     

              Small gains, bad allocators
          


    Different allocators better? Trade-offs?




    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   84
                                AMHERST
Kinds of Custom Allocators
    Three basic types of custom allocators


         Per-class
     

              Fast
          


         Custom patterns
     

              Fast, but very special-purpose
          


         Regions
     

              Fast, possibly more space-efficient
          


              Convenient
          


              Variants: nested, obstacks
          




    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   85
                                AMHERST
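
As a concrete illustration of the per-class style listed above (not code taken from any of the benchmarks), here is a minimal C++ sketch: freed objects of one class are chained on a static free list and recycled by the next allocation of that class. The Event class and its fields are hypothetical.

#include <cstddef>
#include <new>

// Illustrative per-class allocator: recycle freed Events through a free list.
class Event {
  int    id;
  double timestamp;

  static void * freeList;                   // head of the per-class free list

public:
  void * operator new (std::size_t sz) {
    if (freeList) {                         // fast path: pop a recycled object
      void * p = freeList;
      freeList = *static_cast<void **> (p);
      return p;
    }
    return ::operator new (sz);             // otherwise go to the general heap
  }

  void operator delete (void * p) {
    if (!p) return;
    *static_cast<void **> (p) = freeList;   // reuse the object's storage as a link
    freeList = p;                           // note: memory never returns to the
  }                                         // general heap (possible space cost)
};

void * Event::freeList = 0;
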
Optimization Opportunity
                               [Bar chart: Time Spent in Memory Operations. Percentage of runtime
                               spent in memory operations vs. other work for 197.parser, boxed-sim,
                               c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle, and the average.]
               UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science         86
                                           AMHERST
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   87
                            AMHERST
Custom Memory Allocation
    Programmers often replace malloc/free


         Attempt to increase performance
     

         Provide extra functionality (e.g., for servers)
     

         Reduce space (rarely)
     




    Empirical study of custom allocators


         Lea allocator often as fast or faster
     

         Custom allocation ineffective,
     

         except for regions. [OOPSLA 2002]
    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   88
                                AMHERST
Overview of Regions
    Separate areas, deletion only en masse


    regioncreate(r)               r
    regionmalloc(r, sz)
    regiondelete(r)

     + Fast                                            - Risky
          + Pointer-bumping allocation                 - Accidental deletion
          + Deletion of chunks                         - Too much space
     + Convenient
          + One call frees all memory




           UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science    89
                                       AMHERST
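
To make the pluses and minuses above concrete, here is a minimal, illustrative region sketch (not any particular library's implementation): allocation bumps a pointer within the current chunk, and deletion frees whole chunks only, which is exactly what makes individual reclamation impossible.

#include <cstddef>
#include <cstdlib>

// Illustrative region: pointer-bumping allocation, deletion of whole chunks only.
struct Chunk {
  Chunk * next;
  char *  bump;       // next free byte in this chunk
  char *  limit;      // one past the last usable byte
};

struct Region {
  Chunk * chunks;

  Region () : chunks (0) {}

  void * rmalloc (std::size_t sz) {
    sz = (sz + 7) & ~ (std::size_t) 7;                       // keep objects aligned
    if (!chunks || chunks->bump + sz > chunks->limit) {      // start a new chunk
      std::size_t payload = (sz > 65536) ? sz : 65536;
      char * raw = (char *) std::malloc (sizeof (Chunk) + payload);
      Chunk * c  = (Chunk *) raw;
      c->next    = chunks;
      c->bump    = raw + sizeof (Chunk);
      c->limit   = c->bump + payload;
      chunks     = c;
    }
    void * p = chunks->bump;                                  // the fast path: bump and go
    chunks->bump += sz;
    return p;
  }

  void rdelete () {                                           // one call frees all memory
    while (chunks) {
      Chunk * next = chunks->next;
      std::free (chunks);
      chunks = next;
    }
  }
};
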
Why Regions?
    Apparently faster, more space-efficient


    Servers need memory management support:


         Avoid resource leaks
     

              Tear down memory associated with terminated
          

              connections or transactions
         Current approach (e.g., Apache): regions
     




    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   90
                                AMHERST
Drawbacks of Regions
    Can’t reclaim memory within regions


         Problem for long-running computations,
         producer-consumer patterns, and
         off-the-shelf “malloc/free” programs
              → unbounded memory consumption




    Current situation for Apache:


         vulnerable to denial-of-service
     


         limits runtime of connections
     


         limits module programming
     


    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   91
                                AMHERST
Reap Hybrid Allocator
          Reap = region + heap
      

               Adds individual object deletion & heap
           

reapcreate(r)
                             r
reapmalloc(r, sz)
reapfree(r,p)
reapdelete(r)

          Can reduce memory consumption
      

          Fast
      

               Adapts to use (region or heap style)
           


               Cheap deletion
           

          UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   92
                                      AMHERST
Using Reap as Regions
                                     [Bar chart: Runtime - Region-Based Benchmarks. Normalized
                                     runtime of Original (regions), Win32, DLmalloc, WinHeap,
                                     Vmalloc, and Reap on lcc and mudlle; one bar is clipped at
                                     the 2.5 axis limit and labeled 4.08.]
Reap performance nearly matches regions
                      UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   93
                                                  AMHERST
Reap: Best of Both Worlds
    Combining new/delete with regions


    usually impossible:
         Incompatible API’s
     

         Hard to rewrite code
     




    Use Reap: Incorporate new/delete code into Apache


     “mod_bc” (arbitrary-precision calculator)

              Changed 20 lines (out of 8000)
          

         Benchmark: compute 1000th prime
     


              With Reap: 240K
          


              Without Reap: 7.4MB
          


    UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science   94
                                AMHERST


  • 61. The Obligatory URL Slide http://www.cs.umass.edu/~emery UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 61 AMHERST
  • 62. If You Can Read This, I Went Too Far UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 62 AMHERST
  • 63. Hoard: Under the Hood S ystem Heap get or return memory to global heap HeapBlockManager LockedHeap HeapBlockManager HeapBlockManager S uperblockHeap malloc from local heap, LockedHeap Empty LockedHeap LockedHeap free to heap block Heap Blocks P erP rocessorHeap FreeT oHeapBlock Large objects MallocOrF reeHeap (> 4K) S electS izeHeap select heap based on size UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 63 AMHERST
  • 64. Custom Memory Allocation Replace new/delete, Very common practice   bypassing general-purpose Apache, gcc, lcc, STL,  allocator database servers… Language-level Reduce runtime – often   support in C++ Expand functionality – sometimes  Reduce space – rarely  “Use custom allocators” UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science 64 AMHERST
• 65. Drawbacks of Custom Allocators
    Avoiding the memory manager means:
      More code to maintain & debug
      Can't use memory debuggers
      Not modular or robust: mix memory from custom and general-purpose allocators → crash!
    Increased burden on programmers
• 66. Overview
    Introduction
      Perceived benefits and drawbacks
    Three main kinds of custom allocators
    Comparison with general-purpose allocators
    Advantages and drawbacks of regions
    Reaps: a generalization of regions & heaps
• 67. (I) Per-Class Allocators
    Recycle freed objects of a class from a free list:
      a = new Class1; b = new Class1; c = new Class1;
      delete a; delete b; delete c;                    // the objects go onto Class1's free list
      a = new Class1; b = new Class1; c = new Class1;  // served from the free list
    + Fast: free-list (linked-list) operations
    + Simple
    + Identical semantics
    + C++ language support
    - Possibly space-inefficient
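The recycling pattern above maps directly onto C++'s class-level operator new / operator delete. The following is a generic sketch of that pattern (single-threaded, and assuming no derived classes share the operators), not code from the talk; only the Class1 name comes from the slide.

    #include <cstdlib>
    #include <new>

    // Sketch of a per-class allocator: Class1 keeps its own free list and
    // recycles freed instances instead of returning them to the general heap.
    class Class1 {
    public:
        static void* operator new(std::size_t sz) {
            if (FreeNode* node = free_list_) {        // fast path: recycle a freed object
                free_list_ = node->next;
                return node;
            }
            // Allocate at least enough room to hold the free-list link later.
            void* p = std::malloc(sz < sizeof(FreeNode) ? sizeof(FreeNode) : sz);
            if (p == nullptr) throw std::bad_alloc();
            return p;
        }

        static void operator delete(void* p) noexcept {
            if (p == nullptr) return;
            FreeNode* node = static_cast<FreeNode*>(p);
            node->next = free_list_;                  // push onto the per-class free list
            free_list_ = node;
        }

    private:
        struct FreeNode { FreeNode* next; };
        static inline FreeNode* free_list_ = nullptr; // single-threaded sketch

        int payload_ = 0;                             // stand-in for per-object state
    };

    int main() {
        Class1* a = new Class1; Class1* b = new Class1; Class1* c = new Class1;
        delete a; delete b; delete c;                     // all three land on the free list
        a = new Class1; b = new Class1; c = new Class1;   // served from the free list
        delete a; delete b; delete c;
        return 0;
    }

The fast path is a couple of pointer operations; the space cost the slide flags comes from freed Class1 objects never being returned for use by other classes or sizes.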
• 68. (II) Custom Patterns
    Tailor-made to fit allocation patterns
    Example: 197.parser (natural language parser)
      All objects are carved out of a fixed char[MEMORY_LIMIT] array
      by bumping an end_of_array pointer:
        a = xalloc(8); b = xalloc(16); c = xalloc(8);
        xfree(b); xfree(c);
        d = xalloc(8);
    + Fast
    + Pointer-bumping allocation
    - Brittle
    - Fixed memory size
    - Requires stack-like lifetimes
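For illustration, here is a rough sketch of a pointer-bumping allocator of this kind. It is modeled on the slide's description rather than on 197.parser's actual xalloc/xfree source, and it assumes strict LIFO frees, which is stronger than the exact call sequence the slide shows.

    #include <cassert>
    #include <cstddef>

    // Sketch of a pointer-bumping allocator over a fixed array. Assumption:
    // frees happen in LIFO order (strict stack-like lifetimes).
    constexpr std::size_t MEMORY_LIMIT = 1 << 20;        // invented size
    static char  memory[MEMORY_LIMIT];
    static char* end_of_array = memory;                  // first unused byte

    void* xalloc(std::size_t sz) {
        sz = (sz + 7) & ~std::size_t(7);                 // keep allocations 8-byte aligned
        assert(end_of_array + sz <= memory + MEMORY_LIMIT && "fixed memory exhausted");
        void* p = end_of_array;
        end_of_array += sz;                              // allocation is just a pointer bump
        return p;
    }

    void xfree(void* p) {
        // Rolling back to p releases p and everything allocated after it,
        // which is only safe when lifetimes nest like a stack.
        end_of_array = static_cast<char*>(p);
    }

    int main() {
        void* a = xalloc(8);
        void* b = xalloc(16);
        void* c = xalloc(8);
        xfree(c);                                        // LIFO order in this sketch
        xfree(b);
        void* d = xalloc(8);                             // reuses the space b occupied
        (void)a; (void)d;
        return 0;
    }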
• 69. (III) Regions
    Separate areas of memory; deletion only en masse
      regioncreate(r)
      regionmalloc(r, sz)
      regiondelete(r)
    + Fast
    + Pointer-bumping allocation
    + Deletion of chunks
    + Convenient
    + One call frees all memory
    - Risky
    - Accidental deletion
    - Too much space
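A minimal sketch of the region API named on the slide (regioncreate / regionmalloc / regiondelete). The chunk-chaining internals, the chunk size, and the Region/Chunk types are invented for illustration; real region libraries differ in detail.

    #include <cstddef>
    #include <cstdlib>

    // Sketch of a region allocator: each region owns a chain of chunks, objects are
    // carved out by bumping a pointer, and regiondelete frees every chunk at once.
    struct Chunk {
        Chunk*      next;
        std::size_t used;
        std::size_t capacity;
        // object bytes follow this header
    };

    struct Region {
        Chunk* chunks = nullptr;
    };

    static const std::size_t CHUNK_BYTES = 64 * 1024;    // invented chunk size

    void regioncreate(Region& r) { r.chunks = nullptr; }

    void* regionmalloc(Region& r, std::size_t sz) {
        sz = (sz + 7) & ~std::size_t(7);                  // keep allocations 8-byte aligned
        Chunk* c = r.chunks;
        if (c == nullptr || c->used + sz > c->capacity) { // need a fresh chunk
            std::size_t cap = sz > CHUNK_BYTES ? sz : CHUNK_BYTES;
            c = static_cast<Chunk*>(std::malloc(sizeof(Chunk) + cap));
            if (c == nullptr) return nullptr;
            c->next = r.chunks;
            c->used = 0;
            c->capacity = cap;
            r.chunks = c;
        }
        char* base = reinterpret_cast<char*>(c + 1);      // bytes just after the header
        void* p = base + c->used;
        c->used += sz;                                    // pointer-bumping allocation
        return p;
    }

    void regiondelete(Region& r) {
        for (Chunk* c = r.chunks; c != nullptr; ) {       // one call frees all memory
            Chunk* next = c->next;
            std::free(c);
            c = next;
        }
        r.chunks = nullptr;
    }

    int main() {
        Region r;
        regioncreate(r);
        char* msg = static_cast<char*>(regionmalloc(r, 64));
        (void)msg;
        regiondelete(r);                                  // msg and everything else is gone
        return 0;
    }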
• 70. Overview
    Introduction
      Perceived benefits and drawbacks
    Three main kinds of custom allocators
    Comparison with general-purpose allocators
    Advantages and drawbacks of regions
    Reaps: a generalization of regions & heaps
• 71. Custom Allocators Are Faster…
    [Bar chart: Runtime - Custom Allocator Benchmarks. Normalized runtime, Custom vs. Win32,
    across the benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, lcc, mudlle, apache),
    grouped into non-regions, regions, and overall averages.]
• 72. Not So Fast…
    [Bar chart: Runtime - Custom Allocator Benchmarks, as on the previous slide but now also
    showing DLmalloc alongside Custom and Win32 (normalized runtime; non-regions, regions, averages).]
• 73. The Lea Allocator (DLmalloc 2.7.0)
    Optimized for common allocation patterns
    Per-size quicklists ≈ per-class allocation
    Deferred coalescing (combining adjacent free objects)
    Highly optimized fast path
    Space-efficient
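To illustrate the quicklist idea only (this is a simplified conceptual sketch, not DLmalloc's actual data structures or code): freed blocks are pushed onto a list for their size class without being coalesced, so a later request of the same size is a single pop, and coalescing is deferred until the lists are flushed. For simplicity the sketch makes the caller pass the size back to free.

    #include <array>
    #include <cstddef>
    #include <cstdlib>

    // Conceptual sketch of per-size quicklists with deferred coalescing
    // (illustrative only; DLmalloc's real implementation is different).
    class QuicklistHeap {
    public:
        void* malloc(std::size_t sz) {
            std::size_t bin = size_class(sz);
            if (bin >= kBins) return std::malloc(sz);     // large requests bypass the quicklists
            if (FreeBlock* b = bins_[bin]) {              // fast path: pop a recycled block
                bins_[bin] = b->next;
                return b;
            }
            return std::malloc(bin_bytes(bin));           // slow path: underlying allocator
        }

        // For simplicity the caller passes the request size back to free().
        void free(void* p, std::size_t sz) {
            std::size_t bin = size_class(sz);
            if (bin >= kBins) { std::free(p); return; }
            FreeBlock* b = static_cast<FreeBlock*>(p);
            b->next = bins_[bin];                         // deferred coalescing: just stash
            bins_[bin] = b;                               // the block for later reuse
        }

        // Called occasionally (e.g., under memory pressure) to release quicklisted
        // blocks; a real allocator would coalesce adjacent free blocks at this point.
        void flush() {
            for (auto& head : bins_) {
                while (head) { FreeBlock* next = head->next; std::free(head); head = next; }
            }
        }

    private:
        struct FreeBlock { FreeBlock* next; };
        static constexpr std::size_t kBins = 32;
        static std::size_t size_class(std::size_t sz) {   // 16-byte size classes
            return sz == 0 ? 1 : (sz + 15) / 16;
        }
        static std::size_t bin_bytes(std::size_t bin) { return bin * 16; }
        std::array<FreeBlock*, kBins> bins_{};
    };

    int main() {
        QuicklistHeap heap;
        void* p = heap.malloc(24);
        heap.free(p, 24);              // lands on the 32-byte quicklist
        void* q = heap.malloc(24);     // fast path: the same block comes straight back
        heap.free(q, 24);
        heap.flush();
        return 0;
    }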
• 74. Space Consumption Results
    [Bar chart: Space - Custom Allocator Benchmarks. Normalized space, original custom allocator
    vs. DLmalloc, across the same benchmarks, grouped into non-regions, regions, and averages.]
• 75. Overview
    Introduction
      Perceived benefits and drawbacks
    Three main kinds of custom allocators
    Comparison with general-purpose allocators
    Advantages and drawbacks of regions
    Reaps: a generalization of regions & heaps
• 76. Why Regions?
    Apparently faster and more space-efficient
    Servers need memory management support:
      Avoid resource leaks
      Tear down memory associated with terminated connections or transactions
    Current approach (e.g., Apache): regions
• 77. Drawbacks of Regions
    Can't reclaim memory within regions
      A problem for long-running computations, producer-consumer patterns,
      and off-the-shelf "malloc/free" programs → unbounded memory consumption
    Current situation for Apache:
      Vulnerable to denial-of-service
      Limits the runtime of connections
      Limits module programming
• 78. Reap Hybrid Allocator
    Reap = region + heap
      Adds individual object deletion & a heap
      reapcreate(r)
      reapmalloc(r, sz)
      reapfree(r, p)
      reapdelete(r)
    Can reduce memory consumption
    + Fast
    + Adapts to use (region or heap style)
    + Cheap deletion
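As a toy illustration of the hybrid behind the API above, here is a sketch that layers a free list on top of region-style allocation: reapfree makes an individual block reusable, while reapdelete still releases everything at once. This is my own simplification for exposition; Reap's real implementation (built from heap layers) differs, and the signatures below are assumptions.

    #include <cstddef>
    #include <cstdlib>

    // Toy sketch of the region + heap hybrid: per-reap bookkeeping plus a free list.
    struct ReapBlock {
        ReapBlock*  next_all;   // chains every block the reap owns (for reapdelete)
        ReapBlock*  next_free;  // chains blocks returned by reapfree (for reuse)
        std::size_t size;
    };

    struct Reap {
        ReapBlock* all   = nullptr;
        ReapBlock* freed = nullptr;
    };

    void reapcreate(Reap& r) { r.all = nullptr; r.freed = nullptr; }

    void* reapmalloc(Reap& r, std::size_t sz) {
        // Heap-style path: reuse a freed block if one is large enough.
        for (ReapBlock** prev = &r.freed; *prev != nullptr; prev = &(*prev)->next_free) {
            if ((*prev)->size >= sz) {
                ReapBlock* b = *prev;
                *prev = b->next_free;
                return b + 1;                             // payload follows the header
            }
        }
        // Region-style path: take fresh memory and remember it for reapdelete.
        ReapBlock* b = static_cast<ReapBlock*>(std::malloc(sizeof(ReapBlock) + sz));
        b->size = sz;
        b->next_all = r.all;
        b->next_free = nullptr;
        r.all = b;
        return b + 1;
    }

    void reapfree(Reap& r, void* p) {
        ReapBlock* b = static_cast<ReapBlock*>(p) - 1;
        b->next_free = r.freed;                           // individual object deletion
        r.freed = b;
    }

    void reapdelete(Reap& r) {
        for (ReapBlock* b = r.all; b != nullptr; ) {      // one call frees all memory
            ReapBlock* next = b->next_all;
            std::free(b);
            b = next;
        }
        r.all = nullptr;
        r.freed = nullptr;
    }

    int main() {
        Reap r;
        reapcreate(r);
        void* a = reapmalloc(r, 64);
        void* b = reapmalloc(r, 64);
        reapfree(r, a);                   // heap-style: a's block becomes reusable
        void* c = reapmalloc(r, 48);      // satisfied from the freed block
        (void)b; (void)c;
        reapdelete(r);                    // region-style: everything is released
        return 0;
    }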
• 79. Using Reap as Regions
    [Bar chart: Runtime - Region-Based Benchmarks (lcc, mudlle). Normalized runtime for the
    original region allocator vs. Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar reaches 4.08.]
    Reap performance nearly matches regions
• 80. Reap: Best of Both Worlds
    Combining new/delete with regions is usually impossible:
      Incompatible APIs
      Hard to rewrite code
    Using Reap: incorporated new/delete code into Apache as "mod_bc" (an arbitrary-precision calculator)
      Changed 20 lines (out of 8000)
      Benchmark: compute the 1000th prime
        With Reap: 240K
        Without Reap: 7.4MB
• 81. Conclusion
    Empirical study of custom allocators
      The Lea allocator is often as fast or faster
      Custom allocation is ineffective, except for regions
    Reaps:
      Nearly match region performance, without the other drawbacks
    Take-home message: stop using custom memory allocators!
• 82. Software http://www.cs.umass.edu/~emery (part of Heap Layers distribution)
• 83. Experimental Methodology
    Comparing to general-purpose allocators
      Same semantics: no problem (e.g., disable per-class allocators)
      Different semantics: use an emulator
        Uses the general-purpose allocator but adds bookkeeping
        regionfree: free all associated objects
        Other functionality (nesting, obstacks)
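A minimal sketch of the kind of region emulation described here, assuming the simplest possible bookkeeping: each region records the pointers it handed out, allocation is forwarded to the general-purpose allocator, and regionfree releases every recorded object. This is my illustration of the idea, not the study's actual emulation harness.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Region emulator sketch: forwards to the general-purpose allocator,
    // plus bookkeeping so the region can be freed all at once.
    struct EmulatedRegion {
        std::vector<void*> objects;   // every pointer returned by regionmalloc
    };

    void* regionmalloc(EmulatedRegion& r, std::size_t sz) {
        void* p = std::malloc(sz);    // the general-purpose allocator does the real work
        r.objects.push_back(p);       // bookkeeping for regionfree
        return p;
    }

    void regionfree(EmulatedRegion& r) {
        for (void* p : r.objects)     // free all associated objects
            std::free(p);
        r.objects.clear();
    }

    int main() {
        EmulatedRegion r;
        for (int i = 0; i < 100; ++i) regionmalloc(r, 32);
        regionfree(r);                // emulates deleting the whole region
        return 0;
    }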
• 84. Use Custom Allocators?
    Strongly recommended by practitioners
    Little hard data on performance/space improvements
    Only one previous study [Zorn 1992]
      Focused on just one type of allocator
      Custom allocators: a waste of time (small gains, bad allocators)
    Are different allocators better? What are the trade-offs?
• 85. Kinds of Custom Allocators
    Three basic types of custom allocators:
      Per-class: fast
      Custom patterns: fast, but very special-purpose
      Regions: fast, possibly more space-efficient, convenient
        Variants: nested regions, obstacks
• 86. Optimization Opportunity
    [Bar chart: Time Spent in Memory Operations. Percentage of runtime spent in memory
    operations vs. other work for each benchmark (lcc, mudlle, boxed-sim, 176.gcc, c-breeze,
    175.vpr, 197.parser, apache) and their average.]
• 88. Custom Memory Allocation
    Programmers often replace malloc/free to:
      Attempt to increase performance
      Provide extra functionality (e.g., for servers)
      Reduce space (rarely)
    Empirical study of custom allocators [OOPSLA 2002]:
      The Lea allocator is often as fast or faster
      Custom allocation is ineffective, except for regions