SlideShare uma empresa Scribd logo
1 de 4
Baixar para ler offline
High-Performance Operating System Controlled Memory
                      Compression
                                    Lei Yang†                           Haris Lekatsas‡                   Robert P. Dick†
             †(l-yang,dickrp)@ece.northwestern.edu                                                      ‡lekatsas@nec-labs.com
                        Northwestern University                                                        NEC Laboratories America
                          Evanston, IL 60208                                                             Princeton, NJ 08540

ABSTRACT                                                                            able to applications without changing applications, without chang-
                                                                                    ing hardware, and without significant performance degradation, for
This article describes a new software-based on-line memory com-                     all realistic embedded applications. In this article, we describe two
pression algorithm for embedded systems and presents a method                       techniques to further improve the performance of on-line software-
of adaptively managing the uncompressed and compressed mem-                         based memory compression.
ory regions during application execution. The primary goal of this                     We present a very fast, high-quality compression algorithm for
work is to save memory in disk-less embedded systems, resulting in                  working data set pages. This algorithm, named pattern-based par-
greater functionality, smaller size, and lower overall cost, without                tial match (PBPM), explores frequent patterns that occur within
modifying applications or hardware. In comparison with algorithms                   each word of memory and takes advantage of the similarities among
that are commonly used in on-line memory compression, our new                       words by keeping a small two-way, hashed, set associative dictio-
algorithm has a comparable compression ratio but is twice as fast.                  nary that is managed with a least-recently used (LRU) replacement
The adaptive memory management scheme effectively responds to                       policy. PBPM has a compression ratio that is competitive with the
the predicted needs of applications and prevents on-line memory                     best compression algorithms of the Lempel-Ziv family [2] while
compression deadlock, permitting reliable and efficient compres-                     exhibiting lower run-time and memory overhead.
sion for a wide range of applications. We have evaluated our tech-                     In addition, we present an adaptive compressed memory manage-
nique on an embedded portable device and have found that the                        ment technique that predictively allocates memory for compressed
memory available to applications can be increased by 150%, al-                      data. Experimental results show that our new pre-allocation method
lowing the execution of applications with larger working data sets,                 is able to further increase available memory to applications by up
or allowing existing applications to run with less physical memory.                 to 13% compared to the same system without pre-allocation.
Categories and Subject Descriptors: C.3 [Special-Purpose and
Application-Based Systems]: Real-time and embedded systems                          2. Related work
General Terms: Algorithms, Management, Performance                                  A number of previous approaches incorporated compression into
Keywords: Virtual memory, Compression                                               the memory hierarchy for different goals. Main memory compres-
                                                                                    sion techniques [3] insert a hardware compression/decompression
1. Introduction                                                                     unit between cache and RAM. Data is stored uncompressed in cache,
Every year, embedded system designers pack more functionality                       and compressed on-the-fly when transferred to memory. Code com-
into less space, while meeting tight performance and power con-                     pression techniques [4] store instructions in compressed format and
sumption constraints. Software running on some embedded de-                         decompress them during execution. Compression is usually done
vices, such as cellular phones and PDAs, is becoming increasingly                   off-line and can be slow, while decompression is done during exe-
complicated, requiring faster processors and more memory to exe-                    cution, usually by special hardware, and must be very fast.
cute. In this work, we propose a method of increasing functionality                     Compressed caching [5,6] introduces a software cache to the vir-
without changes to hardware or applications by making better use                    tual memory system that uses part of the memory to store data in
of physical memory via on-line memory compression.                                  compressed format. Swap Compression [7,8] compresses swapped
   Our existing memory compression framework, CRAMES [1],                           pages and stores them in a memory region that acts as a cache be-
takes advantage of an operating system’s (OS’s) virtual memory                      tween memory and disk. The primary objective of both techniques
infrastructure by storing swapped-out pages in compressed format.                   is to improve system performance by decreasing the number of page
It dynamically adjusts the size of the compressed RAM area, pro-                    faults that must be serviced by hard disks. Both techniques require
tecting applications capable of running without it from performance                 a backing store, i.e., hard disks, when the compressed cache is filled
or energy consumption penalties. This memory compression tech-                      up. Unlike hardware-based main memory compression techniques,
nique has been implemented as a loadable module for the Linux                       neither compressed caching nor swap compression is able to com-
kernel and evaluated on a battery-powered embedded system. Ex-                      press code in memory.
perimental results indicate that it is capable of doubling the amount                   IBM’s MXT technology [3] used a hardware parallelized deriva-
of RAM available to applications. When used for this purpose, it                    tive of LZ77 [2]. Kjelso, et al. [9] designed X-Match, a hardware-
has little or no impact on the execution times and energy consump-                  based dictionary coding algorithm. Wilson [6] proposed WKdm, a
tions of applications capable of running without CRAMES.                            software-based dictionary coding algorithm. Rizzo [7] presented a
   Although this version of CRAMES that used existing compres-                      software-based algorithm that compresses in-RAM data by exploit-
sion algorithms greatly increased usable memory for a wide range                    ing the high frequency of zero-valued data.
of applications, using it to cut physical memory to 40% increased
application execution time by 29%, in the worst case. This con-                     3. Overview of CRAMES
flicted with our goal of dramatically increasing the memory avail-                   This section briefly introduces the design and implementation of
                                                                                    CRAMES to provide readers a context for the techniques presented
This work is supported in part by NEC Laboratories America and in part by the NSF   in this paper. To increase available memory to embedded systems,
under award CNS-0347941.
                                                                                    CRAMES selectively compresses data pages when the working data
                                                                                    sets of processes exceed physical RAM capacity. When data in a
Permission to make digital or hard copies of all or part of this work for           compressed page is later required by a process, CRAMES locates
personal or classroom use is granted without fee provided that copies are           that page, decompresses it, and copies it back to the main mem-
not made or distributed for profit or commercial advantage and that copies           ory working area, allowing the process to continue executing. In
bear this notice and the full citation on the first page. To copy otherwise, to      order to minimize performance and energy consumption impact,
republish, to post on servers or to redistribute to lists, requires prior specific   CRAMES takes advantage of the OS virtual memory swapping
permission and/or a fee.                                                            mechanism to decide which data pages to compress and when to
DAC 2006, July 24–28, 2006, San Francisco, California, USA.                         perform compression and decompression. CRAMES requires an
Copyright 2006 ACM 1-59593-381-6/06/0007 ...$5.00.                                  MMU. However, no other special-purpose hardware is required.
100%
                                                                                                                                                          xmxx
                                                                                                                                                                   Code      Pattern      Output            Size (bits)    Frequency
                 90%
                                                                                                                                                          xzxz     00        zzzz         00                    2            38.0%
                                                                                                                                                          zxxz     01        xxxx         01BBBB                34           21.6%
                 80%                                                                                                                                      xzzx
                                                                                                                                                                   10        mmmm         10bbbb                6            11.2%
                                                                                                                                                          zxzz
                 70%                                                                                                                                      zzxz
                                                                                                                                                                   1100      zzzx         1100B                 12            9.3%
                                                                                                                                                                                                                24            8.9%
  Percentage %




                                                                                                                                                          xzzz     1101      mmxx         1101bbbbBB
                 60%
                                                                                                                                                          xxzz     1110      mmmx         1110bbbbB             16            7.7%
                 50%
                                                                                                                                                          xxzx
                                                                                                                                                                   1111      zxzx         1111BB                20            3.1%
                                                                                                                                                          xzxx

                 40%                                                                                                                                      xxxz


                 30%
                                                                                                                                                          zxxx
                                                                                                                                                          zzxx
                                                                                                                                                                                Table 1: Pattern encoding in PBPM
                                                                                                                                                          zxzx
                 20%                                                                                                                                      mmmx   encountered during memory compression. Numerical values are of-
                 10%
                                                                                                                                                          mmxx   ten small enough to be stored in 4, 8, or 16 bits, but are normally
                                                                                                                                                          mmmm
                                                                                                                                                          zzzx
                                                                                                                                                                 stored in full 32-bit words. Furthermore, numerical values tend to
                  0%
                                                                                                                                                          xxxx   be similar to other values in nearby locations. Likewise, pointers
                                                                  )
                                )

                                        )
                       1)




                                                1)




                                                                          1)

                                                                                   2)

                                                                                            4)


                                                                                                       )

                                                                                                               2)




                                                                                                                                                     4)
                                                         2)




                                                                                                                        4)


                                                                                                                                   )

                                                                                                                                             )
                             ,2




                                                               ,4




                                                                                                                                                                 often point to adjacent objects in memory, or are similar to other
                                     ,4




                                                                                                    ,1




                                                                                                                                ,1

                                                                                                                                          ,2
                    6,




                                                                                2,




                                                                                                                                                  4,
                                             2,

                                                      6,




                                                                       4,




                                                                                         6,




                                                                                                            4,

                                                                                                                     2,
                                                                                                                                                          zzzz
                            (8

                                    (4




                                                              (8




                                                                                                 28




                                                                                                                             56

                                                                                                                                       28
                   (1




                                            (3

                                                     (1




                                                                      (6

                                                                               (3

                                                                                        (1




                                                                                                           (6

                                                                                                                    (3




                                                                                                                                                 (6
                                                                                             (1




                                                                                                                         (2

                                                                                                                                   (1
                                                     Dictionary layout (size, level)                                                                             pointers in nearby locations.
                                                                                                                                                                    In order to develop a reasonable set of frequent patterns, we ex-
                                 Figure 1: Frequent pattern histogram                                                                                            perimented with a 64 MB swap data file from a workstation running
                                                                                                                                                                 SuSE Linux 9.0. Various applications were executed to exhaust
    To ensure good performance, the compression algorithm used in                                                                                                physical memory and trigger swapping. Figure 1 shows the relative
CRAMES must be highly efficient at both compression and decom-                                                                                                    frequencies of patterns we evaluated. Below we specify the conven-
pression. It must also provide a good compression ratio, yet require                                                                                             tions in describing the data and patterns, as well as the dictionary
little working memory. In addition, CRAMES must efficiently or-                                                                                                   management scheme we considered.
ganize the compressed swap device to enable fast compressed page                                                                                                    We consider each 32-bit word (four bytes) as an input, and rep-
access and minimal memory waste. CRAMES dynamically ad-                                                                                                          resent them with four symbols, each of which represents a byte. A
justs the size of the compressed area during operation based on the                                                                                              ‘z’ represents a zero byte, an ‘x’ represents an arbitrary byte, and an
amount of memory required so that applications capable of running                                                                                                ‘m’ represents a byte that matches with a dictionary entry. Follow-
without memory compression do not suffer performance or energy                                                                                                   ing this convention, ‘zzzz’ indicates an all-zero word, while ‘mmmx’
consumption penalties as a result of its use. In addition to data set                                                                                            indicates a partial match with one dictionary entry for which only
compression, CRAMES also supports compression of arbitrary in-                                                                                                   the lowest byte differs.
RAM filesystems.                                                                                                                                                     To allow fast search and update operations, we maintain a hash-
    CRAMES has been implemented as a loadable Linux kernel mod-                                                                                                  mapped dictionary. More specifically, the third byte of a word is
ule for maximum portability and modularity, however it can easily                                                                                                hash-mapped to a 256 entry hash table, the contents of which are
be ported to other modern OSs. The module was evaluated on a                                                                                                     random indices that are within the range of the dictionary. Based
next generation smartphone prototype and a battery-powered PDA                                                                                                   on this hash function, we only need to consider four match patterns:
using well-known batch benchmarks as well as interactive applica-                                                                                                ‘mmmm’ (full match), ‘mmmx’ (highest three bytes match), ‘mmxx’
tions with graphical user interfaces (GUIs). The results show that                                                                                               (highest two bytes match), and ‘xmxx’ (only the third byte matches).
CRAMES is capable of dramatically increasing memory capacity                                                                                                     Note that neither the hash table nor the dictionary need be stored
with small performance and power consumption costs.                                                                                                              with the compressed data. The hash table is static and the dynamic
                                                                                                                                                                 dictionary is regenerated automatically during decompression.
4. Pattern-based partial match compression                                                                                                                          We experimented with different dictionary sizes/layouts, e.g., 16-
The original implementation of CRAMES used the LZO [10] algo-                                                                                                    entry direct-mapped and 32-entry two-way set associative, etc. A
rithm to compress data in memory. LZO is significantly faster than                                                                                                direct hash-mapped dictionary has the advantage of supporting fast
many other general-purpose compression algorithms, e.g., LZW se-                                                                                                 search and update: only a single hashing operation and lookup are
ries algorithms, gzip, and bzip2. However, it is not designed for                                                                                                required per access. However, it has tightly limited memory. For
memory compression and therefore does not fully exploit the regu-                                                                                                each hash target, only the most recently observed word is remem-
larities of in-RAM data. In addition, LZO requires 64 KB of work-                                                                                                bered; the victim to be replaced is decided entirely by its hash tar-
ing memory, a significant overhead on many memory-constrained                                                                                                     get. In contrast, if a dictionary is maintained with move-to-front
embedded systems. In summary, LZO is a general-purpose algo-                                                                                                     strategy, its LRU entry is selected as the victim. Unfortunately,
rithm with good compression ratio and performance. However, bet-                                                                                                 searching in such a dictionary is slow. A set associative dictionary
ter results are possible for the on-line memory compression appli-                                                                                               can enjoy the benefits of both LRU replacement and speed. When
cation. In this paper, we analyze the regularities of in-RAM data                                                                                                a search miss followed by a dictionary update occurs, the oldest of
and describe a new algorithm, named PBPM, that is extremely fast                                                                                                 the dictionary entries sharing one hash target index is replaced.
and well-suited for memory compression.                                                                                                                             As Figure 1 illustrates, zero words, ‘zzzz’, are the most frequent
   The PBPM algorithm is based on the observation that frequently-                                                                                               compressible pattern (38%), followed by one byte positive sign-
encountered data patterns can be encoded with fewer bits to save                                                                                                 extended words ‘zzzx’ (9.3%). ‘zxzx’ has a frequency of 2.8%.
space. Scanning through the input data a word (32 bits) at a time,                                                                                               Other zero-related patterns are infrequent. As the dictionary size
PBPM exploits patterns that occur frequently within each word of                                                                                                 increases, dictionary match (including partial match) frequencies
memory and searches for complete and partial matches with dic-                                                                                                   do not increase much. While a set associative dictionary usually
tionary entries to take advantage of the similarities among words.                                                                                               generates more matches than a direct hash-mapped dictionary with
More specifically, (1) some patterns that are very frequent are en-                                                                                               the same overall size, a four-way set associative dictionary works
coded using special bit sequences that are much shorter than the                                                                                                 no better than a two-way set associative dictionary.
original data, (2) patterns that do not fall into the above category                                                                                             4.2. The PBPM compression algorithm
and are found in a small dictionary are encoded using the index of
their location in dictionary, and finally (3) patterns that do not fre-                                                                                           The PBPM compression and decompression algorithms are pre-
quently occur and cannot be found in the dictionary are stored in                                                                                                sented in Algorithm 1. PBPM maintains a small two-way set asso-
dictionary while the word contents are output.                                                                                                                   ciative dictionary DICT[] of 16 recently-seen words. An incoming
                                                                                                                                                                 word can fully match a dictionary entry, or match only the high-
4.1. In-RAM data patterns                                                                                                                                        est three bytes or two bytes of a dictionary entry. These patterns
Unlike general-purpose algorithms designed for text data, a special-                                                                                             occurred frequently during swap trace analysis. The patterns and
purpose algorithm designed for in-RAM data compression must                                                                                                      coding schemes are summarized in Table 1, which also reports the
fully exploit the regularities present in memory. In-RAM data fre-                                                                                               actual frequency of each pattern observed in our swap data file when
quently have certain patterns. For example, pages are usually zero-                                                                                              other infrequent patterns are ignored. In Algorithm 1 and column
filled after being allocated. Therefore, runs of zeroes are commonly                                                                                              ‘Output’ of Table 1, ‘B’ represents a byte and ‘b’ represents a bit.
0.46                                                1.2e-004                                                  4.0e-005
                                      lzo                                                     lzo                                                       lzo                                Total memory allocated                        Average execution time
                                   pbpm                                                    pbpm                                        3.5e-005      pbpm




                                                                                                              Decompression Time (s)
                         0.44                                                1.0e-004                                                                                                 40                        38                   9              8.39




                                                      Compression Time (s)
                                                                                                                                                                                                                                                            8.25
                                    lzrw                                                    lzrw                                                      lzrw
     Compression Ratio
                         0.42                                                                                                          3.0e-005                                       36              33                             8
                                                                             8.0e-005                                                                                                 32                                             7
                                                                                                                                       2.5e-005
                         0.40                                                                                                                                                         28




                                                                                                                                                                                                                     Time (second)
                                                                                                                                                                        Memory (MB)
                                                                                                                                                                                                                                     6
                                                                             6.0e-005                                                  2.0e-005                                       24
                         0.38                                                                                                                                                                                                        5     4.47
                                                                                                                                       1.5e-005                                       20
                                                                                                                                                                                              16
                                                                             4.0e-005                                                                                                 16
                                                                                                                                                                                                                                     4
                         0.36                                                                                                          1.0e-005                                                                                      3
                                                                                                                                                                                      12
                         0.34                                                2.0e-005                                                                                                                                                2
                                                                                                                                       5.0e-006                                        8
                                                                                                                                                                                       4                                             1
                         0.32                                                0.0e+000                                                  0.0e+000
                                2048 4096 6144 8192                                     2048 4096 6144 8192                                       2048 4096 6144 8192                  0                                             0
                                                                                                                                                                                             None   CRAMES A−CRAMES                        None   CRAMES A−CRAMES
                                 Block Size (bytes)                                      Block Size (bytes)                                        Block Size (bytes)


                          Figure 2: Compression ratios and speeds of PBPM, LZO, and LZRW                                                                                              Figure 3: Performance of A-CRAMES
Algorithm 1 PBPM (a) compression and (b) decompression                                                                                               We propose the following scheme to prevent on-line memory
                                                                                                                                                  compression deadlock. CRAMES monitors the compressed area
Require: IN, OUT word stream                                                   Require: IN, OUT word stream                                       utilization and requests the allocation of a new memory chunk based
Require: TAPE, INDX bit stream                                                 Require: TAPE, INDX bit stream
Require: DATA byte stream                                                      Require: DATA byte stream                                          on the saturation of current memory chunks. When the total amount
 1: for word in range of IN do                                                  1: Unpack(OUT)                                                    of memory in the compressed area is above a predefined fill ratio,
 2: if word = zzzz then                                                         2: for code in range of TAPE do                                   CRAMES requests a new chunk from kernel. This request may
 3:      TAPE       00                                                          3: if code = 00 then                                              also be denied if the system memory is dangerously low. How-
 4: else if word = zzzx then                                                    4:      OUT      zzzz
 5:      TAPE       1100                                                        5: else if code = 1100 then                                       ever, even if this first request is denied, subsequent invocations
 6:      DATA B                                                                 6:      B DATA
 7: else if word = zxzx then                                                    7:      OUT      zzzB
                                                                                                                                                  of CRAMES will generate additional requests. After low-memory
 8:      TAPE       1111                                                        8: else if code = 1111 then                                       conditions cause applications to swap out pages to the compressed
 9:      DATA BB                                                                9:      BB DATA
10: else                                                                       10:       OUT      zBzB                                            RAM device, more memory will be available and the preemptive
11:       mmmm DICT[hash(word)]                                                11: else if code = 10 then                                         compressed RAM device allocation requests will finally be suc-
12:       if word = mmmm then                                                  12:       bbbb INDX                                                cessful. We have experimented with different fill ratios and found
13:          TAPE      10                                                      13:       OUT      DICT[bbbb]
14:          INDX      bbbb                                                    14: else if code = 1110 then                                       that 7/8 is sufficient to permit successful requests of compressed
15:       else if word = mmmx then                                             15:       bbbb INDX                                                RAM device memory allocation in all tested applications. This ra-
16:          TAPE      1110                                                    16:       mmmm DICT[bbbb]
17:          INDX      bbbb                                                    17:       B DATA                                                   tio also results in little memory waste. Evaluation of this method
18:          DATA      B                                                       18:       OUT      mmmB                                            on a portable embedded system is presented in Section 6.2.
19:          Insert word to DICT                                               19:       Insert mmmB to DICT
20:       else if word = mmxx then                                             20: else if code = 1101 then
21:          TAPE      1101                                                    21:       bbbb INDX
22:          INDX      bbbb                                                    22:       mmmm DICT[bbbb]                                          6. Evaluation
23:          DATA      BB                                                      23:       BB DATA
24:          Insert word to DICT                                               24:       OUT      mmBB                                            In this section, we describe the evaluation methodology and results
25:       else                                                                 25:       Insert mmBB to DICT
26:          TAPE      01                                                      26: else if code = 01 then                                         of the techniques proposed for high-performance on-line memory
27:          DATA      BBBB                                                    27:       BBBB DATA                                                compression. More specifically, the following questions are exper-
28:          Insert word to DICT                                               28:       OUT      BBBB
29:       end if                                                               29:       Insert BBBB to DICT                                      imentally evaluated:
30: end if                                                                     30: end if
31: end for                                                                    31: end for                                                        1. Does PBPM provide a comparable compression ratio yet have
32: OUT Pack(TAPE,DATA,INDX)                                                                                                                      even lower performance costs than existing algorithms?
                                                                                                                                                  2. Does adaptive memory management enable CRAMES to pro-
5. Adaptive compressed memory management                                                                                                          vide more memory to applications when system RAM is tightly
                                                                                                                                                  constrained?
As described in Section 3, CRAMES compresses the swapped-out                                                                                      3. What is the overall performance of CRAMES using PBPM and
data a page (4,096 bytes) at a time and stores them in a special com-                                                                             adaptive memory management?
pressed RAM device. Upon initialization, the compressed RAM
device only requests a small memory chunk (usually 64 KB) and re-                                                                                    To evaluate the proposed techniques, we used a Sharp Zaurus SL-
quests additional memory chunks as system memory requirements                                                                                     5600 PDA. This battery-powered embedded system runs an embed-
grow. Therefore, the size of a compressed RAM device dynam-                                                                                       ded version of Linux, has a 400 MHz Intel XScale PXA250 proces-
ically adjusts during operation, adapting to the data memory re-                                                                                  sor, 32 MB of flash memory, and 32 MB of RAM. In our current
quirements of the currently running applications.                                                                                                 system configuration, 12 MB of RAM are used for uncompressed,
                                                                                                                                                  battery-backed filesystem storage and 20 MB are available to kernel
   The above dynamic memory allocation strategy works well when                                                                                   and user applications.
the system is not under extreme memory pressure. However, it suf-
fers from a performance penalty when the system is dangerously                                                                                    6.1. Quality and speed of the PBPM algorithm
low on memory: applications and CRAMES compete for remain-                                                                                        We evaluated the compression ratio and speed of the PBPM algo-
ing physical memory and there is no guarantee that an allocation                                                                                  rithm compared to two other compression algorithms that have been
request will be satisfied. If physical memory is nearly exhausted,                                                                                 used for on-line memory compression: LZO and LZRW. Figure 2
an application may be unable to allocate additional memory. If                                                                                    illustrates the compression ratios (compressed block size divided
CRAMES were able to allocate even a page of physical memory
for its compressed memory region, it would be able to swap out                                                                                    by original block size) and execution times of evaluated algorithms.
                                                                                                                                                  For these comparisons, the source file for compression is the swap
(on average) two pages, allowing the application to proceed. How-                                                                                 data file (divided into uniform-sized blocks) used to identify the
ever, CRAMES is in contention with the application, resulting in a                                                                                frequent patterns in memory. The evaluation was performed on
scenario we call on-line memory compression deadlock.
                                                                                                                                                  a Linux Workstation with a 2.40 GHz Intel Pentium 4 processor.
   To avoid on-line memory compression deadlock, a compressed                                                                                     Note that OS-controlled on-line memory compression is a symmet-
RAM device needs to be predictive during its requests for addi-                                                                                   ric application, i.e., a memory page is decompressed exactly once
tional memory, i.e., it cannot wait until no existing chunks can al-                                                                              every time it is compressed. Therefore, the overall, symmetric, per-
locate a fit slot for the incoming data. This subtle issue comprises a                                                                             formance of a compression algorithm is the critical performance
difference between our technique and compressed caching as well                                                                                   metric. Overall, PBPM achieves a 200% speedup over LZO and
as swap compression, in which hard drives serve as backing store                                                                                  LZRW. Yet the compression ratio achieved by PBPM is comparable
to which pages can be moved as soon as (or even earlier than) the                                                                                 with that of LZO and LZRW. We believe that PBPM is especially
compressed area is stuffed, so that the compressed memory is al-                                                                                  suitable for on-line memory compression because of its extremely
ways available to applications.                                                                                                                   fast symmetric compression and good compression ratio.
Benchmark Description        RAM             Adpcm                  Jpeg                        Mpeg2                      Matrix Mul.
                                    (MB)   w.o.     LZO     PBPM   w.o.   LZO     PBPM        w.o.     LZO      PBPM      w.o.      LZO       PBPM
   Adpcm: Speech compression                                                    Execution Time (s)
   Source: MediaBench                8     4.83      1.69   1.43   0.71   0.26     0.23      79.35     80.30    77.96     n.a       39.26    38.68
   Data size: 24 KB                  9     3.69      1.35   1.26   0.44   0.21     0.21      76.80     76.83    74.04     n.a       37.40    38.24
                                     10    1.41      1.34   1.36   0.23   0.21     0.21      79.06     76.93    75.32    59.11      39.56    37.18
   Code size: 4 KB                   11    1.37      1.40   1.40   0.26   0.25     0.21      80.57     76.81    76.83    44.44      38.42    42.65
                                     12    1.37      1.31   1.32   0.24   0.21     0.19      76.79     76.94    76.95    41.72      38.73    43.96
   Jpeg: Image encoding              20    1.31      1.30   1.30   0.23   0.21     0.22      76.60     76.77    76.76    43.02      41.41    42.97
   Source: MediaBench                                                         Power Consumption (W)
   Data size: 176 KB                 8     2.13      2.13   2.13   2.15   2.16     2.15       2.41      2.41     2.51      n.a       2.26    2.29
   Code size: 72 KB                  9     2.10      2.10   2.13   2.15   2.02     2.07       2.41      2.40     2.50      n.a       2.26    2.29
                                     10    2.09      2.10   2.09   2.00   1.99     2.04       2.39      2.40     2.48     2.24       2.25    2.29
   Mpeg2: Video CODEC                11    2.12      2.09   2.13   2.05   2.04     2.07       2.40      2.40     2.50     2.26       2.25    2.29
   Source: MediaBench                12    2.09      2.13   2.11   2.03   2.05     2.10       2.40      2.41     2.55     2.25       2.25    2.29
   Data size: 416 KB                 20    2.11      2.09   2.18   2.15   2.02     2.24       2.42      2.43     2.57     2.28       2.27    2.29
   Code size: 48 KB                                                           Energy Consumption (J)
                                     8     10.34     3.60   3.04   1.51   0.56     0.49     190.99     193.42   195.71     n.a      88.74    88.62
   Matrix Mul.: 512 by 512 matrix    9     7.75      2.84   2.68   0.94   0.42     0.43     185.38     184.55   185.10     n.a      84.70    87.64
   multiplication                    10    2.94      2.79   2.85   0.47   0.42     0.42     188.62     184.34   186.42   131.05     88.99    85.01
   Data size: 2948 KB                11    2.89      2.93   2.97   0.54   0.52     0.44     193.10     184.69   191.94   100.01     86.38    97.79
                                     12    2.86      2.79   2.79   0.49   0.43     0.41     184.45     185.74   196.33   93.65      86.94   100.81
   Code size: 4 KB                   20    2.75      2.72   2.82   0.48   0.43     0.49     185.72     186.56   197.26   98.27      94.07    98.39


                                                  Figure 4: Overall performance of CRAMES
6.2. Effectiveness of adaptive memory management                          resents a substantial improvement over LZO, for which the aver-
In order to determine the effectiveness of adaptive memory man-           age performance penalty is 9.5% and the worst-case performance
agement in providing more memory to applications under signifi-            penalty can be as high as 29%.
cant memory pressure, we designed the following experiments. We
wrote a ‘memeater’ program that continuously requests 1 MB of             7. Conclusions and acknowledgments
memory at a time and fills the allocated memory with random num-           High-performance OS controlled memory compression can assist
bers (including zero runs with similar frequency to that observed         embedded system designers to optimize hardware design for typi-
in real swap traces), until an allocation request fails. Memeater         cal software memory requirements while also supporting (sets of)
was then executed on Zaurus under three different system settings:        applications with larger data sets. In this paper, we proposed and
without using CRAMES (none), using CRAMES without adap-                   evaluated a fast software-based compression algorithm for use in
tive memory management (CRAMES), and using CRAMES with                    this application. This algorithm provides comparable compression
adaptive memory management (A-CRAMES). Figure 3 presents                  ratios to existing algorithms used in on-line memory compression
the total memory allocated and average execution times under the          with significantly better symmetric performance. We also presented
three system settings (memeater was executed multiple times un-           an adaptive compressed memory management scheme to prevent
der each setting). Without CRAMES, the system was only able to            on-line memory compression deadlock, permitting reliable and ef-
provide 16 MB of memory to memeater. With CRAMES, 33 MB                   ficient on-line memory compression for a wide range of applica-
of memory were provided and the execution time was proportional           tions. Experimental results indicate that the improved CRAMES
to the amount of memory allocated, i.e., no delay was observed.           allows applications to execute with only slight penalties even when
Furthermore, when adaptive memory management is enabled (A-               system RAM is reduced to 40% of its original size.
CRAMES), 38 MB of memory were allocated with no additional                   We would like to acknowledge Dr. Srimat Chakradhar of NEC
cost. These results support our claim in Section 5 that A-CRAMES          Laboratories America for his support and technical advice. We
helps to prevent on-line data compression deadlock.                       would also like to acknowledge Hui Ding of Northwestern Uni-
6.3. Overall performance of the improved CRAMES                           versity for his suggestions on efficiently implementing PBPM.
Embedded system with high-performance on-line memory com-                 8. References
pression can be designed with less RAM and still support desired
applications. In order to evaluate the impact of using CRAMES              [1] L. Yang, et al., “CRAMES: Compressed RAM for embedded
to reduce physical RAM, we artificially constrained the memory                  systems,” in Proc. Int. Conf. Hardware/Software Codesign and
size of a Zaurus with a kernel module that permanently reserves                System Synthesis, Sept. 2005.
a certain amount of physical memory. The memory allocated to               [2] J. Ziv and A. Lempel, “A universal algorithm for sequential data
a kernel module cannot be swapped out and therefore is not com-                compression,” IEEE Transactions on Information Theory, vol. 23,
pressed by CRAMES. This guarantees the fairness of our compari-                no. 3, pp. 337–343, 1977.
son. With reduced physical RAM, we measured and compared the               [3] B. Tremaine, et al., “IBM memory expansion technology,” IBM
run times, power consumptions, and energy consumptions of four                 Journal of Research and Development, vol. 45, no. 2, Mar. 2001.
                                                                           [4] H. Lekatsas, J. Henkel, and W. Wolf, “Code compression for low
batch benchmarks, i.e., three applications from MediaBench [11]                power embedded system design,” in Proc. Design Automation Conf.,
(Adpcm, Jpeg, Mpeg2) and a 512 by 512 matrix multiplication ap-                June 2000, pp. 294–299.
plication. Figure 4 shows execution times, power consumptions,             [5] F. Douglis, “The compression cache: Using on-line compression to
and energy consumptions of benchmarks running without compres-                 extend physical memory,” in Proc. USENIX Conf., Jan. 1993.
sion, with LZO compression, and with PBPM compression under                [6] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis, “The case for
different memory constraints. Note that adaptive memory manage-                compressed caching in virtual memory systems,” in Proc. USENIX
ment was enabled in both LZO and PBPM compression to ensure                    Conf., 1999, pp. 101–116.
fair comparison. In our experiments, each benchmark was executed           [7] L. Rizzo, “A very fast algorithm for RAM compression,” Operating
multiple times; the average results are reported.                              Systems Review, vol. 31, no. 2, pp. 36–45, Apr. 1997.
   As shown in Figure 4, when system RAM was reduced to 8 MB,              [8] I. C. Tuduce and T. Gross, “Adaptive main memory compression,” in
without CRAMES, all benchmarks suffered from significant perfor-                Proc. USENIX Conf., Apr. 2005.
                                                                           [9] M. Kjelso, M. Gooch, and S. Jones, “Performance evaluation of
mance degradation; the 512 by 512 matrix multiplication couldn’t               computer architectures with main memory data compression,” in J.
even execute due to memory constraints. However, with the help                 Systems Architecture, vol. 45, 1999, pp. 571–590.
of CRAMES, all benchmarks were able to execute with only slight           [10] “LZO real-time data compression library,”
performance and energy consumption penalties. Compared with                    http://www.oberhumer.com/opensource/lzo.
the base case in which system RAM is 20 MB and CRAMES is not              [11] C. Lee, M. Potkonjak, and W. H. M. Smith, “Mediabench: A tool for
used, PBPM compression results in an average performance penalty               evaluating and synthesizing multimedia and communications
of 2.1% and a worst-case performance penalty of 9.2%. This rep-                systems,” http://cares.icsl.ucla.edu/MediaBench.

Mais conteúdo relacionado

Mais procurados

151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...ijcsit
 
Distributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemDistributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemIJERA Editor
 
Machine Learning-Based Prefetch Optimization for Data Center ...
Machine Learning-Based Prefetch Optimization for Data Center ...Machine Learning-Based Prefetch Optimization for Data Center ...
Machine Learning-Based Prefetch Optimization for Data Center ...butest
 
Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...IAEME Publication
 
Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029marangburu42
 
Collaborative Memory Management for Reactive Sensor/Actor Systems
Collaborative Memory Management for Reactive Sensor/Actor SystemsCollaborative Memory Management for Reactive Sensor/Actor Systems
Collaborative Memory Management for Reactive Sensor/Actor SystemsSnehal Belsare Mokashi
 
ITT Project Information Technology Basic
ITT Project Information Technology BasicITT Project Information Technology Basic
ITT Project Information Technology BasicMayank Garg
 
Distributed system unit II according to syllabus of RGPV, Bhopal
Distributed system unit II according to syllabus of  RGPV, BhopalDistributed system unit II according to syllabus of  RGPV, Bhopal
Distributed system unit II according to syllabus of RGPV, BhopalNANDINI SHARMA
 
Built-in Self Repair for SRAM Array using Redundancy
Built-in Self Repair for SRAM Array using RedundancyBuilt-in Self Repair for SRAM Array using Redundancy
Built-in Self Repair for SRAM Array using RedundancyIDES Editor
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 

Mais procurados (19)

151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
151 A SURVEY OF DIFFERENT APPROACHES FOR OVERCOMING THE PROCESSOR-MEMORY BOTT...
 
Distributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemDistributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using Openshmem
 
Computer Systems
Computer SystemsComputer Systems
Computer Systems
 
hetero_pim
hetero_pimhetero_pim
hetero_pim
 
Database backup 110810
Database backup 110810Database backup 110810
Database backup 110810
 
os mod1 notes
 os mod1 notes os mod1 notes
os mod1 notes
 
Machine Learning-Based Prefetch Optimization for Data Center ...
Machine Learning-Based Prefetch Optimization for Data Center ...Machine Learning-Based Prefetch Optimization for Data Center ...
Machine Learning-Based Prefetch Optimization for Data Center ...
 
035
035035
035
 
Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...Nvram applications in the architectural revolutions of main memory implementa...
Nvram applications in the architectural revolutions of main memory implementa...
 
Sucet os module_5_notes
Sucet os module_5_notesSucet os module_5_notes
Sucet os module_5_notes
 
Open sap cst1_week_2_all_slides
Open sap cst1_week_2_all_slidesOpen sap cst1_week_2_all_slides
Open sap cst1_week_2_all_slides
 
Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029Mainmemoryfinal 161019122029
Mainmemoryfinal 161019122029
 
Main memoryfinal
Main memoryfinalMain memoryfinal
Main memoryfinal
 
Collaborative Memory Management for Reactive Sensor/Actor Systems
Collaborative Memory Management for Reactive Sensor/Actor SystemsCollaborative Memory Management for Reactive Sensor/Actor Systems
Collaborative Memory Management for Reactive Sensor/Actor Systems
 
ITT Project Information Technology Basic
ITT Project Information Technology BasicITT Project Information Technology Basic
ITT Project Information Technology Basic
 
Distributed system unit II according to syllabus of RGPV, Bhopal
Distributed system unit II according to syllabus of  RGPV, BhopalDistributed system unit II according to syllabus of  RGPV, Bhopal
Distributed system unit II according to syllabus of RGPV, Bhopal
 
Built-in Self Repair for SRAM Array using Redundancy
Built-in Self Repair for SRAM Array using RedundancyBuilt-in Self Repair for SRAM Array using Redundancy
Built-in Self Repair for SRAM Array using Redundancy
 
Lec2
Lec2Lec2
Lec2
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 

Destaque

Object and method exploration for embedded systems
Object and method exploration for embedded systemsObject and method exploration for embedded systems
Object and method exploration for embedded systemsMr. Chanuwan
 
Android Cloud to Device Messaging Framework at GTUG Stockholm
Android Cloud to Device Messaging Framework at GTUG StockholmAndroid Cloud to Device Messaging Framework at GTUG Stockholm
Android Cloud to Device Messaging Framework at GTUG StockholmJohan Nilsson
 
Analyzing memory usage and leaks
Analyzing memory usage and leaksAnalyzing memory usage and leaks
Analyzing memory usage and leaksRonnBlack
 
High-Performance Timing Simulation of Embedded Software
High-Performance Timing Simulation of Embedded SoftwareHigh-Performance Timing Simulation of Embedded Software
High-Performance Timing Simulation of Embedded SoftwareMr. Chanuwan
 
FOSS STHLM Android Cloud to Device Messaging
FOSS STHLM Android Cloud to Device MessagingFOSS STHLM Android Cloud to Device Messaging
FOSS STHLM Android Cloud to Device MessagingJohan Nilsson
 
Java lejos-multithreading
Java lejos-multithreadingJava lejos-multithreading
Java lejos-multithreadingMr. Chanuwan
 

Destaque (6)

Object and method exploration for embedded systems
Object and method exploration for embedded systemsObject and method exploration for embedded systems
Object and method exploration for embedded systems
 
Android Cloud to Device Messaging Framework at GTUG Stockholm
Android Cloud to Device Messaging Framework at GTUG StockholmAndroid Cloud to Device Messaging Framework at GTUG Stockholm
Android Cloud to Device Messaging Framework at GTUG Stockholm
 
Analyzing memory usage and leaks
Analyzing memory usage and leaksAnalyzing memory usage and leaks
Analyzing memory usage and leaks
 
High-Performance Timing Simulation of Embedded Software
High-Performance Timing Simulation of Embedded SoftwareHigh-Performance Timing Simulation of Embedded Software
High-Performance Timing Simulation of Embedded Software
 
FOSS STHLM Android Cloud to Device Messaging
FOSS STHLM Android Cloud to Device MessagingFOSS STHLM Android Cloud to Device Messaging
FOSS STHLM Android Cloud to Device Messaging
 
Java lejos-multithreading
Java lejos-multithreadingJava lejos-multithreading
Java lejos-multithreading
 

Semelhante a High performance operating system controlled memory compression

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Performance Tuning
Performance TuningPerformance Tuning
Performance TuningJannet Peetz
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep DivesRush Shah
 
Virtual Memory vs Cache Memory
Virtual Memory vs Cache MemoryVirtual Memory vs Cache Memory
Virtual Memory vs Cache MemoryAshik Iqbal
 
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)DataCore APAC
 
Distributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemDistributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemIJERA Editor
 
03 Data Recovery - Notes
03 Data Recovery - Notes03 Data Recovery - Notes
03 Data Recovery - NotesKranthi
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage ReductionPerforce
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageDataCore Software
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesKarthik Iyr
 
Real time database compression optimization using iterative length compressio...
Real time database compression optimization using iterative length compressio...Real time database compression optimization using iterative length compressio...
Real time database compression optimization using iterative length compressio...csandit
 
REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...
REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...
REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...cscpconf
 
CA assignment group.pptx
CA assignment group.pptxCA assignment group.pptx
CA assignment group.pptxHAIDERALICH3
 
Reliability and yield
Reliability and yield Reliability and yield
Reliability and yield rohitladdu
 
Symmetric multiprocessing and Microkernel
Symmetric multiprocessing and MicrokernelSymmetric multiprocessing and Microkernel
Symmetric multiprocessing and MicrokernelManoraj Pannerselum
 
Computer's clasification
Computer's clasificationComputer's clasification
Computer's clasificationMayraChF
 
Cache memory and cache
Cache memory and cacheCache memory and cache
Cache memory and cacheVISHAL DONGA
 

Semelhante a High performance operating system controlled memory compression (20)

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Performance Tuning
Performance TuningPerformance Tuning
Performance Tuning
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep Dives
 
Paging-R.D.Sivakumar
Paging-R.D.SivakumarPaging-R.D.Sivakumar
Paging-R.D.Sivakumar
 
Paging-R.D.Sivakumar
Paging-R.D.SivakumarPaging-R.D.Sivakumar
Paging-R.D.Sivakumar
 
Virtual Memory vs Cache Memory
Virtual Memory vs Cache MemoryVirtual Memory vs Cache Memory
Virtual Memory vs Cache Memory
 
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
Virtual SAN - A Deep Dive into Converged Storage (technical whitepaper)
 
Distributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemDistributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using Openshmem
 
03 Data Recovery - Notes
03 Data Recovery - Notes03 Data Recovery - Notes
03 Data Recovery - Notes
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction
 
Virtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged StorageVirtual SAN- Deep Dive Into Converged Storage
Virtual SAN- Deep Dive Into Converged Storage
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphones
 
Real time database compression optimization using iterative length compressio...
Real time database compression optimization using iterative length compressio...Real time database compression optimization using iterative length compressio...
Real time database compression optimization using iterative length compressio...
 
REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...
REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...
REAL TIME DATABASE COMPRESSION OPTIMIZATION USING ITERATIVE LENGTH COMPRESSIO...
 
CA assignment group.pptx
CA assignment group.pptxCA assignment group.pptx
CA assignment group.pptx
 
Reliability and yield
Reliability and yield Reliability and yield
Reliability and yield
 
Symmetric multiprocessing and Microkernel
Symmetric multiprocessing and MicrokernelSymmetric multiprocessing and Microkernel
Symmetric multiprocessing and Microkernel
 
PowerAlluxio
PowerAlluxioPowerAlluxio
PowerAlluxio
 
Computer's clasification
Computer's clasificationComputer's clasification
Computer's clasification
 
Cache memory and cache
Cache memory and cacheCache memory and cache
Cache memory and cache
 

Mais de Mr. Chanuwan

High level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesHigh level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesMr. Chanuwan
 
Runtime performance evaluation of embedded software
Runtime performance evaluation of embedded softwareRuntime performance evaluation of embedded software
Runtime performance evaluation of embedded softwareMr. Chanuwan
 
Performance testing based on time complexity analysis for embedded software
Performance testing based on time complexity analysis for embedded softwarePerformance testing based on time complexity analysis for embedded software
Performance testing based on time complexity analysis for embedded softwareMr. Chanuwan
 
Performance and memory profiling for embedded system design
Performance and memory profiling for embedded system designPerformance and memory profiling for embedded system design
Performance and memory profiling for embedded system designMr. Chanuwan
 
Application scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system designApplication scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system designMr. Chanuwan
 
Software Architectural low energy
Software Architectural low energySoftware Architectural low energy
Software Architectural low energyMr. Chanuwan
 
Software performance simulation strategies for high-level embedded system design
Software performance simulation strategies for high-level embedded system designSoftware performance simulation strategies for high-level embedded system design
Software performance simulation strategies for high-level embedded system designMr. Chanuwan
 
A system for performance evaluation of embedded software
A system for performance evaluation of embedded softwareA system for performance evaluation of embedded software
A system for performance evaluation of embedded softwareMr. Chanuwan
 
Embedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded softwareEmbedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded softwareMr. Chanuwan
 
Performance prediction for software architectures
Performance prediction for software architecturesPerformance prediction for software architectures
Performance prediction for software architecturesMr. Chanuwan
 
Model-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A SurveyModel-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A SurveyMr. Chanuwan
 

Mais de Mr. Chanuwan (11)

High level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesHigh level programming of embedded hard real-time devices
High level programming of embedded hard real-time devices
 
Runtime performance evaluation of embedded software
Runtime performance evaluation of embedded softwareRuntime performance evaluation of embedded software
Runtime performance evaluation of embedded software
 
Performance testing based on time complexity analysis for embedded software
Performance testing based on time complexity analysis for embedded softwarePerformance testing based on time complexity analysis for embedded software
Performance testing based on time complexity analysis for embedded software
 
Performance and memory profiling for embedded system design
Performance and memory profiling for embedded system designPerformance and memory profiling for embedded system design
Performance and memory profiling for embedded system design
 
Application scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system designApplication scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system design
 
Software Architectural low energy
Software Architectural low energySoftware Architectural low energy
Software Architectural low energy
 
Software performance simulation strategies for high-level embedded system design
Software performance simulation strategies for high-level embedded system designSoftware performance simulation strategies for high-level embedded system design
Software performance simulation strategies for high-level embedded system design
 
A system for performance evaluation of embedded software
A system for performance evaluation of embedded softwareA system for performance evaluation of embedded software
A system for performance evaluation of embedded software
 
Embedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded softwareEmbedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded software
 
Performance prediction for software architectures
Performance prediction for software architecturesPerformance prediction for software architectures
Performance prediction for software architectures
 
Model-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A SurveyModel-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A Survey
 

Último

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

High performance operating system controlled memory compression

  • 1. High-Performance Operating System Controlled Memory Compression Lei Yang† Haris Lekatsas‡ Robert P. Dick† †(l-yang,dickrp)@ece.northwestern.edu ‡lekatsas@nec-labs.com Northwestern University NEC Laboratories America Evanston, IL 60208 Princeton, NJ 08540 ABSTRACT able to applications without changing applications, without chang- ing hardware, and without significant performance degradation, for This article describes a new software-based on-line memory com- all realistic embedded applications. In this article, we describe two pression algorithm for embedded systems and presents a method techniques to further improve the performance of on-line software- of adaptively managing the uncompressed and compressed mem- based memory compression. ory regions during application execution. The primary goal of this We present a very fast, high-quality compression algorithm for work is to save memory in disk-less embedded systems, resulting in working data set pages. This algorithm, named pattern-based par- greater functionality, smaller size, and lower overall cost, without tial match (PBPM), explores frequent patterns that occur within modifying applications or hardware. In comparison with algorithms each word of memory and takes advantage of the similarities among that are commonly used in on-line memory compression, our new words by keeping a small two-way, hashed, set associative dictio- algorithm has a comparable compression ratio but is twice as fast. nary that is managed with a least-recently used (LRU) replacement The adaptive memory management scheme effectively responds to policy. PBPM has a compression ratio that is competitive with the the predicted needs of applications and prevents on-line memory best compression algorithms of the Lempel-Ziv family [2] while compression deadlock, permitting reliable and efficient compres- exhibiting lower run-time and memory overhead. sion for a wide range of applications. We have evaluated our tech- In addition, we present an adaptive compressed memory manage- nique on an embedded portable device and have found that the ment technique that predictively allocates memory for compressed memory available to applications can be increased by 150%, al- data. Experimental results show that our new pre-allocation method lowing the execution of applications with larger working data sets, is able to further increase available memory to applications by up or allowing existing applications to run with less physical memory. to 13% compared to the same system without pre-allocation. Categories and Subject Descriptors: C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems 2. Related work General Terms: Algorithms, Management, Performance A number of previous approaches incorporated compression into Keywords: Virtual memory, Compression the memory hierarchy for different goals. Main memory compres- sion techniques [3] insert a hardware compression/decompression 1. Introduction unit between cache and RAM. Data is stored uncompressed in cache, Every year, embedded system designers pack more functionality and compressed on-the-fly when transferred to memory. Code com- into less space, while meeting tight performance and power con- pression techniques [4] store instructions in compressed format and sumption constraints. Software running on some embedded de- decompress them during execution. Compression is usually done vices, such as cellular phones and PDAs, is becoming increasingly off-line and can be slow, while decompression is done during exe- complicated, requiring faster processors and more memory to exe- cution, usually by special hardware, and must be very fast. cute. In this work, we propose a method of increasing functionality Compressed caching [5,6] introduces a software cache to the vir- without changes to hardware or applications by making better use tual memory system that uses part of the memory to store data in of physical memory via on-line memory compression. compressed format. Swap Compression [7,8] compresses swapped Our existing memory compression framework, CRAMES [1], pages and stores them in a memory region that acts as a cache be- takes advantage of an operating system’s (OS’s) virtual memory tween memory and disk. The primary objective of both techniques infrastructure by storing swapped-out pages in compressed format. is to improve system performance by decreasing the number of page It dynamically adjusts the size of the compressed RAM area, pro- faults that must be serviced by hard disks. Both techniques require tecting applications capable of running without it from performance a backing store, i.e., hard disks, when the compressed cache is filled or energy consumption penalties. This memory compression tech- up. Unlike hardware-based main memory compression techniques, nique has been implemented as a loadable module for the Linux neither compressed caching nor swap compression is able to com- kernel and evaluated on a battery-powered embedded system. Ex- press code in memory. perimental results indicate that it is capable of doubling the amount IBM’s MXT technology [3] used a hardware parallelized deriva- of RAM available to applications. When used for this purpose, it tive of LZ77 [2]. Kjelso, et al. [9] designed X-Match, a hardware- has little or no impact on the execution times and energy consump- based dictionary coding algorithm. Wilson [6] proposed WKdm, a tions of applications capable of running without CRAMES. software-based dictionary coding algorithm. Rizzo [7] presented a Although this version of CRAMES that used existing compres- software-based algorithm that compresses in-RAM data by exploit- sion algorithms greatly increased usable memory for a wide range ing the high frequency of zero-valued data. of applications, using it to cut physical memory to 40% increased application execution time by 29%, in the worst case. This con- 3. Overview of CRAMES flicted with our goal of dramatically increasing the memory avail- This section briefly introduces the design and implementation of CRAMES to provide readers a context for the techniques presented This work is supported in part by NEC Laboratories America and in part by the NSF in this paper. To increase available memory to embedded systems, under award CNS-0347941. CRAMES selectively compresses data pages when the working data sets of processes exceed physical RAM capacity. When data in a Permission to make digital or hard copies of all or part of this work for compressed page is later required by a process, CRAMES locates personal or classroom use is granted without fee provided that copies are that page, decompresses it, and copies it back to the main mem- not made or distributed for profit or commercial advantage and that copies ory working area, allowing the process to continue executing. In bear this notice and the full citation on the first page. To copy otherwise, to order to minimize performance and energy consumption impact, republish, to post on servers or to redistribute to lists, requires prior specific CRAMES takes advantage of the OS virtual memory swapping permission and/or a fee. mechanism to decide which data pages to compress and when to DAC 2006, July 24–28, 2006, San Francisco, California, USA. perform compression and decompression. CRAMES requires an Copyright 2006 ACM 1-59593-381-6/06/0007 ...$5.00. MMU. However, no other special-purpose hardware is required.
  • 2. 100% xmxx Code Pattern Output Size (bits) Frequency 90% xzxz 00 zzzz 00 2 38.0% zxxz 01 xxxx 01BBBB 34 21.6% 80% xzzx 10 mmmm 10bbbb 6 11.2% zxzz 70% zzxz 1100 zzzx 1100B 12 9.3% 24 8.9% Percentage % xzzz 1101 mmxx 1101bbbbBB 60% xxzz 1110 mmmx 1110bbbbB 16 7.7% 50% xxzx 1111 zxzx 1111BB 20 3.1% xzxx 40% xxxz 30% zxxx zzxx Table 1: Pattern encoding in PBPM zxzx 20% mmmx encountered during memory compression. Numerical values are of- 10% mmxx ten small enough to be stored in 4, 8, or 16 bits, but are normally mmmm zzzx stored in full 32-bit words. Furthermore, numerical values tend to 0% xxxx be similar to other values in nearby locations. Likewise, pointers ) ) ) 1) 1) 1) 2) 4) ) 2) 4) 2) 4) ) ) ,2 ,4 often point to adjacent objects in memory, or are similar to other ,4 ,1 ,1 ,2 6, 2, 4, 2, 6, 4, 6, 4, 2, zzzz (8 (4 (8 28 56 28 (1 (3 (1 (6 (3 (1 (6 (3 (6 (1 (2 (1 Dictionary layout (size, level) pointers in nearby locations. In order to develop a reasonable set of frequent patterns, we ex- Figure 1: Frequent pattern histogram perimented with a 64 MB swap data file from a workstation running SuSE Linux 9.0. Various applications were executed to exhaust To ensure good performance, the compression algorithm used in physical memory and trigger swapping. Figure 1 shows the relative CRAMES must be highly efficient at both compression and decom- frequencies of patterns we evaluated. Below we specify the conven- pression. It must also provide a good compression ratio, yet require tions in describing the data and patterns, as well as the dictionary little working memory. In addition, CRAMES must efficiently or- management scheme we considered. ganize the compressed swap device to enable fast compressed page We consider each 32-bit word (four bytes) as an input, and rep- access and minimal memory waste. CRAMES dynamically ad- resent them with four symbols, each of which represents a byte. A justs the size of the compressed area during operation based on the ‘z’ represents a zero byte, an ‘x’ represents an arbitrary byte, and an amount of memory required so that applications capable of running ‘m’ represents a byte that matches with a dictionary entry. Follow- without memory compression do not suffer performance or energy ing this convention, ‘zzzz’ indicates an all-zero word, while ‘mmmx’ consumption penalties as a result of its use. In addition to data set indicates a partial match with one dictionary entry for which only compression, CRAMES also supports compression of arbitrary in- the lowest byte differs. RAM filesystems. To allow fast search and update operations, we maintain a hash- CRAMES has been implemented as a loadable Linux kernel mod- mapped dictionary. More specifically, the third byte of a word is ule for maximum portability and modularity, however it can easily hash-mapped to a 256 entry hash table, the contents of which are be ported to other modern OSs. The module was evaluated on a random indices that are within the range of the dictionary. Based next generation smartphone prototype and a battery-powered PDA on this hash function, we only need to consider four match patterns: using well-known batch benchmarks as well as interactive applica- ‘mmmm’ (full match), ‘mmmx’ (highest three bytes match), ‘mmxx’ tions with graphical user interfaces (GUIs). The results show that (highest two bytes match), and ‘xmxx’ (only the third byte matches). CRAMES is capable of dramatically increasing memory capacity Note that neither the hash table nor the dictionary need be stored with small performance and power consumption costs. with the compressed data. The hash table is static and the dynamic dictionary is regenerated automatically during decompression. 4. Pattern-based partial match compression We experimented with different dictionary sizes/layouts, e.g., 16- The original implementation of CRAMES used the LZO [10] algo- entry direct-mapped and 32-entry two-way set associative, etc. A rithm to compress data in memory. LZO is significantly faster than direct hash-mapped dictionary has the advantage of supporting fast many other general-purpose compression algorithms, e.g., LZW se- search and update: only a single hashing operation and lookup are ries algorithms, gzip, and bzip2. However, it is not designed for required per access. However, it has tightly limited memory. For memory compression and therefore does not fully exploit the regu- each hash target, only the most recently observed word is remem- larities of in-RAM data. In addition, LZO requires 64 KB of work- bered; the victim to be replaced is decided entirely by its hash tar- ing memory, a significant overhead on many memory-constrained get. In contrast, if a dictionary is maintained with move-to-front embedded systems. In summary, LZO is a general-purpose algo- strategy, its LRU entry is selected as the victim. Unfortunately, rithm with good compression ratio and performance. However, bet- searching in such a dictionary is slow. A set associative dictionary ter results are possible for the on-line memory compression appli- can enjoy the benefits of both LRU replacement and speed. When cation. In this paper, we analyze the regularities of in-RAM data a search miss followed by a dictionary update occurs, the oldest of and describe a new algorithm, named PBPM, that is extremely fast the dictionary entries sharing one hash target index is replaced. and well-suited for memory compression. As Figure 1 illustrates, zero words, ‘zzzz’, are the most frequent The PBPM algorithm is based on the observation that frequently- compressible pattern (38%), followed by one byte positive sign- encountered data patterns can be encoded with fewer bits to save extended words ‘zzzx’ (9.3%). ‘zxzx’ has a frequency of 2.8%. space. Scanning through the input data a word (32 bits) at a time, Other zero-related patterns are infrequent. As the dictionary size PBPM exploits patterns that occur frequently within each word of increases, dictionary match (including partial match) frequencies memory and searches for complete and partial matches with dic- do not increase much. While a set associative dictionary usually tionary entries to take advantage of the similarities among words. generates more matches than a direct hash-mapped dictionary with More specifically, (1) some patterns that are very frequent are en- the same overall size, a four-way set associative dictionary works coded using special bit sequences that are much shorter than the no better than a two-way set associative dictionary. original data, (2) patterns that do not fall into the above category 4.2. The PBPM compression algorithm and are found in a small dictionary are encoded using the index of their location in dictionary, and finally (3) patterns that do not fre- The PBPM compression and decompression algorithms are pre- quently occur and cannot be found in the dictionary are stored in sented in Algorithm 1. PBPM maintains a small two-way set asso- dictionary while the word contents are output. ciative dictionary DICT[] of 16 recently-seen words. An incoming word can fully match a dictionary entry, or match only the high- 4.1. In-RAM data patterns est three bytes or two bytes of a dictionary entry. These patterns Unlike general-purpose algorithms designed for text data, a special- occurred frequently during swap trace analysis. The patterns and purpose algorithm designed for in-RAM data compression must coding schemes are summarized in Table 1, which also reports the fully exploit the regularities present in memory. In-RAM data fre- actual frequency of each pattern observed in our swap data file when quently have certain patterns. For example, pages are usually zero- other infrequent patterns are ignored. In Algorithm 1 and column filled after being allocated. Therefore, runs of zeroes are commonly ‘Output’ of Table 1, ‘B’ represents a byte and ‘b’ represents a bit.
  • 3. 0.46 1.2e-004 4.0e-005 lzo lzo lzo Total memory allocated Average execution time pbpm pbpm 3.5e-005 pbpm Decompression Time (s) 0.44 1.0e-004 40 38 9 8.39 Compression Time (s) 8.25 lzrw lzrw lzrw Compression Ratio 0.42 3.0e-005 36 33 8 8.0e-005 32 7 2.5e-005 0.40 28 Time (second) Memory (MB) 6 6.0e-005 2.0e-005 24 0.38 5 4.47 1.5e-005 20 16 4.0e-005 16 4 0.36 1.0e-005 3 12 0.34 2.0e-005 2 5.0e-006 8 4 1 0.32 0.0e+000 0.0e+000 2048 4096 6144 8192 2048 4096 6144 8192 2048 4096 6144 8192 0 0 None CRAMES A−CRAMES None CRAMES A−CRAMES Block Size (bytes) Block Size (bytes) Block Size (bytes) Figure 2: Compression ratios and speeds of PBPM, LZO, and LZRW Figure 3: Performance of A-CRAMES Algorithm 1 PBPM (a) compression and (b) decompression We propose the following scheme to prevent on-line memory compression deadlock. CRAMES monitors the compressed area Require: IN, OUT word stream Require: IN, OUT word stream utilization and requests the allocation of a new memory chunk based Require: TAPE, INDX bit stream Require: TAPE, INDX bit stream Require: DATA byte stream Require: DATA byte stream on the saturation of current memory chunks. When the total amount 1: for word in range of IN do 1: Unpack(OUT) of memory in the compressed area is above a predefined fill ratio, 2: if word = zzzz then 2: for code in range of TAPE do CRAMES requests a new chunk from kernel. This request may 3: TAPE 00 3: if code = 00 then also be denied if the system memory is dangerously low. How- 4: else if word = zzzx then 4: OUT zzzz 5: TAPE 1100 5: else if code = 1100 then ever, even if this first request is denied, subsequent invocations 6: DATA B 6: B DATA 7: else if word = zxzx then 7: OUT zzzB of CRAMES will generate additional requests. After low-memory 8: TAPE 1111 8: else if code = 1111 then conditions cause applications to swap out pages to the compressed 9: DATA BB 9: BB DATA 10: else 10: OUT zBzB RAM device, more memory will be available and the preemptive 11: mmmm DICT[hash(word)] 11: else if code = 10 then compressed RAM device allocation requests will finally be suc- 12: if word = mmmm then 12: bbbb INDX cessful. We have experimented with different fill ratios and found 13: TAPE 10 13: OUT DICT[bbbb] 14: INDX bbbb 14: else if code = 1110 then that 7/8 is sufficient to permit successful requests of compressed 15: else if word = mmmx then 15: bbbb INDX RAM device memory allocation in all tested applications. This ra- 16: TAPE 1110 16: mmmm DICT[bbbb] 17: INDX bbbb 17: B DATA tio also results in little memory waste. Evaluation of this method 18: DATA B 18: OUT mmmB on a portable embedded system is presented in Section 6.2. 19: Insert word to DICT 19: Insert mmmB to DICT 20: else if word = mmxx then 20: else if code = 1101 then 21: TAPE 1101 21: bbbb INDX 22: INDX bbbb 22: mmmm DICT[bbbb] 6. Evaluation 23: DATA BB 23: BB DATA 24: Insert word to DICT 24: OUT mmBB In this section, we describe the evaluation methodology and results 25: else 25: Insert mmBB to DICT 26: TAPE 01 26: else if code = 01 then of the techniques proposed for high-performance on-line memory 27: DATA BBBB 27: BBBB DATA compression. More specifically, the following questions are exper- 28: Insert word to DICT 28: OUT BBBB 29: end if 29: Insert BBBB to DICT imentally evaluated: 30: end if 30: end if 31: end for 31: end for 1. Does PBPM provide a comparable compression ratio yet have 32: OUT Pack(TAPE,DATA,INDX) even lower performance costs than existing algorithms? 2. Does adaptive memory management enable CRAMES to pro- 5. Adaptive compressed memory management vide more memory to applications when system RAM is tightly constrained? As described in Section 3, CRAMES compresses the swapped-out 3. What is the overall performance of CRAMES using PBPM and data a page (4,096 bytes) at a time and stores them in a special com- adaptive memory management? pressed RAM device. Upon initialization, the compressed RAM device only requests a small memory chunk (usually 64 KB) and re- To evaluate the proposed techniques, we used a Sharp Zaurus SL- quests additional memory chunks as system memory requirements 5600 PDA. This battery-powered embedded system runs an embed- grow. Therefore, the size of a compressed RAM device dynam- ded version of Linux, has a 400 MHz Intel XScale PXA250 proces- ically adjusts during operation, adapting to the data memory re- sor, 32 MB of flash memory, and 32 MB of RAM. In our current quirements of the currently running applications. system configuration, 12 MB of RAM are used for uncompressed, battery-backed filesystem storage and 20 MB are available to kernel The above dynamic memory allocation strategy works well when and user applications. the system is not under extreme memory pressure. However, it suf- fers from a performance penalty when the system is dangerously 6.1. Quality and speed of the PBPM algorithm low on memory: applications and CRAMES compete for remain- We evaluated the compression ratio and speed of the PBPM algo- ing physical memory and there is no guarantee that an allocation rithm compared to two other compression algorithms that have been request will be satisfied. If physical memory is nearly exhausted, used for on-line memory compression: LZO and LZRW. Figure 2 an application may be unable to allocate additional memory. If illustrates the compression ratios (compressed block size divided CRAMES were able to allocate even a page of physical memory for its compressed memory region, it would be able to swap out by original block size) and execution times of evaluated algorithms. For these comparisons, the source file for compression is the swap (on average) two pages, allowing the application to proceed. How- data file (divided into uniform-sized blocks) used to identify the ever, CRAMES is in contention with the application, resulting in a frequent patterns in memory. The evaluation was performed on scenario we call on-line memory compression deadlock. a Linux Workstation with a 2.40 GHz Intel Pentium 4 processor. To avoid on-line memory compression deadlock, a compressed Note that OS-controlled on-line memory compression is a symmet- RAM device needs to be predictive during its requests for addi- ric application, i.e., a memory page is decompressed exactly once tional memory, i.e., it cannot wait until no existing chunks can al- every time it is compressed. Therefore, the overall, symmetric, per- locate a fit slot for the incoming data. This subtle issue comprises a formance of a compression algorithm is the critical performance difference between our technique and compressed caching as well metric. Overall, PBPM achieves a 200% speedup over LZO and as swap compression, in which hard drives serve as backing store LZRW. Yet the compression ratio achieved by PBPM is comparable to which pages can be moved as soon as (or even earlier than) the with that of LZO and LZRW. We believe that PBPM is especially compressed area is stuffed, so that the compressed memory is al- suitable for on-line memory compression because of its extremely ways available to applications. fast symmetric compression and good compression ratio.
  • 4. Benchmark Description RAM Adpcm Jpeg Mpeg2 Matrix Mul. (MB) w.o. LZO PBPM w.o. LZO PBPM w.o. LZO PBPM w.o. LZO PBPM Adpcm: Speech compression Execution Time (s) Source: MediaBench 8 4.83 1.69 1.43 0.71 0.26 0.23 79.35 80.30 77.96 n.a 39.26 38.68 Data size: 24 KB 9 3.69 1.35 1.26 0.44 0.21 0.21 76.80 76.83 74.04 n.a 37.40 38.24 10 1.41 1.34 1.36 0.23 0.21 0.21 79.06 76.93 75.32 59.11 39.56 37.18 Code size: 4 KB 11 1.37 1.40 1.40 0.26 0.25 0.21 80.57 76.81 76.83 44.44 38.42 42.65 12 1.37 1.31 1.32 0.24 0.21 0.19 76.79 76.94 76.95 41.72 38.73 43.96 Jpeg: Image encoding 20 1.31 1.30 1.30 0.23 0.21 0.22 76.60 76.77 76.76 43.02 41.41 42.97 Source: MediaBench Power Consumption (W) Data size: 176 KB 8 2.13 2.13 2.13 2.15 2.16 2.15 2.41 2.41 2.51 n.a 2.26 2.29 Code size: 72 KB 9 2.10 2.10 2.13 2.15 2.02 2.07 2.41 2.40 2.50 n.a 2.26 2.29 10 2.09 2.10 2.09 2.00 1.99 2.04 2.39 2.40 2.48 2.24 2.25 2.29 Mpeg2: Video CODEC 11 2.12 2.09 2.13 2.05 2.04 2.07 2.40 2.40 2.50 2.26 2.25 2.29 Source: MediaBench 12 2.09 2.13 2.11 2.03 2.05 2.10 2.40 2.41 2.55 2.25 2.25 2.29 Data size: 416 KB 20 2.11 2.09 2.18 2.15 2.02 2.24 2.42 2.43 2.57 2.28 2.27 2.29 Code size: 48 KB Energy Consumption (J) 8 10.34 3.60 3.04 1.51 0.56 0.49 190.99 193.42 195.71 n.a 88.74 88.62 Matrix Mul.: 512 by 512 matrix 9 7.75 2.84 2.68 0.94 0.42 0.43 185.38 184.55 185.10 n.a 84.70 87.64 multiplication 10 2.94 2.79 2.85 0.47 0.42 0.42 188.62 184.34 186.42 131.05 88.99 85.01 Data size: 2948 KB 11 2.89 2.93 2.97 0.54 0.52 0.44 193.10 184.69 191.94 100.01 86.38 97.79 12 2.86 2.79 2.79 0.49 0.43 0.41 184.45 185.74 196.33 93.65 86.94 100.81 Code size: 4 KB 20 2.75 2.72 2.82 0.48 0.43 0.49 185.72 186.56 197.26 98.27 94.07 98.39 Figure 4: Overall performance of CRAMES 6.2. Effectiveness of adaptive memory management resents a substantial improvement over LZO, for which the aver- In order to determine the effectiveness of adaptive memory man- age performance penalty is 9.5% and the worst-case performance agement in providing more memory to applications under signifi- penalty can be as high as 29%. cant memory pressure, we designed the following experiments. We wrote a ‘memeater’ program that continuously requests 1 MB of 7. Conclusions and acknowledgments memory at a time and fills the allocated memory with random num- High-performance OS controlled memory compression can assist bers (including zero runs with similar frequency to that observed embedded system designers to optimize hardware design for typi- in real swap traces), until an allocation request fails. Memeater cal software memory requirements while also supporting (sets of) was then executed on Zaurus under three different system settings: applications with larger data sets. In this paper, we proposed and without using CRAMES (none), using CRAMES without adap- evaluated a fast software-based compression algorithm for use in tive memory management (CRAMES), and using CRAMES with this application. This algorithm provides comparable compression adaptive memory management (A-CRAMES). Figure 3 presents ratios to existing algorithms used in on-line memory compression the total memory allocated and average execution times under the with significantly better symmetric performance. We also presented three system settings (memeater was executed multiple times un- an adaptive compressed memory management scheme to prevent der each setting). Without CRAMES, the system was only able to on-line memory compression deadlock, permitting reliable and ef- provide 16 MB of memory to memeater. With CRAMES, 33 MB ficient on-line memory compression for a wide range of applica- of memory were provided and the execution time was proportional tions. Experimental results indicate that the improved CRAMES to the amount of memory allocated, i.e., no delay was observed. allows applications to execute with only slight penalties even when Furthermore, when adaptive memory management is enabled (A- system RAM is reduced to 40% of its original size. CRAMES), 38 MB of memory were allocated with no additional We would like to acknowledge Dr. Srimat Chakradhar of NEC cost. These results support our claim in Section 5 that A-CRAMES Laboratories America for his support and technical advice. We helps to prevent on-line data compression deadlock. would also like to acknowledge Hui Ding of Northwestern Uni- 6.3. Overall performance of the improved CRAMES versity for his suggestions on efficiently implementing PBPM. Embedded system with high-performance on-line memory com- 8. References pression can be designed with less RAM and still support desired applications. In order to evaluate the impact of using CRAMES [1] L. Yang, et al., “CRAMES: Compressed RAM for embedded to reduce physical RAM, we artificially constrained the memory systems,” in Proc. Int. Conf. Hardware/Software Codesign and size of a Zaurus with a kernel module that permanently reserves System Synthesis, Sept. 2005. a certain amount of physical memory. The memory allocated to [2] J. Ziv and A. Lempel, “A universal algorithm for sequential data a kernel module cannot be swapped out and therefore is not com- compression,” IEEE Transactions on Information Theory, vol. 23, pressed by CRAMES. This guarantees the fairness of our compari- no. 3, pp. 337–343, 1977. son. With reduced physical RAM, we measured and compared the [3] B. Tremaine, et al., “IBM memory expansion technology,” IBM run times, power consumptions, and energy consumptions of four Journal of Research and Development, vol. 45, no. 2, Mar. 2001. [4] H. Lekatsas, J. Henkel, and W. Wolf, “Code compression for low batch benchmarks, i.e., three applications from MediaBench [11] power embedded system design,” in Proc. Design Automation Conf., (Adpcm, Jpeg, Mpeg2) and a 512 by 512 matrix multiplication ap- June 2000, pp. 294–299. plication. Figure 4 shows execution times, power consumptions, [5] F. Douglis, “The compression cache: Using on-line compression to and energy consumptions of benchmarks running without compres- extend physical memory,” in Proc. USENIX Conf., Jan. 1993. sion, with LZO compression, and with PBPM compression under [6] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis, “The case for different memory constraints. Note that adaptive memory manage- compressed caching in virtual memory systems,” in Proc. USENIX ment was enabled in both LZO and PBPM compression to ensure Conf., 1999, pp. 101–116. fair comparison. In our experiments, each benchmark was executed [7] L. Rizzo, “A very fast algorithm for RAM compression,” Operating multiple times; the average results are reported. Systems Review, vol. 31, no. 2, pp. 36–45, Apr. 1997. As shown in Figure 4, when system RAM was reduced to 8 MB, [8] I. C. Tuduce and T. Gross, “Adaptive main memory compression,” in without CRAMES, all benchmarks suffered from significant perfor- Proc. USENIX Conf., Apr. 2005. [9] M. Kjelso, M. Gooch, and S. Jones, “Performance evaluation of mance degradation; the 512 by 512 matrix multiplication couldn’t computer architectures with main memory data compression,” in J. even execute due to memory constraints. However, with the help Systems Architecture, vol. 45, 1999, pp. 571–590. of CRAMES, all benchmarks were able to execute with only slight [10] “LZO real-time data compression library,” performance and energy consumption penalties. Compared with http://www.oberhumer.com/opensource/lzo. the base case in which system RAM is 20 MB and CRAMES is not [11] C. Lee, M. Potkonjak, and W. H. M. Smith, “Mediabench: A tool for used, PBPM compression results in an average performance penalty evaluating and synthesizing multimedia and communications of 2.1% and a worst-case performance penalty of 9.2%. This rep- systems,” http://cares.icsl.ucla.edu/MediaBench.