Faster and Smaller N-Gram Language Models

Adam Pauls    Dan Klein
Computer Science Division
University of California, Berkeley
{adpauls,klein}@cs.berkeley.edu

Abstract

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.

1 Introduction

For modern statistical machine translation systems, language models must be both fast and compact. The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical. As always, trade-offs exist between time, space, and accuracy, with many recent papers considering small-but-approximate noisy LMs (Chazelle et al., 2004; Guthrie and Hepple, 2010) or small-but-slow compressed LMs (Germann et al., 2009).

In this paper, we present several lossless methods for compactly but efficiently storing large LMs in memory. As in much previous work (Whittaker and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context). After presenting a bit-conscious basic system that typifies such approaches, we improve on it in several ways. First, we show how the last word of each entry can be implicitly encoded, almost entirely eliminating its storage requirements. Second, we show that the deltas between adjacent entries can be efficiently encoded with simple variable-length encodings. Third, we investigate block-based schemes that minimize the amount of compressed-stream scanning during lookup.

To speed up our language models, we present two approaches. The first is a front-end cache. Caching itself is certainly not new to language modeling, but because well-tuned LMs are essentially lookup tables to begin with, naive cache designs only speed up slower systems. We present a direct-addressing cache with a fast key identity check that speeds up our systems (or existing fast systems like the widely-used, speed-focused SRILM) by up to 300%.

Our second speed-up comes from a more fundamental change to the language modeling interface. Where classic LMs take word tuples and produce counts or probabilities, we propose an LM that takes a word-and-context encoding (so the context need not be re-looked up) and returns both the probability and also the context encoding for the suffix of the original query. This setup substantially accelerates the scrolling queries issued by decoders, and also exploits language model state equivalence (Li and Khudanpur, 2008).

Overall, we are able to store the 4 billion n-grams of the Google Web1T (Brants and Franz, 2006) corpus, with associated counts, in 10 GB of memory, which is smaller than state-of-the-art lossy language model implementations (Guthrie and Hepple, 2010), and significantly smaller than the best published lossless implementation (Germann et al., 2009). We are also able to simultaneously outperform SRILM in both total size and speed. Our LM toolkit, which is implemented in Java and compatible with the standard ARPA file formats, is available on the web.[1]

[1] http://code.google.com/p/berkeleylm/
2 Preliminaries

Our goal in this paper is to provide data structures that map n-gram keys to values, i.e. probabilities or counts. Maps are fundamental data structures and generic implementations of mapping data structures are readily available. However, because of the sheer number of keys and values needed for n-gram language modeling, generic implementations do not work efficiently "out of the box." In this section, we will review existing techniques for encoding the keys and values of an n-gram language model, taking care to account for every bit of memory required by each implementation.

To provide absolute numbers for the storage requirements of different implementations, we will use the Google Web1T corpus as a benchmark. This corpus, which is on the large end of corpora typically employed in language modeling, is a collection of nearly 4 billion n-grams extracted from over a trillion tokens of English text, and has a vocabulary of about 13.5 million words.

2.1 Encoding Values

In the Web1T corpus, the most frequent n-gram occurs about 95 billion times. Storing this count explicitly would require 37 bits, but, as noted by Guthrie and Hepple (2010), the corpus contains only about 770 000 unique counts, so we can enumerate all counts using only 20 bits, and separately store an array called the value rank array which converts the rank encoding of a count back to its raw count. The additional array is small, requiring only about 3MB, but we save 17 bits per n-gram, reducing value storage from around 16GB to about 9GB for Web1T.

We can rank encode probabilities and back-offs in the same way, allowing us to be agnostic to whether we encode counts, probabilities and/or back-off weights in our model. In general, the number of bits per value required to encode all value ranks for a given language model will vary – we will refer to this variable as v.
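As a concrete illustration of the value rank array, the following minimal sketch rank-encodes counts with a sorted array of distinct values. The class and method names here are ours, not the released toolkit's API.

    import java.util.Arrays;

    // Sketch of rank-encoding values: each distinct raw count is stored once
    // in a small "value rank array", and each n-gram stores only its rank
    // (about 20 bits for Web1T instead of 37 bits for a raw count).
    class ValueRankArray {
        private final long[] rankToCount;   // sorted array of distinct raw counts

        ValueRankArray(long[] distinctCountsSorted) {
            this.rankToCount = distinctCountsSorted;
        }

        // Rank of a raw count: this is what gets stored per n-gram.
        int rankOf(long rawCount) {
            int r = Arrays.binarySearch(rankToCount, rawCount);
            if (r < 0) throw new IllegalArgumentException("unknown count");
            return r;
        }

        // Recover the raw count from a stored rank.
        long countOf(int rank) {
            return rankToCount[rank];
        }
    }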
2.2 Trie-Based Language Models

The data structure of choice for the majority of modern language model implementations is a trie (Fredkin, 1960). Tries or variants thereof are implemented in many LM toolkits, including SRILM (Stolcke, 2002), IRSTLM (Federico and Cettolo, 2007), CMU SLM (Whittaker and Raj, 2001), and MIT LM (Hsu and Glass, 2008). Tries represent collections of n-grams using a tree. Each node in the tree encodes a word, and paths in the tree correspond to n-grams in the collection. Tries ensure that each n-gram prefix is represented only once, and are very efficient when n-grams share common prefixes. Values can also be stored in a trie by placing them in the appropriate nodes.

Conceptually, trie nodes can be implemented as records that contain two entries: one for the word in the node, and one for either a pointer to the parent of the node or a list of pointers to children. At a low level, however, naive implementations of tries can waste significant amounts of space. For example, the implementation used in SRILM represents a trie node as a C struct containing a 32-bit integer representing the word, a 64-bit memory pointer[2] to the list of children, and a 32-bit floating point number representing the value stored at a node. The total storage for a node alone is 16 bytes, with additional overhead required to store the list of children. In total, the most compact implementation in SRILM uses 33 bytes per n-gram of storage, which would require around 116 GB of memory to store Web1T.

While it is simple to implement a trie node in this (already wasteful) way in programming languages that offer low-level access to memory allocation like C/C++, the situation is even worse in higher level programming languages. In Java, for example, C-style structs are not available, and records are most naturally implemented as objects that carry an additional 64 bits of overhead.

[2] While 32-bit architectures are still in use today, their limited address space is insufficient for modern language models and we will assume all machines use a 64-bit architecture.
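For concreteness, a hypothetical Java rendering of the naive node layout just described might look as follows. This is purely illustrative: the field sizes correspond to the costs discussed above, and as a Java object each node additionally pays object-header overhead on top of its fields.

    // Naive trie node along the lines of the SRILM layout described above.
    // In C the corresponding struct already costs 16 bytes per node
    // (4 + 8 + 4, padded); as a Java object it additionally carries header
    // overhead, which is the extra cost mentioned in the text.
    class NaiveTrieNode {
        int word;                  // 32-bit word id
        NaiveTrieNode[] children;  // 64-bit reference to the children
        float value;               // 32-bit probability or count
    }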
Despite its relatively large storage requirements, the implementation employed by SRILM is still widely in use today, largely because of its speed – to our knowledge, SRILM is the fastest freely available language model implementation. We will show that we can achieve access speeds comparable to SRILM but using only 25% of the storage.

2.3 Implicit Tries

A more compact implementation of a trie is described in Whittaker and Raj (2001). In their implementation, nodes in a trie are represented implicitly as entries in an array. Each entry encodes a word with enough bits to index all words in the language model (24 bits for Web1T), a quantized value, and a 32-bit[3] offset that encodes the contiguous block of the array containing the children of the node. Note that 32 bits is sufficient to index all n-grams in Web1T; for larger corpora, we can always increase the size of the offset.

Effectively, this representation replaces system-level memory pointers with offsets that act as logical pointers that can reference other entries in the array, rather than arbitrary bytes in RAM. This representation saves space because offsets require fewer bits than memory pointers, but more importantly, it permits straightforward implementation in any higher-level language that provides access to arrays of integers.[4]

[3] The implementation described in the paper represents each 32-bit integer compactly using only 16 bits, but this representation is quite inefficient, because determining the full 32-bit offset requires a binary search in a lookup table.

[4] Typically, programming languages only provide support for arrays of bytes, not bits, but it is of course possible to simulate arrays with arbitrary numbers of bits using byte arrays and bit manipulation.
2.4 Encoding n-grams

Hsu and Glass (2008) describe a variant of the implicit tries of Whittaker and Raj (2001) in which each node in the trie stores the prefix (i.e. parent). This representation has the property that we can refer to each n-gram w_1^n by its last word w_n and the offset c(w_1^{n-1}) of its prefix w_1^{n-1}, often called the context. At a low level, we can efficiently encode this pair (w_n, c(w_1^{n-1})) as a single 64-bit integer, where the first 24 bits refer to w_n and the last 40 bits encode c(w_1^{n-1}). We will refer to this encoding as a context encoding.
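A minimal sketch of this bit packing, assuming 24-bit word ids and 40-bit context offsets as for Web1T and placing the word in the high bits; the helper names are ours:

    // Pack a (word, context-offset) pair into one 64-bit key: the low
    // 40 bits hold the context offset, the high 24 bits hold the word id.
    final class ContextEncoding {
        static final int CONTEXT_BITS = 40;
        static final long CONTEXT_MASK = (1L << CONTEXT_BITS) - 1;

        static long encode(int word, long contextOffset) {
            return ((long) word << CONTEXT_BITS) | (contextOffset & CONTEXT_MASK);
        }

        static int word(long key) {
            return (int) (key >>> CONTEXT_BITS);
        }

        static long contextOffset(long key) {
            return key & CONTEXT_MASK;
        }
    }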
Note that typically, n-grams are encoded in tries in the reverse direction (first-rest instead of last-rest), which enables a more efficient computation of back-offs. In our implementations, we found that the speed improvement from switching to a first-rest encoding and implementing more efficient queries was modest. However, as we will see in Section 4.2, the last-rest encoding allows us to exploit the scrolling nature of queries issued by decoders, which results in speedups that far outweigh those achieved by reversing the trie.

3 Language Model Implementations

In the previous section, we reviewed well-known techniques in language model implementation. In this section, we combine these techniques to build simple data structures in ways that are to our knowledge novel, producing language models with state-of-the-art memory requirements and speed. We will also show that our data structures can be very effectively compressed by implicitly encoding the word w_n, and further compressed by applying a variable-length encoding on context deltas.
Figure 1: Our SORTED implementation of a trie. The dotted paths correspond to "the cat slept", "the cat ran", and "the dog ran". Each node in the trie is an entry in an array with 3 parts: w represents the word at the node; val represents the (rank-encoded) value; and c is an offset in the array of (n-1)-grams that represents the parent (prefix) of a node. Words are represented as offsets in the unigram array. Each entry uses 24 bits for w, 40 bits for c, and v bits for val, i.e. 64 bits per key. (Diagram omitted.)
3.1 Sorted Array

A standard way to implement a map is to store an array of key/value pairs, sorted according to the key. Lookup is carried out by performing binary search on a key. For an n-gram language model, we can apply this implementation with a slight modification: we need n sorted arrays, one for each n-gram order. We construct keys (w_n, c(w_1^{n-1})) using the context encoding described in the previous section, where the context offsets c refer to entries in the sorted array of (n-1)-grams. This data structure is shown graphically in Figure 1.

Because our keys are sorted according to their context-encoded representation, we cannot straightforwardly answer queries about an n-gram w without first determining its context encoding. We can do this efficiently by building up the encoding incrementally: we start with the context offset of the unigram w_1, which is simply its integer representation, and use that to form the context encoding of the bigram w_1^2 = (w_2, c(w_1)). We can find the offset of the bigram using binary search, and form the context encoding of the trigram, and so on. Note, however, that if our queries arrive in context-encoded form, queries are faster since they involve only one binary search in the appropriate array. We will return to this later in Section 4.2.

This implementation, SORTED, uses 64 bits for the integer-encoded keys and v bits for the values. Lookup is linear in the length of the key and logarithmic in the number of n-grams. For Web1T (v = 20), the total storage is 10.5 bytes/n-gram or about 37GB.
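The incremental lookup can be sketched as follows. This is a simplified illustration of the SORTED scheme with the 24/40-bit packing inlined; the array layout and names are ours, not the toolkit's API.

    import java.util.Arrays;

    // Sketch of SORTED lookup: one sorted array of 64-bit context-encoded
    // keys per n-gram order; the context offset of an n-gram is simply its
    // index in the array for its order.
    class SortedLmSketch {
        long[][] keys;        // keys[n] = sorted keys for (n+1)-grams, n >= 1
        int[][] valueRanks;   // parallel value ranks

        // Build the context encoding of w_1..w_n incrementally and return
        // the offset of the full n-gram, or -1 if it is not in the model.
        long offsetOf(int[] wordIds) {
            long contextOffset = wordIds[0];    // unigram offset = its word id
            for (int n = 1; n < wordIds.length; n++) {
                long key = ((long) wordIds[n] << 40) | contextOffset;
                int idx = Arrays.binarySearch(keys[n], key);
                if (idx < 0) return -1;         // n-gram not in the model
                contextOffset = idx;            // context offset for next order
            }
            return contextOffset;
        }
    }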
3.2 Hash Table

Hash tables are another standard way to implement associative arrays. To enable the use of our context encoding, we require an implementation in which we can refer to entries in the hash table via array offsets. For this reason, we use an open address hash map that uses linear probing for collision resolution.

As in the sorted array implementation, in order to insert an n-gram w_1^n into the hash table, we must form its context encoding incrementally from the offset of w_1. However, unlike the sorted array implementation, at query time, we only need to be able to check equality between the query key w_1^n = (w_n, c(w_1^{n-1})) and a key w'_1^n = (w'_n, c(w'_1^{n-1})) in the table. Equality can easily be checked by first checking if w_n = w'_n, then recursively checking equality between w_1^{n-1} and w'_1^{n-1}, though again, equality is even faster if the query is already context-encoded.

This HASH data structure also uses 64 bits for integer-encoded keys and v bits for values. However, to avoid excessive hash collisions, we also allocate additional empty space according to a user-defined parameter that trades off speed and space – we used about 40% extra space in our experiments. For Web1T, the total storage for this implementation is 15 bytes/n-gram or about 53 GB total.

Lookup in a hash map is linear in the length of an n-gram and constant with respect to the number of n-grams. Unlike the sorted array implementation, the hash table implementation also permits efficient insertion and deletion, making it suitable for stream-based language models (Levenberg and Osborne, 2009).
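A minimal sketch of such an open-addressing map with linear probing; the capacity, mixing constant, and names are illustrative, not the toolkit's:

    // Sketch of the HASH implementation: open addressing with linear probing
    // over an array of context-encoded keys (one such table per n-gram order).
    // EMPTY marks unused slots; we assume no real key takes that value.
    class OpenAddressHash {
        static final long EMPTY = -1L;
        final long[] keys;        // context-encoded keys
        final int[] valueRanks;

        OpenAddressHash(int capacity) {   // capacity ~ 1.4x the key count
            keys = new long[capacity];
            valueRanks = new int[capacity];
            java.util.Arrays.fill(keys, EMPTY);
        }

        private int bucket(long key) {
            long h = key * 0x9E3779B97F4A7C15L;   // any decent 64-bit mixer
            return (int) Math.floorMod(h, (long) keys.length);
        }

        void put(long key, int valueRank) {
            int i = bucket(key);
            while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) % keys.length;
            keys[i] = key;
            valueRanks[i] = valueRank;
        }

        // Returns the slot index (usable as a context offset), or -1 if absent.
        int find(long key) {
            int i = bucket(key);
            while (keys[i] != EMPTY) {
                if (keys[i] == key) return i;
                i = (i + 1) % keys.length;
            }
            return -1;
        }
    }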
3.3 Implicitly Encoding w_n

The context encoding we have used thus far still wastes space. This is perhaps most evident in the sorted array representation (see Figure 1): all n-grams ending with a particular word w_i are stored contiguously. We can exploit this redundancy by storing only the context offsets in the main array, using as many bits as needed to encode all context offsets (32 bits for Web1T). In auxiliary arrays, one for each n-gram order, we store the beginning and end of the range of the trie array in which all (w_i, c) keys are stored for each w_i. These auxiliary arrays are negligibly small – we only need to store 2n offsets for each word.

The same trick can be applied in the hash table implementation. We allocate contiguous blocks of the main array for n-grams which all share the same last word w_i, and distribute keys within those ranges using the hashing function.

This representation reduces memory usage for keys from 64 bits to 32 bits, reducing overall storage for Web1T to 6.5 bytes/n-gram for the sorted implementation and 9.1 bytes for the hashed implementation, or about 23GB and 32GB in total. It also increases query speed in the sorted array case, since to find (w_i, c), we only need to search the range of the array over which w_i applies. Because this implicit encoding reduces memory usage without a performance cost, we will assume its use for the rest of this paper.
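For the sorted variant, the effect of this trick on lookup can be sketched as follows, assuming the per-word range arrays described above; the names are ours:

    import java.util.Arrays;

    // Sketch of the implicit-word-encoding trick for the sorted variant: the
    // main array stores only 32-bit context offsets, and per-word range
    // arrays record where each word's contiguous block begins and ends.
    class ImplicitWordSortedOrder {
        int[] contextOffsets;   // main array: only the context part of each key
        int[] wordBlockStart;   // block for word w is [wordBlockStart[w], wordBlockEnd[w])
        int[] wordBlockEnd;

        // Find the offset of the n-gram (word, contextOffset), or -1 if absent.
        int find(int word, int contextOffset) {
            int lo = wordBlockStart[word], hi = wordBlockEnd[word];
            int idx = Arrays.binarySearch(contextOffsets, lo, hi, contextOffset);
            return idx >= 0 ? idx : -1;
        }
    }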
3.4 A Compressed Implementation

3.4.1 Variable-Length Coding

The distribution of value ranks in language modeling is Zipfian, with far more n-grams having low counts than high counts. If we ensure that the value rank array sorts raw values by descending order of frequency, then we expect that small ranks will occur much more frequently than large ones, which we can exploit with a variable-length encoding.

To compress n-grams, we can exploit the context encoding of our keys. In Figure 2, we show a portion of the key array used in our sorted array implementation. While we have already exploited the fact that the 24 word bits repeat in the previous section, we note here that consecutive context offsets tend to be quite close together. We found that for 5-grams, the median difference between consecutive offsets was about 50, and 90% of offset deltas were smaller than 10000. By using a variable-length encoding to represent these deltas, we should require far fewer than 32 bits to encode context offsets.

Figure 2: Compression using variable-length encoding. (a) A snippet of an (uncompressed) context-encoded array. (b) The context and word deltas. (c) The number of bits required to encode the context and word deltas as well as the value ranks. Word deltas use variable-length block coding with k = 1, while context deltas and value ranks use k = 2. (d) A snippet of the compressed encoding array. The header is outlined in bold. (Diagram omitted.)

We used a very simple variable-length coding to encode offset deltas, word deltas, and value ranks. Our encoding, which is referred to as "variable-length block coding" in Boldi and Vigna (2005), works as follows: we pick a (configurable) radix r = 2^k. To encode a number m, we determine the number of digits d required to express m in base r. We write d in unary, i.e. d - 1 zeroes followed by a one. We then write the d digits of m in base r, each of which requires k bits. For example, using k = 2, we would encode the decimal number 7 as 010111. We can choose k separately for deltas and value indices, and also tune these parameters to a given language model.
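A direct transcription of this code into Java, producing the bit string for illustration; the class is ours and simply follows the description above:

    // Variable-length block coding (Boldi and Vigna, 2005) as described
    // above: for radix r = 2^k, write the number of base-r digits d in
    // unary (d-1 zeroes then a one), then the d digits, k bits each.
    // The result is returned as a String of '0'/'1' purely for illustration.
    final class VariableLengthBlockCode {
        static String encode(long m, int k) {
            long r = 1L << k;
            int d = 1;                                 // count base-r digits of m
            for (long t = m; t >= r; t /= r) d++;
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < d - 1; i++) out.append('0');   // unary length
            out.append('1');
            for (int i = d - 1; i >= 0; i--) {                 // digits, high to low
                long digit = (m >>> (i * k)) & (r - 1);
                for (int b = k - 1; b >= 0; b--) out.append((digit >>> b) & 1);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(encode(7, 2));   // prints 010111, as in the text
        }
    }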
We found this encoding outperformed other standard prefix codes, including Golomb codes (Golomb, 1966; Church et al., 2007) and Elias γ and δ codes. We also experimented with the ζ codes of Boldi and Vigna (2005), which modify variable-length block codes so that they are optimal for certain power law distributions. We found that ζ codes performed no better than variable-length block codes and were slightly more complex. Finally, we found that Huffman codes outperformed our encoding slightly, but came at a much higher computational cost.

3.4.2 Block Compression

We could in principle compress the entire array of key/value pairs with the encoding described above, but this would render binary search in the array impossible: we cannot jump to the mid-point of the array since in order to determine what key lies at a particular point in the compressed bit stream, we would need to know the entire history of offset deltas.

Instead, we employ block compression, a technique also used by Harb et al. (2009) for smaller language models. In particular, we compress the key/value array in blocks of 128 bytes. At the beginning of the block, we write out a header consisting of: an explicit 64-bit key that begins the block; a 32-bit integer representing the offset of the header key in the uncompressed array;[5] the number of bits of compressed data in the block; and the variable-length encoding of the value rank of the header key. The remainder of the block is filled with as many compressed key/value pairs as possible. Once the block is full, we start a new block. See Figure 2 for a depiction.

When we encode an offset delta, we store the delta of the word portion of the key separately from the delta of the context offset. When an entire block shares the same word portion of the key, we set a single bit in the header that indicates that we do not encode any word deltas.

To find a key in this compressed array, we first perform binary search over the header blocks (which are predictably located every 128 bytes), followed by a linear search within a compressed block.

Using k = 6 for encoding offset deltas and k = 5 for encoding value ranks, this COMPRESSED implementation stores Web1T in less than 3 bytes per n-gram, or about 10.2GB in total. This is about 6GB less than the storage required by Germann et al. (2009), which is the best published lossless compression to date.

[5] We need this because n-grams refer to their contexts using array offsets.
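The header-level part of this two-phase lookup can be sketched as follows; decoding the deltas inside the chosen 128-byte block then follows the variable-length code above. The names are ours, not the toolkit's.

    // Sketch of the first phase of lookup in the COMPRESSED implementation:
    // block headers store an explicit key and are evenly spaced, so we can
    // binary search them to find the only block that can contain a query.
    class CompressedBlockIndex {
        long[] headerKeys;    // explicit 64-bit key that begins each block
        int[] headerOffsets;  // offset of that key in the uncompressed array

        // Index of the block whose header key is the largest one <= key.
        int blockFor(long key) {
            int lo = 0, hi = headerKeys.length - 1, ans = 0;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                if (headerKeys[mid] <= key) { ans = mid; lo = mid + 1; }
                else hi = mid - 1;
            }
            return ans;   // caller then linearly decodes this block's stream
        }
    }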
4 Speeding up Decoding

In the previous section, we provided compact and efficient implementations of associative arrays that allow us to query a value for an arbitrary n-gram. However, decoders do not issue language model requests at random. In this section, we show that language model requests issued by a standard decoder exhibit two patterns we can exploit: they are highly repetitive, and also exhibit a scrolling effect.

4.1 Exploiting Repetitive Queries

In a simple experiment, we recorded all of the language model queries issued by the Joshua decoder (Li et al., 2009) on a 100 sentence test set. Of the 31 million queries, only about 1 million were unique. Therefore, we expect that keeping the results of language model queries in a cache should be effective at reducing overall language model latency.

To this end, we added a very simple cache to our language model. Our cache uses an array of key/value pairs with size fixed to 2^b - 1 for some integer b (we used 24). We use a b-bit hash function to compute the address in an array where we will always place a given n-gram and its fully computed language model score. Querying the cache is straightforward: we check the address of a key given by its b-bit hash. If the key located in the cache array matches the query key, then we return the value stored in the cache. Otherwise, we fetch the language model probability from the language model and place the new key and value in the cache, evicting the old key in the process. This scheme is often called a direct-mapped cache because each key has exactly one possible address.
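A minimal sketch of such a direct-mapped cache; for simplicity this version uses 2^b slots and an arbitrary hash mix, and the names are ours:

    // Direct-mapped LM cache as described above: a 2^b-entry array of
    // (key, score) pairs addressed by a b-bit hash; each key has exactly
    // one possible slot, and a new entry simply evicts the old one.
    class DirectMappedLmCache {
        private final long[] keys;
        private final float[] scores;
        private final int mask;

        DirectMappedLmCache(int b) {              // the paper uses b = 24
            keys = new long[1 << b];
            scores = new float[1 << b];
            java.util.Arrays.fill(keys, Long.MIN_VALUE);   // "empty" marker
            mask = (1 << b) - 1;
        }

        private int slot(long key) {
            long h = key * 0x9E3779B97F4A7C15L;   // cheap 64-bit mix
            return (int) (h >>> 40) & mask;       // take b of the high bits
        }

        // Returns a cached score, or NaN on a miss.
        float get(long key) {
            int i = slot(key);
            return keys[i] == key ? scores[i] : Float.NaN;
        }

        void put(long key, float score) {
            int i = slot(key);
            keys[i] = key;         // evict whatever was there
            scores[i] = score;
        }
    }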
Caching n-grams in this way reduces overall latency for two reasons: first, lookup in the cache is extremely fast, requiring only a single evaluation of the hash function, one memory lookup to find the cache key, and one equality check on the key. In contrast, even our fastest (HASH) implementation may have to perform multiple memory lookups and equality checks in order to resolve collisions. Second, when calculating the probability for an n-gram not in the language model, language models with back-off schemes must in general perform multiple queries to fetch the necessary back-off information. Our cache retains the full result of these calculations and thus saves additional computation.

Federico and Cettolo (2007) also employ a cache in their language model implementation, though based on a traditional hash table cache with linear probing. Unlike our cache, which is of fixed size, their cache must be cleared after decoding a sentence. We would not expect a large performance increase from such a cache for our faster models since our HASH implementation is already a hash table with linear probing. We found in our experiments that a cache using linear probing provided marginal performance increases of about 40%, largely because of cached back-off computation, while our simpler cache increases performance by about 300% even over our HASH LM implementation. More timing results are presented in Section 5.
4.2 Exploiting Scrolling Queries

Decoders with integrated language models (Och and Ney, 2004; Chiang, 2005) score partial translation hypotheses in an incremental way. Each partial hypothesis maintains a language model context consisting of at most n - 1 target-side words. When we combine two language model contexts, we create several new n-grams of length n, each of which generates a query to the language model. These new n-grams exhibit a scrolling effect, shown in Figure 3: the n - 1 suffix words of one n-gram form the n - 1 prefix words of the next.

Figure 3: Queries issued when scoring trigrams that are created when a state with LM context "the cat" combines with "fell down". In the standard explicit representation of an n-gram as a list of words, queries are issued atomically to the language model. When using a context encoding, a query from the n-gram "the cat fell" returns the context offset of "cat fell", which speeds up the query of "cat fell down". (Diagram omitted.)

As discussed in Section 3, our LM implementations can answer queries about context-encoded n-grams faster than explicitly encoded n-grams. With this in mind, we augment the values stored in our language model so that for a key (w_n, c(w_1^{n-1})), we store the offset of the suffix c(w_2^n) as well as the normal counts/probabilities. Then, rather than represent the LM context in the decoder as an explicit list of words, we can simply store context offsets. When we query the language model, we get back both a language model score and a context offset c(ŵ_1^{n-1}), where ŵ_1^{n-1} is the longest suffix of w_1^{n-1} contained in the language model. We can then quickly form the context encoding of the next query by simply concatenating the new word with the offset c(ŵ_1^{n-1}) returned from the previous query.

In addition to speeding up language model queries, this approach also automatically supports an equivalence of LM states (Li and Khudanpur, 2008): in standard back-off schemes, whenever we compute the probability for an n-gram (w_n, c(w_1^{n-1})) when w_1^{n-1} is not in the language model, the result will be the same as the result of the query (w_n, c(ŵ_1^{n-1})). It is therefore only necessary to store as much of the context as the language model contains instead of all n - 1 words in the context. If a decoder maintains LM states using the context offsets returned by our language model, then the decoder will automatically exploit this equivalence and the size of the search space will be reduced. This same effect is exploited explicitly by some decoders (Li and Khudanpur, 2008).
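A sketch of the interface this implies is shown below. It is illustrative only; the names are not the released toolkit's API.

    // Scrolling-query interface as described above: a query takes the new
    // word plus the context offset returned by the previous query, and
    // returns both a score and the offset of the longest matching suffix,
    // which becomes the LM state for the next query.
    interface ScrollingLm {
        final class Result {
            final float logProb;
            final long suffixContextOffset;
            Result(float logProb, long suffixContextOffset) {
                this.logProb = logProb;
                this.suffixContextOffset = suffixContextOffset;
            }
        }

        long contextOffsetOf(int wordId);             // offset of a unigram context
        Result score(long contextOffset, int wordId); // one scrolling query
    }

    // Scoring a phrase then reads as a chain of scrolling queries
    // (the identifiers below stand for word ids):
    //   long ctx = lm.contextOffsetOf(the);
    //   ScrollingLm.Result r1 = lm.score(ctx, cat);                   // "the cat"
    //   ScrollingLm.Result r2 = lm.score(r1.suffixContextOffset, fell);
    //   ScrollingLm.Result r3 = lm.score(r2.suffixContextOffset, down);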
5 Experiments

5.1 Data

To test our LM implementations, we performed experiments with two different language models. Our first language model, WMT2010, was a 5-gram Kneser-Ney language model which stores probability/back-off pairs as values. We trained this language model on the English side of all French-English corpora provided[6] for use in the WMT 2010 workshop, about 2 billion tokens in total. This data was tokenized using the tokenizer.perl script provided with the data. We trained the language model using SRILM. We also extracted a count-based language model, WEB1T, from the Web1T corpus (Brants and Franz, 2006). Since this data is provided as a collection of 1- to 5-grams and associated counts, we used this data without further preprocessing. The makeup of these language models is shown in Table 1.

[6] www.statmt.org/wmt10/translation-task.html

            WMT2010                          WEB1T
  Order     #n-grams          Order          #n-grams
  1gm         4,366,395       1gm            13,588,391
  2gm        61,865,588       2gm           314,843,401
  3gm       123,158,761       3gm           977,069,902
  4gm       217,869,981       4gm         1,313,818,354
  5gm       269,614,330       5gm         1,176,470,663
  Total     676,875,055       Total       3,795,790,711

Table 1: Sizes of the two language models used in our experiments.

5.2 Compression Experiments

We tested our three implementations (HASH, SORTED, and COMPRESSED) on the WMT2010 language model. For this language model, there are about 80 million unique probability/back-off pairs, so v ≈ 36. Note that here v includes both the cost per key of storing the value rank as well as the (amortized) cost of storing two 32-bit floating point numbers (probability and back-off) for each unique value. The results are shown in Table 2.

  LM Type       bytes/key   bytes/value   bytes/n-gram   Total Size
  SRILM-H           –           –            42.2          26.6G
  SRILM-S           –           –            33.5          21.1G
  HASH             5.6         6.0           11.6           7.5G
  SORTED           4.0         4.5            8.5           5.5G
  TPT               –           –             7.5**         4.7G**
  COMPRESSED       2.1         3.8            5.9           3.7G

Table 2: Memory usages of several language model implementations on the WMT2010 language model. A ** indicates that the storage in bytes per n-gram is reported for a different language model of comparable size, and the total size is thus a rough projection.

We compare against three baselines. The first two, SRILM-H and SRILM-S, refer to the hash table- and sorted array-based trie implementations provided by SRILM. The third baseline is the Tightly-Packed Trie (TPT) implementation of Germann et al. (2009). Because this implementation is not freely available, we use their published memory usage in bytes per n-gram on a language model of similar size and project total usage.

The memory usage of all of our models is considerably smaller than SRILM – our HASH implementation is about 25% the size of SRILM-H, and our SORTED implementation is about 25% the size of SRILM-S. Our COMPRESSED implementation is also smaller than the state-of-the-art compressed TPT implementation.

In Table 3, we show the results of our COMPRESSED implementation on WEB1T against two baselines. The first is compression of the ASCII text count files using gzip, and the second is the Tiered Minimal Perfect Hash (T-MPHR) of Guthrie and Hepple (2010). The latter is a lossy compression technique based on Bloomier filters (Chazelle et al., 2004) and additional variable-length encoding that achieves the best published compression of WEB1T to date. Our COMPRESSED implementation is even smaller than T-MPHR, despite using a lossless compression technique. Note that since T-MPHR uses a lossy encoding, it is possible to reduce the storage requirements arbitrarily at the cost of additional errors in the model. We quote here the storage required when keys[7] are encoded using 12-bit hash codes, which gives a false positive rate of about 2^-12 = 0.02%.

[7] Guthrie and Hepple (2010) also report additional savings by quantizing values, though we could perform the same quantization in our storage scheme.

  LM Type       bytes/key   bytes/value   bytes/n-gram   Total Size
  Gzip              –           –             7.0          24.7G
  T-MPHR†           –           –             3.0          10.5G
  COMPRESSED       1.3         1.6            2.9          10.2G

Table 3: Memory usages of several language model implementations on WEB1T. A † indicates lossy compression.
5.3 Timing Experiments

We first measured pure query speed by logging all LM queries issued by a decoder and measuring the time required to query those n-grams in isolation. We used the Joshua decoder[8] with the WMT2010 model to generate queries for the first 100 sentences of the French 2008 News test set. This produced about 30 million queries. We measured the time[9] required to perform each query in order, with and without our direct-mapped caching, not including any time spent on file I/O.

[8] We used a grammar trained on all French-English data provided for WMT 2010 using the make scripts provided at http://sourceforge.net/projects/joshua/files/joshua/1.3/wmt2010-experiment.tgz/download

[9] All experiments were performed on an Amazon EC2 High-Memory Quadruple Extra Large instance, with an Intel Xeon X5550 CPU running at 2.67GHz and 8 MB of cache.

The results are shown in Table 4. As expected, HASH is the fastest of our implementations, and comparable[10] in speed to SRILM-H, but using significantly less space. SORTED is slower but of course more memory efficient, and COMPRESSED is the slowest but also the most compact representation. In HASH+SCROLL, we issued queries to the language model using the context encoding, which speeds up queries substantially. Finally, we note that our direct-mapped cache is very effective. The query speed of all models is boosted substantially. In particular, our COMPRESSED implementation with caching is nearly as fast as SRILM-H without caching, and even the already fast HASH implementation is 300% faster in raw query speed with caching enabled.

[10] Because we implemented our LMs in Java, we issued queries to SRILM via Java Native Interface (JNI) calls, which introduces a performance overhead. When called natively, we found that SRILM was about 200 ns/query faster. Unfortunately, it is not completely fair to compare our LMs against either of these numbers: although the JNI overhead slows down SRILM, implementing our LMs in Java instead of C++ slows down our LMs. In the tables, we quote times which include the JNI overhead, since this reflects the true cost to a decoder written in Java (e.g. Joshua).

  LM Type        No Cache      Cache       Size
  COMPRESSED     9264±73ns     565±7ns     3.7G
  SORTED         1405±50ns     243±4ns     5.5G
  HASH            495±10ns     179±6ns     7.5G
  SRILM-H         428±5ns      159±4ns    26.6G
  HASH+SCROLL     323±5ns      139±6ns    10.5G

Table 4: Raw query speeds of various language model implementations. Times were averaged over 3 runs on the same machine. For HASH+SCROLL, all queries were issued to the decoder in context-encoded form, which speeds up queries that exhibit scrolling behaviour. Note that memory usage is higher than for HASH because we store suffix offsets along with the values for an n-gram.

We also measured the effect of LM performance on overall decoder performance. We modified Joshua to optionally use our LM implementations during decoding, and measured the time required to decode all 2051 sentences of the 2008 News test set. The results are shown in Table 5. Without caching, SRILM-H and HASH were comparable in speed, while COMPRESSED introduces a performance penalty. With caching enabled, overall decoder speed is improved for both HASH and SRILM-H, while the COMPRESSED implementation is only about 50% slower than the others.

  LM Type        No Cache      Cache       Size
  COMPRESSED     9880±82s      1547±7s     3.7G
  SRILM-H        1120±26s       938±11s   26.6G
  HASH           1146±8s        943±16s    7.5G

Table 5: Full decoding times for various language model implementations. Our HASH LM is as fast as SRILM while using 25% of the memory. Our caching also reduces total decoding time by about 20% for our fastest models and speeds up COMPRESSED by a factor of 6. Times were averaged over 3 runs on the same machine.

6 Conclusion

We have presented several language model implementations which are state-of-the-art in both size and speed. Our experiments have demonstrated improvements in query speed over SRILM and compression rates against state-of-the-art lossy compression. We have also described a simple caching technique which leads to performance increases in overall decoding time.

Acknowledgements

This work was supported by a Google Fellowship for the first author and by BBN under DARPA contract HR0011-06-C-0022. We would like to thank David Chiang, Zhifei Li, and the anonymous reviewers for their helpful comments.
References

Paolo Boldi and Sebastiano Vigna. 2005. Codes for the world wide web. Internet Mathematics, 2.

Thorsten Brants and Alex Franz. 2006. Google Web1T 5-gram corpus, version 1. Linguistic Data Consortium, Philadelphia, Catalog Number LDC2006T13.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld, and Ayellet Tal. 2004. The Bloomier filter: an efficient data structure for static support lookup tables. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In The Annual Conference of the Association for Computational Linguistics.

Kenneth Church, Ted Hart, and Jianfeng Gao. 2007. Compressing trigram language models with Golomb coding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Marcello Federico and Mauro Cettolo. 2007. Efficient handling of n-gram language models for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation.

Edward Fredkin. 1960. Trie memory. Communications of the ACM, 3:490–499, September.

Ulrich Germann, Eric Joanis, and Samuel Larkin. 2009. Tightly packed tries: how to fit large models into memory, and make them load fast, too. In Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing.

S. W. Golomb. 1966. Run-length encodings. IEEE Transactions on Information Theory, 12.

David Guthrie and Mark Hepple. 2010. Storing the web in memory: space efficient language models with constant time retrieval. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Boulos Harb, Ciprian Chelba, Jeffrey Dean, and Sanjay Ghemawat. 2009. Back-off language model compression. In Proceedings of Interspeech.

Bo-June Hsu and James Glass. 2008. Iterative language model estimation: efficient data structure and algorithms. In Proceedings of Interspeech.

Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for SMT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Zhifei Li and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren N. G. Thornton, Jonathan Weese, and Omar F. Zaidan. 2009. Joshua: an open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449, December.

Andreas Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Proceedings of Interspeech.

E. W. D. Whittaker and B. Raj. 2001. Quantization-based language model compression. In Proceedings of Eurospeech.
