Faster and smaller inverted indices with Treaps Research Paper
1. Faster and Smaller
Inverted Indices with
Treaps
SD Nelson 148232M
DMI De Silva 148207R
2. Outline
• Introduction
• Basic Concepts
• Related Work
• Treap Usage
• Experiments & Results
• Conclusions
3. Introduction
• New representation of the inverted index, based on the
treap data structure
• Two main challenges in modern information retrieval
systems
• Manage huge amounts of data
• Return very precise results in response to user queries
• Two-stage ranking process
• Fast, simple filtering that extracts a few hundred or thousand
candidates from billions of documents
• Complex learned ranking that reduces the candidate set further
• Focus on improving the efficiency of the first stage
4. Introduction
• Two approaches for first stage
• Ranked intersection
Boolean intersection & computation of scores for documents
• Ranked union
Approximate form, avoiding a costly Boolean union
• New compressed representation for posting lists
• Performs ranked intersections & (exact) unions directly
• Based on the Treap data structure
• Allows both document identifiers and weights to be
differentially encoded
5. Basic Concepts
• Inverted index for efficient processing of ranked and
Boolean queries
• The index stores the vocabulary of the collection and, for each
term, a posting list holding
• Document identifier (docid)
• Weight of the term in the document
• Compression is achieved by differentially encoding
either the document identifiers or the weights
• New in-memory posting list implementation instead of
traditional on-disk storage
6. Related Work
• Two query processing strategies
• Term-at-a-time (TAAT) - one posting list after the other, shortest
to longest
• Document-at-a-time (DAAT) - lists are processed in parallel
looking for the same document in all.
• Ranked intersection strategies employ full Boolean
intersection
• followed by a post processing step for ranking
• Strategies used for ranked union and intersection queries
in the paper can be classified as DAAT
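The DAAT idea above can be sketched as a Boolean intersection that moves one cursor per list in parallel and sums the weights of every document found in all lists. This is only an illustration: the function name `daat_intersection`, the additive score, and the sample lists are assumptions, not details from the paper.

```python
def daat_intersection(lists):
    """Document-at-a-time intersection.

    lists: one list of (docid, weight) pairs per query term, each sorted
    by increasing docid. Returns (docid, summed weight) for every docid
    present in all lists.
    """
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)          # one cursor per posting list
    result = []
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        current = [lst[p][0] for p, lst in zip(pos, lists)]
        target = max(current)       # the largest docid under any cursor
        if all(d == target for d in current):
            # Same document in every list: score it and move all cursors.
            score = sum(lst[p][1] for p, lst in zip(pos, lists))
            result.append((target, score))
            pos = [p + 1 for p in pos]
        else:
            # Advance every cursor that is still behind the target docid.
            for i, lst in enumerate(lists):
                while pos[i] < len(lst) and lst[pos[i]][0] < target:
                    pos[i] += 1
    return result


# Hypothetical posting lists for two query terms.
a = [(1, 2), (3, 5), (7, 3)]
b = [(3, 1), (4, 2), (7, 4)]
matches = daat_intersection([a, b])
print(matches)
# Post-processing step: rank the matches by score and keep the best one.
print(sorted(matches, key=lambda m: m[1], reverse=True)[:1])
```

The final sort plays the role of the post-processing ranking step; a real system would typically use skipping structures instead of the linear cursor advance shown here.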
7. Related Work
• Two approaches: Block-Max
• Special-purpose structure for ranked intersections and unions
• Sorts each list by increasing docid, cuts it into blocks, and
stores the maximum weight for each block
• Allows whole blocks to be skipped when their maximum possible
contribution is too low, by comparing the block's maximum weight
with a threshold
• Obtains considerable performance gains over previous
techniques for exact ranked unions and ranked intersections
• The new technique can be seen as a generalization of the
Block-Max concept
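The block-skipping idea can be sketched for a single list as follows. This is a simplified illustration, not the Block-Max algorithm itself: the helper names `build_blocks` and `postings_above`, the block size, and the fixed threshold are all assumptions made for the example.

```python
def build_blocks(postings, block_size):
    """Cut a docid-sorted list of (docid, weight) pairs into blocks and
    store each block together with its maximum weight."""
    blocks = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        blocks.append((max(w for _, w in chunk), chunk))
    return blocks


def postings_above(blocks, threshold):
    """Return postings whose weight exceeds the threshold, skipping every
    block whose stored maximum cannot exceed it."""
    hits = []
    skipped = 0
    for block_max, chunk in blocks:
        if block_max <= threshold:
            skipped += 1            # no posting in this block can qualify
            continue
        hits.extend((d, w) for d, w in chunk if w > threshold)
    return hits, skipped


# Hypothetical posting list, blocks of two postings each.
postings = [(1, 2), (3, 5), (4, 1), (7, 3), (9, 1), (12, 2)]
blocks = build_blocks(postings, 2)
print(postings_above(blocks, 2))
```

In query processing the threshold would rise as better candidates are found, so more blocks become skippable over time.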
8. Related Work
• Two approaches: Dual-sorted inverted lists
• Lists sorted by decreasing frequency, using a wavelet tree data
structure
• TAAT processing for approximate ranked unions, DAAT-like
processing for (exact) ranked intersections
• Able to sort by both docids and weights simultaneously
• Not aware of the frequencies until the individual
documents are reached
• Treaps give an upper bound on the frequencies in the
current interval
• Treaps use less space: Dual-Sorted cannot use differential
encoding on docids
9. TREAPS - Basic Usage
• Treap representation of a posting list.
• Search key – document id
• Max heap property – term frequency (weight)
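A minimal sketch of how a posting list could be turned into such a treap, assuming the postings arrive as (docid, frequency) pairs sorted by docid. The names `Node`, `build_treap`, and `inorder` are illustrative, and the simple recursive construction is chosen for clarity rather than speed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    docid: int                      # search key (binary-search-tree order)
    freq: int                       # priority (max-heap order)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_treap(postings: List[Tuple[int, int]]) -> Optional[Node]:
    """Build a treap from (docid, freq) pairs sorted by increasing docid."""
    if not postings:
        return None
    # The root is a posting of maximum frequency (leftmost on ties).
    i = max(range(len(postings)), key=lambda j: postings[j][1])
    docid, freq = postings[i]
    # Smaller docids go to the left subtree, larger ones to the right.
    return Node(docid, freq,
                build_treap(postings[:i]),
                build_treap(postings[i + 1:]))


def inorder(node: Optional[Node]) -> List[int]:
    """Docids in left-root-right order."""
    if node is None:
        return []
    return inorder(node.left) + [node.docid] + inorder(node.right)


postings = [(1, 2), (3, 5), (4, 1), (7, 3)]   # hypothetical posting list
root = build_treap(postings)
print(root.docid, root.freq)    # the root carries the highest frequency
print(inorder(root))            # docids come back in sorted order
```

An in-order walk recovers the docid-sorted list, while every path from the root downward visits non-increasing frequencies.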
10. TREAPS - Compacted Tree
• More compact tree topology representation via a general tree
• Introduce a fake root node in the general tree
• The treap root is the first child of the fake root node
• The left child of a treap node becomes its first child in the
general tree
• The right child of a treap node becomes its next sibling
• Dashed lines show the original tree
• Represent the topology using a balanced parentheses
representation
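The mapping above can be sketched in a few lines. The example assumes a treap stored as nested tuples `(docid, left, right)` and emits one pair of parentheses per general-tree node, with the outermost pair standing for the fake root. The function names and the sample tree are hypothetical.

```python
def sibling_chain(node):
    """A node followed by its chain of right children, which are its
    next siblings in the general tree."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node[2]              # treap right child = next sibling
    return chain


def parens(first_child):
    """Balanced parentheses of the general-tree node whose first child is
    `first_child`. Call it with the treap root to encode the fake root."""
    s = "("
    for child in sibling_chain(first_child):
        s += parens(child[1])       # treap left child = first child
    return s + ")"


# Hypothetical treap: root 3, left child 1, right child 7, and 4 as the
# left child of 7. Tuples are (docid, left, right).
treap = (3, (1, None, None), (7, (4, None, None), None))
print(parens(treap))
```

A treap with n nodes yields 2(n + 1) parentheses, the extra pair coming from the fake root.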
11. TREAPS - Differential Encoding
• Calculate docid and frequency differences for each node
• For a node U with left child VL and right child VR:
• For VL:
• docid -> id(U) - id(VL)
• freq -> f(U) - f(VL)
• For VR:
• docid -> id(VR) - id(U)
• freq -> f(U) - f(VR)
• Store the differences instead of the actual values using
DACs (Directly Addressable Codes)
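As a rough illustration of these rules, the sketch below walks a treap in preorder and emits the differences for each node. It assumes nodes stored as tuples `(docid, freq, left, right)` and keeps the root's values as they are. The DAC layer is not modeled, and `diff_encode` is a hypothetical name.

```python
def diff_encode(node, parent=None, is_left=False, out=None):
    """Preorder list of (docid difference, frequency difference) pairs.

    parent is the (docid, freq) of the parent node, or None for the root,
    whose values are stored unchanged.
    """
    if out is None:
        out = []
    if node is None:
        return out
    docid, freq, left, right = node
    if parent is None:
        out.append((docid, freq))
    else:
        p_docid, p_freq = parent
        # Left children have smaller docids, right children larger ones,
        # so both differences are non-negative.
        d = p_docid - docid if is_left else docid - p_docid
        out.append((d, p_freq - freq))
    diff_encode(left, (docid, freq), True, out)
    diff_encode(right, (docid, freq), False, out)
    return out


# Hypothetical treap: (docid, freq, left, right).
treap = (3, 5,
         (1, 2, None, None),
         (7, 3, (4, 1, None, None), None))
print(diff_encode(treap))
```

Because of the treap invariants every emitted value is non-negative and typically small, which is what makes a variable-length code worthwhile.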
12. TREAPS - Improvements
• Use of a single DAC for both docids and frequencies
• Making the tree more balanced by choosing, among the nodes of
maximum frequency, the one closest to the center of the interval
• Omit all nodes having frequency below some threshold
13. TREAPS – Query Processing
• Given a query ‘Q’ composed of ‘q’ terms ‘t’ (t ∈ Q)
• Traverse the ‘q’ treaps, accumulating the weights of each term ‘t’
for each document
• Insert each document into a priority queue of size ‘k’
• If the queue size reaches ‘k+1’, remove the minimum
• Once the queue holds ‘k’ documents, use its minimum score as a
lower bound to discard documents that would otherwise be checked
during the ‘intersection’
• Since each treap node holds the maximum frequency of its
subtree, all nodes below a node that fails the bound can be
discarded
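The pruning rule can be illustrated with a deliberately simplified case: a single-term query (q = 1), where the score of a document is just its frequency. The sketch assumes nodes stored as tuples `(docid, freq, left, right)`. The name `top_k` and the visit counter are illustrative, and the multi-term traversal described above is not modeled.

```python
import heapq


def top_k(root, k):
    """Top-k (freq, docid) pairs of one treap, best first, together with
    the number of nodes that were actually visited."""
    if k <= 0:
        return [], 0
    heap = []                       # min-heap holding at most k candidates
    visited = 0

    def visit(node):
        nonlocal visited
        if node is None:
            return
        docid, freq, left, right = node
        # Lower bound: once k candidates are known, a node that cannot
        # beat the current minimum rules out its whole subtree, because
        # no descendant has a higher frequency.
        if len(heap) == k and freq <= heap[0][0]:
            return
        visited += 1
        heapq.heappush(heap, (freq, docid))
        if len(heap) > k:           # queue grew to k+1: drop the minimum
            heapq.heappop(heap)
        visit(left)
        visit(right)

    visit(root)
    return sorted(heap, reverse=True), visited


# Hypothetical treap: (docid, freq, left, right).
treap = (3, 5,
         (1, 2, None, None),
         (7, 3, (4, 1, None, None), None))
print(top_k(treap, 2))
```

With k = 2 the node for docid 4 is never visited: its frequency cannot exceed the current lower bound, so the traversal stops above it.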
14. Experiments & Results
• Experimental setup
• TREC GOV2 collection – 25.2 million documents, 32.8 million
terms, 4.9 billion postings
• Intel Xeon 2.4GHz / 96GB RAM / 12MB cache
• Compared against other implementations
• Block-Max
• Dual-Sorted
• Traditional docid-sorted inverted index
• Traditional frequency-sorted inverted index
15. Experiments & Results
• Using differential encoding alone
is not sufficient – ‘Treap w/o f0’
still has high space usage
• Omitting low-frequency items
from treaps offers the lowest space
usage (Treap)
• 22% less than Block-Max
• 18% less than Dual-Sorted
16. Experiments & Results
• Treaps are effective for small ‘k’ (k < 30):
3x faster for ranked intersection.
• Treaps are affected by ‘k’, unlike Block-Max
and Dual-Sorted.
• Explained by the number of documents
accessed: only 2.6% are accessed when
k=10, compared to a full intersection.
17. Experiments & Results
• For ranked union queries, the time
taken increases with k & q. Treaps
outperform Block-Max up to k=130
18. Conclusions
• New inverted index representation based on treaps:
an elegant and flexible tool
• Simultaneous representation of the docid and weight orderings
of a posting list
• Both docids & frequencies in differential form
• Significant gains in space and time
• About 20% less space / 3x faster