Faster and Smaller Inverted Indices with Treaps
SD Nelson 148232M 
DMI De Silva 148207R
Outline 
• Introduction 
• Basic Concepts 
• Related Work 
• Treap Usage 
• Experiments & Results 
• Conclusions 
Introduction 
• New representation of the inverted index, based on the treap data structure
• Two main challenges in modern information retrieval systems
  • Manage huge amounts of data
  • Return very precise results in response to user queries
• Two-stage ranking process
  • A fast and simple first stage extracts hundreds or thousands of candidates from billions of documents
  • A complex learned ranking then reduces the candidate set
• Focus: improving the efficiency of the first stage
Introduction 
• Two approaches for the first stage
  • Ranked intersection: Boolean intersection followed by computation of scores for the documents
  • Ranked union: usually computed in approximate form, avoiding a costly Boolean union
• New compressed representation for posting lists
  • Performs ranked intersections and (exact) ranked unions directly
  • Based on the treap data structure
  • Allows differential encoding of both document identifiers and weights
Basic Concepts 
• Inverted index: efficient processing of ranked and Boolean queries
• The index stores the vocabulary of the collection and, for each term, a posting list whose entries hold
  • Document identifier (docid)
  • Weight of the term in that document
• Compression idea: differentially encode either the document identifiers or the weights (whichever the list is sorted by)
• New in-memory posting list representation instead of traditional disk-based storage
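As a minimal sketch of the differential encoding idea above (not code from the paper), a docid-sorted posting list can be stored as the gaps between consecutive docids. The function names and the sample docids are illustrative assumptions.

```python
from itertools import accumulate


def encode_gaps(docids):
    """Turn a strictly increasing docid list into its first value plus gaps."""
    gaps = []
    previous = 0
    for docid in docids:
        gaps.append(docid - previous)
        previous = docid
    return gaps


def decode_gaps(gaps):
    """Rebuild the original docids by prefix-summing the gaps."""
    return list(accumulate(gaps))


postings = [3, 7, 8, 15, 40]          # hypothetical docids
gaps = encode_gaps(postings)
print(gaps)                # [3, 4, 1, 7, 25]
print(decode_gaps(gaps))   # [3, 7, 8, 15, 40]
```

The gaps are smaller numbers than the docids themselves, which is what lets a variable-length code spend fewer bits on them.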
Related Work 
• Two query processing strategies
  • Term-at-a-time (TAAT): processes one posting list after the other, from shortest to longest
  • Document-at-a-time (DAAT): processes the lists in parallel, looking for the same document in all of them
• Ranked intersection strategies employ a full Boolean intersection
  • followed by a post-processing step for ranking
• The strategies used in the paper for ranked union and intersection queries can be classified as DAAT
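To illustrate the DAAT idea of moving through all lists together, here is a minimal sketch of a plain Boolean intersection over docid-sorted lists. It is a textbook-style illustration under assumed names and sample data, not the paper's algorithm, and it does no ranking.

```python
def intersect_daat(lists):
    """Intersect docid-sorted lists by advancing all cursors together."""
    if not lists or any(not lst for lst in lists):
        return []
    cursors = [0] * len(lists)
    result = []
    while all(c < len(lst) for c, lst in zip(cursors, lists)):
        current = [lst[c] for c, lst in zip(cursors, lists)]
        candidate = max(current)
        if all(d == candidate for d in current):
            # every list is positioned on the same document
            result.append(candidate)
            cursors = [c + 1 for c in cursors]
        else:
            # advance every cursor that is behind the largest current docid
            cursors = [c + 1 if lst[c] < candidate else c
                       for c, lst in zip(cursors, lists)]
    return result


print(intersect_daat([[1, 3, 5, 8, 13], [2, 3, 8, 9, 13], [3, 8, 13, 21]]))
# [3, 8, 13]
```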
Related Work 
• Approach 1 of 2: Block-Max
  • Special-purpose structure for ranked intersections and unions
  • Sorts each list by increasing docid, cuts it into blocks, and stores the maximum weight of each block
  • Allows skipping whole blocks whose maximum possible contribution is too low, by comparing each block's maximum weight with a threshold
  • Obtains considerable performance gains over previous techniques for exact ranked unions and ranked intersections
• The new technique can be seen as a generalization of the Block-Max concept
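The block-skipping idea can be sketched on a single list as follows. This is a simplified illustration, not the Block-Max algorithm itself: the function names, the block size, and the sample postings are assumptions, and a real index would combine block maxima across several lists.

```python
def build_blocks(postings, block_size):
    """Cut a docid-sorted list of (docid, weight) pairs into blocks,
    remembering the maximum weight of each block."""
    blocks = []
    for start in range(0, len(postings), block_size):
        chunk = postings[start:start + block_size]
        blocks.append((max(w for _, w in chunk), chunk))
    return blocks


def scan_above(blocks, threshold):
    """Return postings with weight > threshold and the number of blocks skipped."""
    hits, skipped = [], 0
    for block_max, chunk in blocks:
        if block_max <= threshold:
            skipped += 1   # no posting in this block can qualify
            continue
        hits.extend((d, w) for d, w in chunk if w > threshold)
    return hits, skipped


postings = [(1, 2), (4, 1), (6, 9), (9, 3), (12, 1), (15, 2), (20, 7), (21, 1)]
blocks = build_blocks(postings, block_size=2)
print(scan_above(blocks, threshold=5))   # ([(6, 9), (20, 7)], 2)
```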
Related Work 
• Approach 2 of 2: Dual-Sorted inverted lists
  • Lists sorted by decreasing frequency, represented with a wavelet tree data structure
  • TAAT processing for approximate ranked unions, DAAT-like processing for (exact) ranked intersections
  • Can sort by both docids and weights simultaneously
  • Not aware of the frequencies until reaching the individual documents
• Treaps give an upper bound on the frequencies in the current interval
• Treaps use less space: Dual-Sorted cannot use differential encoding on docids
TREAPS - Basic Usage 
• Treap representation of a posting list
  • Search key (binary search tree order): document id
  • Priority (max-heap order): term frequency (weight)
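A minimal sketch of this representation, assuming a small in-memory posting list: the highest-frequency posting becomes the root, and the postings with smaller and larger docids form the left and right subtrees. The `Node` class, the function names, and the sample postings are hypothetical; the simple recursion is chosen for clarity, not efficiency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    docid: int
    freq: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_treap(postings):
    """Build a treap from (docid, freq) pairs sorted by docid.

    The highest-frequency posting becomes the root; smaller docids go to
    the left subtree and larger docids to the right, recursively.
    """
    if not postings:
        return None
    top = max(range(len(postings)), key=lambda i: postings[i][1])
    docid, freq = postings[top]
    return Node(docid, freq,
                build_treap(postings[:top]),
                build_treap(postings[top + 1:]))


def inorder(node):
    """An in-order traversal yields the docids in increasing order."""
    if node is None:
        return []
    return inorder(node.left) + [node.docid] + inorder(node.right)


postings = [(2, 1), (5, 4), (8, 2), (11, 7), (14, 3), (20, 1)]
root = build_treap(postings)
print(root.docid, root.freq)   # 11 7
print(inorder(root))           # [2, 5, 8, 11, 14, 20]
```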
TREAPS - Compacted Tree 
• More compact tree topology representation via a general tree
  • Introduce a fake root node for the general tree
  • The treap root is the first child of the fake root node
  • The left child of a treap node is its first child in the general tree
  • The right child of a treap node is its next sibling in the general tree
• Dashed lines in the figure show the original tree
• Represent the topology using a balanced parentheses representation
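The mapping above can be sketched as a short recursive function that writes one pair of parentheses per node. This is an illustration under assumptions: the `Node` class and function names are hypothetical, and the result is a plain string rather than the compact bit vector a real implementation would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    docid: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def siblings(node):
    """Parentheses for `node` followed by its chain of next siblings."""
    if node is None:
        return ""
    # '(' opens the node; its first child is the treap left child;
    # ')' closes it; the treap right child follows as the next sibling.
    return "(" + siblings(node.left) + ")" + siblings(node.right)


def balanced_parentheses(root):
    """Wrap the whole sequence in the fake root's pair of parentheses."""
    return "(" + siblings(root) + ")"


# Treap shape: 11 has left child 5 (with children 2 and 8)
# and right child 14 (whose right child is 20).
root = Node(11, Node(5, Node(2), Node(8)), Node(14, None, Node(20)))
print(balanced_parentheses(root))   # (((())())()())
```

Six treap nodes plus the fake root give seven pairs, so the topology takes two symbols per node.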
TREAPS - Differential Encoding 
• Compute docid and frequency differences for each node relative to its parent U (VL is the left child of U, VR the right child)
• For VL:
  • docid -> id(U) - id(VL)
  • freq -> f(U) - f(VL)
• For VR:
  • docid -> id(VR) - id(U)
  • freq -> f(U) - f(VR)
• Store the differences instead of the actual values, using DACs (Directly Addressable Codes)
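The formulas above can be sketched as a preorder pass that emits one (docid difference, frequency difference) pair per node. Keeping the root's absolute values is an assumption made for this illustration, as are the `Node` class and function name; the DAC encoding of the resulting numbers is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    docid: int
    freq: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def differences(node, parent=None, is_left=False, out=None):
    """Collect (docid_diff, freq_diff) per node in preorder.

    Assumption: the root keeps its absolute values; every other node
    stores non-negative differences relative to its parent U.
    """
    if out is None:
        out = []
    if node is None:
        return out
    if parent is None:
        out.append((node.docid, node.freq))
    elif is_left:
        # left child VL: id(U) - id(VL), f(U) - f(VL)
        out.append((parent.docid - node.docid, parent.freq - node.freq))
    else:
        # right child VR: id(VR) - id(U), f(U) - f(VR)
        out.append((node.docid - parent.docid, parent.freq - node.freq))
    differences(node.left, node, True, out)
    differences(node.right, node, False, out)
    return out


root = Node(11, 7,
            Node(5, 4, Node(2, 1), Node(8, 2)),
            Node(14, 3, None, Node(20, 1)))
print(differences(root))
# [(11, 7), (6, 3), (3, 3), (3, 2), (3, 4), (6, 2)]
```

Both differences are never negative because the search-key order bounds the docids and the max-heap order bounds the frequencies.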
TREAPS - Improvements 
• Use a single DAC for both docids and frequencies
• Make the tree more balanced: when several elements share the maximum frequency, choose the one closest to the center of the interval
• Omit from the treap all nodes whose frequency is below some threshold
TREAPS – Query Processing 
• Given a query ‘Q’ composed of ‘q’ terms ‘t’ (t ∈ Q)
• Traverse the ‘q’ treaps, accumulating for each document the weights of each term ‘t’
• Insert each document into a priority queue of size ‘k’
  • If the queue size reaches ‘k+1’, remove the minimum
• Once the queue holds ‘k’ documents, use its minimum score as a lower bound to discard documents that would otherwise be checked during the ‘intersection’
• Since each treap node holds the maximum frequency of its subtree, all nodes below a node that cannot beat the lower bound can be discarded
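The pruning step can be sketched on a single treap, which is a simplification: the procedure above traverses ‘q’ treaps together, while this illustration only retrieves the ‘k’ highest-frequency postings of one list. The `Node` class, the function name, and the sample treap are assumptions, and ‘k’ is assumed to be at least 1.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    docid: int
    freq: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def top_k(root, k):
    """Return the k highest-frequency (freq, docid) pairs and the number
    of nodes visited. Assumes k >= 1."""
    heap = []        # min-heap holding the current best k candidates
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        visited += 1
        # With k candidates held, the heap minimum is a lower bound:
        # a node that cannot beat it rules out its entire subtree.
        if len(heap) == k and node.freq <= heap[0][0]:
            continue
        heapq.heappush(heap, (node.freq, node.docid))
        if len(heap) == k + 1:
            heapq.heappop(heap)   # drop the minimum
        stack.append(node.right)
        stack.append(node.left)
    return sorted(heap, reverse=True), visited


root = Node(11, 7,
            Node(5, 4, Node(2, 1), Node(8, 2)),
            Node(14, 3, None, Node(20, 1)))
print(top_k(root, 2))   # ([(7, 11), (4, 5)], 5)
```

In this run the node with docid 20 is never visited, because its parent (frequency 3) already fails to beat the lower bound of 4.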
Experiments & Results 
• Experimental setup
  • TREC GOV2 collection: 25.2 million documents, 32.8 million terms, 4.9 billion postings
  • Intel Xeon 2.4GHz / 96GB RAM / 12MB cache
• Compared against other implementations
  • Block-Max
  • Dual-Sorted
  • Traditional docid-sorted inverted index
  • Traditional frequency-sorted inverted index
Experiments & Results 
• Differential encoding alone is not sufficient: ‘Treap w/o f0’ still has high space usage
• Omitting low-frequency items from the treaps gives the lowest space usage (Treap)
  • 22% less than Block-Max
  • 18% less than Dual-Sorted
Experiments & Results 
• Treaps are most effective for small ‘k’ (k < 30): 3x faster for ranked intersection
• Treap query time is affected by ‘k’, unlike Block-Max and Dual-Sorted
• Explained by the number of documents accessed: only 2.6% are accessed when k=10, compared to a full intersection
Experiments & Results 
• For ranked union queries, the time taken increases with ‘k’ and ‘q’; treaps outperform Block-Max up to k=130
Conclusions 
• New inverted index representation based on treaps, an elegant and flexible tool
• Simultaneously represents the docid and weight orderings of a posting list
• Stores both docids and frequencies in differential form
• Significant gains in space and time
  • About 20% less space / 3x faster
Thank You 
