Faster and Smaller Inverted Indices with Treaps 
SD Nelson 148232M 
DMI De Silva 148207R
Outline 
• Introduction 
• Basic Concepts 
• Related Work 
• Treap Usage 
• Experiments & Results 
• Conclusions 
Introduction 
• New representation of the inverted index, based on the 
treap data structure 
• Two main challenges in modern information retrieval 
systems 
• Manage huge amounts of data 
• Return very precise results in response to user queries 
• Two-stage ranking process 
• Fast and simple first stage extracts hundreds or thousands of candidates from billions of 
documents 
• Complex learned ranking then reduces the candidate set 
• Focus on improving the efficiency of the first stage 
Introduction 
• Two approaches for first stage 
• Ranked intersection 
Boolean intersection & computation of scores for documents 
• Ranked union 
Approximate form, avoiding a costly Boolean union 
• New compressed representation for posting lists 
• Performs ranked intersections & (exact) unions directly 
• Based on the Treap data structure 
• Allows differential encoding of both document identifiers and 
weights 
Basic Concepts 
• Inverted index for efficient processing of ranked and 
Boolean queries 
• The index stores the vocabulary of the collection and, for each term, a posting list of 
• Document identifier (docid) 
• Weight of the term 
• Compression is achieved by differentially encoding 
either the document identifiers or the weights 
• New in-memory posting list implementation instead of 
traditional on-disk storage. 
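As an illustration of differential encoding on a docid-sorted posting list, here is a minimal Python sketch. The function names encode_gaps and decode_gaps and the sample docids are made up for this example; they are not taken from the paper.

```python
from itertools import accumulate


def encode_gaps(docids):
    """Replace each docid in a strictly increasing list by its gap to the previous one."""
    previous = [0] + docids[:-1]
    return [d - p for p, d in zip(previous, docids)]


def decode_gaps(gaps):
    """Recover the original docids with a running sum over the gaps."""
    return list(accumulate(gaps))


postings = [3, 7, 8, 15, 42]   # hypothetical docids for one term
gaps = encode_gaps(postings)   # small numbers compress better than raw docids
restored = decode_gaps(gaps)
```

The gaps are small non-negative integers, which is what makes a variable-length code worthwhile.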
Related Work 
• Two query processing strategies 
• Term-at-a-time (TAAT) - one posting list after the other, shortest 
to longest 
• Document-at-a-time (DAAT) - lists are processed in parallel 
looking for the same document in all. 
• Ranked intersection strategies employ full Boolean 
intersection 
• followed by a post processing step for ranking 
• Strategies used for ranked union and intersection queries 
in the paper can be classified as DAAT 
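To make the DAAT idea concrete, the following Python sketch intersects several docid-sorted lists by moving one cursor per list in parallel. It is a plain Boolean intersection written for illustration only; the name daat_intersection and the sample lists are assumptions of this example, not code from the paper.

```python
def daat_intersection(lists):
    """Return the docids that appear in every docid-sorted list."""
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)          # one cursor per posting list
    result = []
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        current = [lst[p] for p, lst in zip(pos, lists)]
        target = max(current)       # no smaller docid can be in all lists
        if all(c == target for c in current):
            result.append(target)
            pos = [p + 1 for p in pos]
        else:
            # Advance only the cursors that are behind the target docid.
            pos = [p + (lst[p] < target) for p, lst in zip(pos, lists)]
    return result


common = daat_intersection([[1, 3, 5, 7], [3, 4, 5, 8], [2, 3, 5, 9]])
```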
Related Work 
• Two approaches: Block-Max 
• Special-purpose structure for ranked intersections and unions 
• Sorts each list by increasing docid, cuts it into blocks, and 
stores the maximum weight of each block 
• Allows skipping whole blocks whose maximum possible 
contribution is very low, by comparing each block's maximum weight with 
a threshold 
• Obtains considerable performance gains over the previous 
techniques for exact ranked unions/ ranked intersections 
• New technique can be seen as a generalization of the 
block max concept 
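The block-skipping idea can be sketched in a few lines of Python. This is a simplified, single-list illustration under assumed names (build_blocks, scan_above) and made-up postings; it is not the Block-Max implementation evaluated in the paper.

```python
def build_blocks(postings, block_size):
    """Cut a docid-sorted list of (docid, weight) pairs into blocks with their max weight."""
    blocks = []
    for start in range(0, len(postings), block_size):
        chunk = postings[start:start + block_size]
        blocks.append((max(w for _, w in chunk), chunk))
    return blocks


def scan_above(blocks, threshold):
    """Collect postings heavier than threshold, skipping blocks that cannot contain one."""
    hits, skipped = [], 0
    for block_max, chunk in blocks:
        if block_max <= threshold:
            skipped += 1            # the whole block is below the threshold
            continue
        hits.extend((d, w) for d, w in chunk if w > threshold)
    return hits, skipped


postings = [(1, 2), (4, 1), (6, 9), (9, 3), (12, 1), (15, 2), (20, 7), (21, 1)]
hits, skipped = scan_above(build_blocks(postings, block_size=2), threshold=5)
```

Here two of the four blocks are never opened, because their stored maximum already rules them out.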
Related Work 
• Two approaches: Dual-sorted inverted lists 
• Sorted by decreasing frequency, using a wavelet tree data 
structure 
• TAAT processing for approximate ranked unions, DAAT-like 
processing for (exact) ranked intersections. 
• Able to sort by both docids and weights simultaneously 
• Not aware of the frequencies until reaching the individual 
documents 
• Treaps give an upper bound to the frequencies in the 
current interval 
• Treap uses less space - Dual-Sorted can’t use differential 
encoding on docids. 
TREAPS - Basic Usage 
• Treap representation of a posting list. 
• Search key – document id 
• Max heap property – term frequency (weight) 
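A minimal Python sketch of this representation is shown below. It builds a treap from a docid-sorted posting list by placing the highest-frequency entry of each interval at the root. The names Node, build_treap, inorder and heap_ok, the leftmost tie-breaking rule, and the sample postings are assumptions of this example; the simple recursive construction is quadratic in the worst case and is meant only to show the two invariants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    docid: int                      # search key (binary search tree order)
    freq: int                       # priority (max-heap order)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_treap(postings, lo=0, hi=None):
    """Build a treap over postings[lo:hi], a docid-sorted list of (docid, freq)."""
    if hi is None:
        hi = len(postings)
    if lo >= hi:
        return None
    # The entry with the highest frequency becomes the root of this interval.
    top = max(range(lo, hi), key=lambda i: postings[i][1])
    docid, freq = postings[top]
    return Node(docid, freq,
                build_treap(postings, lo, top),
                build_treap(postings, top + 1, hi))


def inorder(node):
    """Docids in left-root-right order; sorted if the search-key invariant holds."""
    if node is None:
        return []
    return inorder(node.left) + [node.docid] + inorder(node.right)


def heap_ok(node):
    """True if no child has a higher frequency than its parent."""
    return node is None or all(
        child is None or (child.freq <= node.freq and heap_ok(child))
        for child in (node.left, node.right)
    )


postings = [(2, 1), (5, 4), (7, 2), (11, 9), (13, 3), (20, 5)]
root = build_treap(postings)
```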
TREAPS - Compacted Tree 
• More compact tree topology 
representation via a general tree 
• Introduce a fake root node in the 
general tree 
• The treap root is the first child of the fake 
root node 
• The left child of a treap node becomes its first 
child in the general tree 
• The right child of a treap node becomes its next 
sibling 
• Dashed lines show the original tree 
• Represent the topology using a 
balanced parentheses 
representation. 
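The mapping above can be sketched in Python as follows. The treap is written as nested (docid, left, right) tuples, and the function names and sample tree are assumptions of this example. The exact bit layout used in the paper may differ; the sketch only shows that the first-child / next-sibling mapping yields two parentheses per node plus two for the fake root.

```python
def siblings(node):
    """Parentheses for a treap node followed by its chain of next siblings."""
    if node is None:
        return ""
    _docid, left, right = node
    # The left child opens the list of children; the right child is the next sibling.
    return "(" + siblings(left) + ")" + siblings(right)


def balanced_parentheses(treap_root):
    """Wrap the whole sequence in the parentheses of the fake root."""
    return "(" + siblings(treap_root) + ")"


def count_nodes(node):
    return 0 if node is None else 1 + count_nodes(node[1]) + count_nodes(node[2])


# Hypothetical six-node treap: root 11 with children 5 (over 2 and 7) and 20 (over 13).
treap = (11,
         (5, (2, None, None), (7, None, None)),
         (20, (13, None, None), None))
bp = balanced_parentheses(treap)
```

Only the topology is stored here; docids and frequencies are kept separately.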
TREAPS - Differential Encoding 
• Calculate docid and frequency differences for each node 
• For a left child VL of node U, 
• docid -> id(U) – id(VL) 
• freq -> f(U) – f(VL) 
• For a right child VR of node U, 
• docid -> id(VR) – id(U) 
• freq -> f(U) – f(VR) 
• Store the differences instead of the actual values using 
DAC (Directly Addressable Codes) 
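These differences can be computed with a short Python sketch. The treap is given as nested (docid, freq, left, right) tuples, the root is stored with its absolute values, and the output order (preorder) is a choice made for this example. The name encode_differences and the sample tree are assumptions, and the DAC compression step itself is not shown.

```python
def encode_differences(node, parent=None, is_left=False, out=None):
    """Return preorder (docid_delta, freq_delta) pairs; the root keeps absolute values."""
    if out is None:
        out = []
    if node is None:
        return out
    docid, freq, left, right = node
    if parent is None:
        out.append((docid, freq))
    else:
        parent_docid, parent_freq = parent
        # Left children have smaller docids, right children larger ones,
        # and no child has a higher frequency, so every delta is >= 0.
        delta = parent_docid - docid if is_left else docid - parent_docid
        out.append((delta, parent_freq - freq))
    encode_differences(left, (docid, freq), True, out)
    encode_differences(right, (docid, freq), False, out)
    return out


# Hypothetical treap: (docid, freq, left, right)
treap = (11, 9,
         (5, 4, (2, 1, None, None), (7, 2, None, None)),
         (20, 5, (13, 3, None, None), None))
deltas = encode_differences(treap)
```

Because both treap invariants hold, none of the stored differences is negative.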
TREAPS - Improvements 
• Use a single DAC for both docids and frequencies 
• Make the tree more balanced by choosing the maximum 
frequency closest to the center of the interval 
• Omit all nodes having frequency below some threshold 
TREAPS – Query Processing 
• Given a query Q composed of q terms t (t ∈ Q) 
• Traverse the q treaps, accumulating the weights of each term t for 
each document 
• Insert each document into a priority queue of size k 
• If the queue grows to k+1, remove the minimum 
• Once the queue holds k documents, use its minimum score as a lower bound to 
discard documents that would otherwise be checked during the intersection 
• Since each treap node holds the maximum frequency of its subtree, all 
nodes below a node whose frequency is too low can be discarded 
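The pruning idea can be illustrated with a simplified Python sketch for a single-term query (q = 1), where a document's score is just its frequency. This is not the multi-term intersection or union algorithm of the paper; the function name top_k_single_term, the tuple layout (docid, freq, left, right), and the sample treap are assumptions of this example.

```python
import heapq


def top_k_single_term(root, k):
    """Return the k highest-frequency (freq, docid) pairs and the number of nodes visited."""
    heap = []                       # min-heap holding at most k (freq, docid) pairs
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        docid, freq, left, right = node
        # The queue minimum is a lower bound: no descendant can exceed freq,
        # so the whole subtree is discarded when freq cannot enter the queue.
        if len(heap) == k and freq <= heap[0][0]:
            continue
        visited += 1
        heapq.heappush(heap, (freq, docid))
        if len(heap) > k:
            heapq.heappop(heap)     # queue reached k+1: remove the minimum
        stack.append(right)
        stack.append(left)
    return sorted(heap, reverse=True), visited


treap = (11, 9,
         (5, 4, (2, 1, None, None), (7, 2, None, None)),
         (20, 5, (13, 3, None, None), None))
best, visited = top_k_single_term(treap, k=2)
```

In this small example only three of the six nodes are inserted into the queue; the others are cut off by the lower bound.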
Experiments & Results 
• Experimental setup 
• TREC GOV2 collection – 25.2 million documents, 32.8 million 
terms, 4.9 billion postings 
• Intel Xeon 2.4GHz / 96GB RAM / 12MB cache 
• Compared against other implementations 
• Block-Max 
• Dual-Sorted 
• Traditional docid-sorted inverted index 
• Traditional frequency-sorted inverted index 
Experiments & Results 
• Using differential encoding alone 
is not sufficient – ‘Treap w/o f0’ 
still has high space usage 
• Omitting low-frequency items 
from treaps offers the lowest space 
usage (Treap) 
• 22% less than Block-Max 
• 18% less than Dual-Sorted 
Experiments & Results 
• Treaps are effective for small k (k < 30): 
3x faster for ranked intersection. 
• Treaps are affected by k, unlike Block-Max 
and Dual-Sorted. 
• Explained by the number of documents 
accessed: only 2.6% are accessed when 
k=10 compared to the intersection. 
Experiments & Results 
• For ranked union queries, the time 
taken increases with k & q. Treaps 
outperform Block-Max up to k=130 
Conclusions 
• New inverted index representation based on treaps - 
an elegant and flexible tool 
• Simultaneous representation of docid / weight ordering 
of a posting list 
• Both docids & frequencies in differential form 
• Significant gains in space and time 
• 20% less space / 3x faster 
Thank You 

More Related Content

What's hot

Cassandra internals
Cassandra internalsCassandra internals
Cassandra internalsnarsiman
 
Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores
Efficient node bootstrapping for decentralised shared-nothing Key-Value StoresEfficient node bootstrapping for decentralised shared-nothing Key-Value Stores
Efficient node bootstrapping for decentralised shared-nothing Key-Value StoresHan Li
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... CassandraInstaclustr
 
Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)
Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)
Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)saumo
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...DataStax
 
Cassandra Tutorial
Cassandra Tutorial Cassandra Tutorial
Cassandra Tutorial Na Zhu
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraSoftwareMill
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...DataStax
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in CassandraShogo Hoshii
 
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...DataStax
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose themDatio Big Data
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamojbellis
 
"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic
"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic
"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at ElasticDataconomy Media
 
Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overviewPritamKathar
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architectureMarkus Klems
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandraNguyen Quang
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...DataStax
 

What's hot (20)

Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores
Efficient node bootstrapping for decentralised shared-nothing Key-Value StoresEfficient node bootstrapping for decentralised shared-nothing Key-Value Stores
Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... Cassandra
 
Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)
Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)
Presentation on Bigdata (Energy Efficient Failure Recovery in Hadoop)
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
 
Cassandra Tutorial
Cassandra Tutorial Cassandra Tutorial
Cassandra Tutorial
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
 
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose them
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamo
 
"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic
"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic
"Databases - The Choice is Yours", Philipp Krenn, Developer Advocate at Elastic
 
Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overview
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 

Viewers also liked

Byzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth ChaudhryByzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth ChaudhrySiddharth Chaudhry
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
Mekwa from the web to the iphone
Mekwa from the web to the iphoneMekwa from the web to the iphone
Mekwa from the web to the iphoneguestf0346b
 
Implementation Of Byzantine Fault Tolerant Algorithm on WSN
Implementation Of Byzantine Fault Tolerant Algorithm on WSNImplementation Of Byzantine Fault Tolerant Algorithm on WSN
Implementation Of Byzantine Fault Tolerant Algorithm on WSNShatadru Chattopadhyay
 
Stanford online courses 2014
Stanford online courses 2014Stanford online courses 2014
Stanford online courses 2014Neeraj Mandhana
 
DockerCon SF 2015: The Distributed System Toolkit
DockerCon SF 2015: The Distributed System ToolkitDockerCon SF 2015: The Distributed System Toolkit
DockerCon SF 2015: The Distributed System ToolkitDocker, Inc.
 

Viewers also liked (9)

Byzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth ChaudhryByzantine General Problem - Siddharth Chaudhry
Byzantine General Problem - Siddharth Chaudhry
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
Mekwa from the web to the iphone
Mekwa from the web to the iphoneMekwa from the web to the iphone
Mekwa from the web to the iphone
 
Algorithms
AlgorithmsAlgorithms
Algorithms
 
Implementation Of Byzantine Fault Tolerant Algorithm on WSN
Implementation Of Byzantine Fault Tolerant Algorithm on WSNImplementation Of Byzantine Fault Tolerant Algorithm on WSN
Implementation Of Byzantine Fault Tolerant Algorithm on WSN
 
Stanford online courses 2014
Stanford online courses 2014Stanford online courses 2014
Stanford online courses 2014
 
Data structures
Data structuresData structures
Data structures
 
DockerCon SF 2015: The Distributed System Toolkit
DockerCon SF 2015: The Distributed System ToolkitDockerCon SF 2015: The Distributed System Toolkit
DockerCon SF 2015: The Distributed System Toolkit
 
Byzantine Generals
Byzantine GeneralsByzantine Generals
Byzantine Generals
 

Similar to Faster and smaller inverted indices with Treaps Research Paper

Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Boris Yen
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache CassandraJacky Chu
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.pptDanBarcan2
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsCarl Lu
 
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017Alex Robinson
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networksbalmanme
 
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...balmanme
 
Incremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF GraphsIncremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF GraphsNikolaos Konstantinou
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta MapR Technologies
 
24-ad-hoc.ppt
24-ad-hoc.ppt24-ad-hoc.ppt
24-ad-hoc.pptsumadi26
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Main MeMory Data Base
Main MeMory Data BaseMain MeMory Data Base
Main MeMory Data BaseSiva Rushi
 
Training Slides: Basics 101: Introduction to Tungsten Replicator
Training Slides: Basics 101: Introduction to Tungsten ReplicatorTraining Slides: Basics 101: Introduction to Tungsten Replicator
Training Slides: Basics 101: Introduction to Tungsten ReplicatorContinuent
 

Similar to Faster and smaller inverted indices with Treaps Research Paper (20)

Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.ppt
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.ppt
 
Cassandra
CassandraCassandra
Cassandra
 
14-7810-20.ppt
14-7810-20.ppt14-7810-20.ppt
14-7810-20.ppt
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
 
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
 
Incremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF GraphsIncremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF Graphs
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
24-ad-hoc.ppt
24-ad-hoc.ppt24-ad-hoc.ppt
24-ad-hoc.ppt
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Main MeMory Data Base
Main MeMory Data BaseMain MeMory Data Base
Main MeMory Data Base
 
Training Slides: Basics 101: Introduction to Tungsten Replicator
Training Slides: Basics 101: Introduction to Tungsten ReplicatorTraining Slides: Basics 101: Introduction to Tungsten Replicator
Training Slides: Basics 101: Introduction to Tungsten Replicator
 

Recently uploaded

Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsDILIPKUMARMONDAL6
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...Amil Baba Dawood bangali
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptxNikhil Raut
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 

Recently uploaded (20)

Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teams
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 

Faster and smaller inverted indices with Treaps Research Paper

  • 1. Faster and Smaller Inverted Indices with Treaps SD Nelson 148232M DMI De Silva 148207R
  • 2. Outline • Introduction • Basic Concepts • Related Work • Treap Usage • Experiments & Results • Conclusions 2
  • 3. Introduction • New Representation of inverted index, based on Treap data structure • Two main challenges in Modern Information retrieval systems • Manage huge amounts of data • Return very precise results in response to user queries • Two-stage ranking process • fast and simple extract with hundreds/thousands from billions of documents • complex learned ranking to reduce candidate set • Focus on improving the efficiency of the first stage 3
  • 4. Introduction • Two approaches for first stage • Ranked intersection Boolean intersection & computation of scores for documents • Ranked union Approximate form, avoiding a costly Boolean union • New compressed representation for posting lists • Performs ranked intersections & (exact) unions directly • Based on the Treap data structure • Allows to differentially encode both document identifiers and weights 4
  • 5. Basic Concepts • Inverted index for efficient processing of ranked and Boolean queries • The index stores the vocabulary of the collection and, for each term, a posting list whose entries hold • Document identifier (docid) • Weight of the term in that document • Compression is typically achieved by differentially encoding either the document identifiers or the weights • New in-memory posting list implementation instead of traditional on-disk storage
  • 6. Related Work • Two query processing strategies • Term-at-a-time (TAAT): posting lists are processed one after the other, shortest to longest • Document-at-a-time (DAAT): lists are processed in parallel, looking for the same document in all of them • Ranked intersection strategies employ a full Boolean intersection • followed by a post-processing step for ranking • The strategies used for ranked union and intersection queries in the paper can be classified as DAAT
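The DAAT strategy on the slide above can be illustrated with a minimal sketch. It assumes posting lists are Python lists of (docid, weight) pairs sorted by increasing docid, and that a document's score is simply the sum of its weights; the function name `daat_intersection` is hypothetical and is not taken from the paper.

```python
def daat_intersection(lists):
    """Intersect docid-sorted posting lists, document at a time.

    Each list holds (docid, weight) pairs sorted by increasing docid.
    Returns (docid, summed weight) for documents present in every list.
    """
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)  # one cursor per posting list
    out = []
    while all(p < len(lst) for p, lst in zip(pos, lists)):
        current = [lst[p][0] for p, lst in zip(pos, lists)]
        target = max(current)
        if all(d == target for d in current):
            # All cursors agree on the same document: score it and move on.
            score = sum(lst[p][1] for p, lst in zip(pos, lists))
            out.append((target, score))
            pos = [p + 1 for p in pos]
        else:
            # Advance only the cursors that lag behind the largest docid.
            pos = [p + 1 if d < target else p for p, d in zip(pos, current)]
    return out


a = [(1, 2), (3, 5), (4, 1), (7, 3), (9, 4)]
b = [(3, 1), (4, 6), (7, 2), (8, 9), (9, 1)]
print(daat_intersection([a, b]))
```

A ranking step over the returned pairs (for example, keeping the k largest scores) would correspond to the post-processing step the slide mentions.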
  • 7. Related Work • Two approaches: Block-Max • Special-purpose structure for ranked intersections and unions • Sorts each list by increasing docid, cuts it into blocks, and stores the maximum weight for each block • Allows skipping whole blocks whose maximum possible contribution is too low, by comparing the block's maximum weight with a threshold • Obtains considerable performance gains over previous techniques for exact ranked unions and ranked intersections • The new technique can be seen as a generalization of the Block-Max concept
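The block-skipping idea can be sketched on a single list. This is a simplified illustration, not the Block-Max index itself: it only shows how a stored per-block maximum lets a scan skip blocks that cannot beat a threshold. The names `make_blocks` and `scan_above` and the block size are assumptions made for the example.

```python
def make_blocks(postings, block_size):
    """Cut a docid-sorted list of (docid, weight) into blocks with their max weight."""
    blocks = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        blocks.append((max(w for _, w in chunk), chunk))
    return blocks


def scan_above(blocks, threshold):
    """Return postings with weight > threshold and the number of blocks skipped."""
    hits, skipped = [], 0
    for block_max, chunk in blocks:
        if block_max <= threshold:
            skipped += 1  # no posting in this block can exceed the threshold
            continue
        hits.extend((d, w) for d, w in chunk if w > threshold)
    return hits, skipped


postings = [(1, 2), (3, 5), (4, 1), (7, 3), (9, 4), (12, 1)]
blocks = make_blocks(postings, block_size=2)
print(scan_above(blocks, threshold=3))
```

A treap can be read as applying the same bound recursively: every node's weight bounds the weights in its whole subtree, rather than in one fixed-size block.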
  • 8. Related Work • Two approaches: Dual-sorted inverted lists • Lists sorted by decreasing frequency, stored using a wavelet tree data structure • TAAT processing for approximate ranked unions, DAAT-like processing for (exact) ranked intersections • Able to sort by both docids and weights simultaneously • Not aware of the frequencies until it reaches the individual documents • Treaps give an upper bound on the frequencies in the current interval • Treaps use less space: Dual-Sorted cannot use differential encoding on docids
  • 9. TREAPS - Basic Usage • Treap representation of a posting list • Search key: document id • Max-heap property: term frequency (weight)
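A minimal sketch of this representation follows, assuming the posting list is given as (docid, freq) pairs sorted by increasing docid. The `Node` class and `build_treap` function are hypothetical names, and the naive recursive construction (quadratic in the worst case) is chosen for readability; it is not claimed to be the construction used in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    docid: int                      # search key (binary search tree order)
    freq: int                       # priority (max-heap order)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_treap(postings: List[Tuple[int, int]]) -> Optional[Node]:
    """Build a treap from (docid, freq) pairs sorted by increasing docid."""
    if not postings:
        return None
    # The root is the posting with the maximum frequency (leftmost on ties);
    # smaller docids go to the left subtree, larger ones to the right.
    i = max(range(len(postings)), key=lambda j: (postings[j][1], -j))
    docid, freq = postings[i]
    return Node(docid, freq, build_treap(postings[:i]), build_treap(postings[i + 1:]))


root = build_treap([(1, 2), (3, 5), (4, 1), (7, 3), (9, 4)])
print(root.docid, root.freq)
```

An in-order traversal of the result visits docids in increasing order, while every parent has a frequency at least as large as its children.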
  • 10. TREAPS - Compacted Tree • More compact representation of the tree topology via a general tree • Introduce a fake root node in the general tree • The treap root is the first child of the fake root node • The left child of a treap node becomes its first child in the general tree • The right child of a treap node becomes its next sibling • Dashed lines (in the slide's figure) show the original tree • The topology is represented using balanced parentheses
  • 11. TREAPS - Differential Encoding • Calculate the docid and frequency differences for each node U with left child VL and right child VR • For VL: • docid -> id(U) - id(VL) • freq -> f(U) - f(VL) • For VR: • docid -> id(VR) - id(U) • freq -> f(U) - f(VR) • Store the differences instead of the actual values using DACs (Directly Addressable Codes)
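The difference rules above can be checked with a small round-trip sketch. It assumes a treap node is a (docid, freq, left, right) tuple and that the root keeps its absolute values; the function names are hypothetical, and the DAC encoding of the resulting small integers is deliberately left out.

```python
def encode_diffs(node, parent=None, is_left=False):
    """Replace each node's (docid, freq) by its difference to the parent."""
    if node is None:
        return None
    docid, freq, left, right = node
    if parent is None:
        d_id, d_f = docid, freq      # assumption: the root is stored as-is
    else:
        p_id, p_f = parent
        # Left child: id(U) - id(VL); right child: id(VR) - id(U).
        d_id = p_id - docid if is_left else docid - p_id
        d_f = p_f - freq             # f(U) - f(V), non-negative by the heap property
    here = (docid, freq)
    return (d_id, d_f, encode_diffs(left, here, True), encode_diffs(right, here, False))


def decode_diffs(node, parent=None, is_left=False):
    """Invert encode_diffs by walking down from the root."""
    if node is None:
        return None
    d_id, d_f, left, right = node
    if parent is None:
        docid, freq = d_id, d_f
    else:
        p_id, p_f = parent
        docid = p_id - d_id if is_left else p_id + d_id
        freq = p_f - d_f
    here = (docid, freq)
    return (docid, freq, decode_diffs(left, here, True), decode_diffs(right, here, False))


def preorder(node):
    """List the (first, second) values of every node in preorder."""
    if node is None:
        return []
    a, b, left, right = node
    return [(a, b)] + preorder(left) + preorder(right)


treap = (3, 5, (1, 2, None, None), (9, 4, (7, 3, (4, 1, None, None), None), None))
print(preorder(encode_diffs(treap)))
```

Because docids follow the search-tree order and frequencies follow the heap order, every stored difference is non-negative, which is what makes a variable-length code such as DACs worthwhile.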
  • 12. TREAPS - Improvements • Use of a single DAC for both docids and frequencies • Making the tree more balanced by choosing, among the maximum-frequency elements, the one closest to the center of the interval • Omit all nodes with frequency below some threshold
  • 13. TREAPS – Query Processing • Given a query Q composed of q terms t (t ∈ Q) • Traverse the q treaps, accumulating the weight of each term t for each document • Insert each document into a priority queue of size k • If the queue reaches size k+1, remove the minimum • Once the queue holds k documents, use its minimum score as a lower bound to discard documents that would otherwise be checked during the intersection • Since each treap node holds the maximum frequency of its subtree, all nodes below a node whose frequency is too low can be discarded
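The interplay between the size-k priority queue and the treap's frequency bound can be sketched as follows. This is a simplified variant, not the synchronized traversal of the paper: it walks only the first treap, looks each candidate up in the other treaps, and assumes a document's score is the plain sum of its term frequencies. The names `build`, `lookup`, and `top_k_intersection` are hypothetical.

```python
import heapq


def build(postings):
    """Naive treap build: (docid, freq) sorted by docid -> (docid, freq, left, right)."""
    if not postings:
        return None
    i = max(range(len(postings)), key=lambda j: postings[j][1])
    d, f = postings[i]
    return (d, f, build(postings[:i]), build(postings[i + 1:]))


def lookup(node, docid):
    """Binary search by docid; return the frequency or None if absent."""
    while node is not None:
        d, f, left, right = node
        if docid == d:
            return f
        node = left if docid < d else right
    return None


def top_k_intersection(treaps, k):
    """Return (top-k (score, docid) pairs, number of nodes visited)."""
    if k <= 0 or not treaps or any(t is None for t in treaps):
        return [], 0
    first, others = treaps[0], treaps[1:]
    # A root holds the maximum frequency of its list, so this bounds the
    # contribution of all other terms to any document's score.
    rest_bound = sum(t[1] for t in others)
    heap = []       # min-heap of (score, docid), at most k entries
    visited = 0

    def visit(node):
        nonlocal visited
        if node is None:
            return
        d, f, left, right = node
        # Nothing in this subtree has a frequency above f: prune it if even
        # the best possible score cannot beat the current k-th score.
        if len(heap) == k and f + rest_bound <= heap[0][0]:
            return
        visited += 1
        freqs = [lookup(t, d) for t in others]
        if all(x is not None for x in freqs):
            heapq.heappush(heap, (f + sum(freqs), d))
            if len(heap) > k:
                heapq.heappop(heap)   # queue reached k+1: drop the minimum
        visit(left)
        visit(right)

    visit(first)
    return sorted(heap, reverse=True), visited


a = build([(1, 2), (3, 5), (4, 1), (7, 3), (9, 4)])
b = build([(3, 1), (4, 6), (7, 2), (8, 9), (9, 1)])
print(top_k_intersection([a, b], k=2))
```

With a smaller k the threshold rises sooner and more subtrees are pruned, which is consistent with the slides' observation that the treap approach is sensitive to k.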
  • 14. Experiments & Results • Experimental setup • TREC GOV2 collection: 25.2 million documents, 32.8 million terms, 4.9 billion postings • Intel Xeon 2.4 GHz / 96 GB RAM / 12 MB cache • Compared against other implementations • Block-Max • Dual-Sorted • Traditional docid-sorted inverted index • Traditional frequency-sorted inverted index
  • 15. Experiments & Results • Differential encoding alone is not sufficient: ‘Treap w/o f0’ still has high space usage • Omitting low-frequency items from the treaps gives the lowest space usage (Treap) • 22% less than Block-Max • 18% less than Dual-Sorted
  • 16. Experiments & Results • Treaps are effective for small k (k < 30): 3x faster for ranked intersection • Treaps are affected by k, unlike Block-Max and Dual-Sorted • Explained by the number of documents accessed: only 2.6% are accessed when k = 10, compared to a full intersection
  • 17. Experiments & Results • For ranked union queries, the time taken increases with k and q • Treaps outperform Block-Max up to k = 130
  • 18. Conclusions • New inverted index representation based on treaps, an elegant and flexible tool • Simultaneous representation of the docid and weight orderings of a posting list • Both docids and frequencies stored in differential form • Significant gains in space and time • About 20% less space / 3x faster