Reverted Indexing for Feedback and Expansion
Jeremy Pickens, Matthew Cooper, Gene Golovchinsky
Jeremy Pickens
Catalyst Repository Systems
Query-Document Duality has a long history
• Using queries to label documents
• Queries and documents as bipartite graph
– Used for random walks
– Used for partitioning
• Reverse Querying
Motivation – Three R’s
Retrievability
Reuse (Algorithmic)
Recall-Oriented Tasks
Our Key Contribution
We treat query result sets as unstructured
text “documents” -- and index them
Outline
• Reverted Documents
• Reverted Indexing
• Experimental Setup
• Results
– Effectiveness
– Efficiency
• Related Work
• Future Extensions
Reverted Document
• ID (Basis Query): Query Expression + Ranking Algorithm
• Body: Results (docid) + Results (score)
Basis Query (Reverted Document ID)
Query Expression                  Ranking Algorithm
giraffe                           BM25
cheetah                           BM25
gazelle                           BM25
gazelle                           Language Model
gazelle                           PL2 (Divergence from Randomness)
gazelle                           Y
gazelle                           B
gazelle                           G
fast cheetah                      BM25
cheetah AND NOT gazelle           Boolean
Latitude+Longitude of Zanzibar    Euclidean distance
Reverted Document Body
• Results (docid): canonical URL and/or docid
• Results (score):
1. Probability of Relevance
2. Cosine similarity
3. KL Divergence
4. Raw Rank
5. 1 or 0 (Boolean)
Result Set→Document Body
rank  docid  score  shift-scale  Ahn&Moffat
1     #415   0.82   10.0         10
2     #32    0.73   8.92         9
3     #63    0.62   7.57         8
4     #7     0.49   5.95         6
5     #56    0.35   4.24         4
6     #12    0.14   1.72         2
7     #108   0.12   1.36         1
8     #115   0.09   1.09         1
9     #42    0.08   1.0          1
10    #85    0.08   1.0          1
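The shift-scale and Ahn&Moffat columns of this table can be approximated with a linear rescaling followed by rounding. The sketch below is an assumption about how those columns relate, not the authors' implementation: `shift_scale` and `quantize` are hypothetical helper names, the target range [1, 10] is read off the table, and plain rounding stands in for whatever quantization was actually used. Because the slide prints scores rounded to two decimals, the intermediate values differ slightly from the printed shift-scale column.

```python
def shift_scale(scores, low=1.0, high=10.0):
    """Linearly map scores so that min -> low and max -> high (assumed scheme)."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        # Degenerate case: every result has the same score.
        return [high for _ in scores]
    span = s_max - s_min
    return [low + (high - low) * (s - s_min) / span for s in scores]


def quantize(values):
    """Turn rescaled values into integer weights (assumed: plain rounding)."""
    return [int(round(v)) for v in values]


# Retrieval scores as printed in the table above.
scores = [0.82, 0.73, 0.62, 0.49, 0.35, 0.14, 0.12, 0.09, 0.08, 0.08]
scaled = shift_scale(scores)
weights = quantize(scaled)
```

With these inputs the integer weights agree with the last column of the table, which is why rounding is used here as a stand-in.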
Result Set→Document Body
docid  Ahn&Moffat
#415   10
#32    9
#63    8
#7     6
#56    4
#12    2
#108   1
#115   1
#42    1
#85    1
<text>
415 415 415 415 415
415 415 415 415 415 32
32 32 32 32 32 32 32 32
63 63 63 63 63 63 63 63
7 7 7 7 7 7 56 56 56 56
12 12 108 115 42 85
</text>
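The conversion from the weighted result list to the text body can be sketched as follows: each docid is written out as many times as its integer weight, so that an ordinary text indexer sees the weight as a term frequency. The function name `reverted_body` is hypothetical and only illustrates the idea on the slide.

```python
def reverted_body(weighted_docids):
    """Repeat each docid `weight` times so term frequency encodes the weight."""
    tokens = []
    for docid, weight in weighted_docids:
        tokens.extend([str(docid)] * weight)
    return " ".join(tokens)


# (docid, integer weight) pairs taken from the table above.
results = [(415, 10), (32, 9), (63, 8), (7, 6), (56, 4),
           (12, 2), (108, 1), (115, 1), (42, 1), (85, 1)]
body = reverted_body(results)
tokens = body.split()
```

Any standard indexing pipeline can then treat `body` as ordinary whitespace-separated text.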
Reverted Document
<document>
<docid>
[gazelle : BM25]
</docid>
<text>
415 415 415 415 415 415 415 415 415 415
32 32 32 32 32 32 32 32 32 63 63 63 63 63
63 63 63 7 7 7 7 7 7 56 56 56 56 12 12 108
115 42 85
</text>
</document>
Questions?
Reverted Indexing
1. Choose a set of basis queries
2. For each basis query:
2.1. Execute the query, producing results up to cutoff depth k
2.2. Use the results to create a “reverted document”
2.3. Add the reverted document to the index
How basis queries are chosen (in these experiments): all singleton terms (unigrams) with df ≥ 2. The ranking algorithm for all basis queries is PL2.
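The indexing steps above can be sketched on a toy collection. Everything here is illustrative: the corpus is made up, `rank` is a stand-in that scores by raw term frequency (it is not PL2), and `build_reverted_index` is a hypothetical name. Only the overall shape follows the slide: pick unigram basis queries with df ≥ 2, run each to depth k, and store the result list as a reverted document keyed by the basis query.

```python
from collections import Counter

# Hypothetical toy collection; docids and texts are made up for illustration.
corpus = {
    1: "cheetah chases gazelle",
    2: "gazelle gazelle grazes",
    3: "cheetah sleeps",
    4: "giraffe grazes",
}


def rank(term, corpus, k):
    """Stand-in ranker: score = raw term frequency (NOT PL2)."""
    scored = []
    for docid, text in corpus.items():
        tf = text.split().count(term)
        if tf > 0:
            scored.append((docid, tf))
    # Highest score first; ties broken by docid for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def build_reverted_index(corpus, k=10, min_df=2):
    # Step 1: basis queries are all unigrams with df >= min_df.
    df = Counter()
    for text in corpus.values():
        df.update(set(text.split()))
    basis_queries = sorted(t for t, n in df.items() if n >= min_df)

    # Step 2: run each basis query to depth k and keep its results
    # as a reverted document (docid -> score), keyed by the basis query.
    reverted_index = {}
    for query in basis_queries:
        reverted_index[query] = dict(rank(query, corpus, k))
    return reverted_index


reverted_index = build_reverted_index(corpus)
```

In a real system the reverted documents would be handed to a regular inverted-index builder rather than kept in a dictionary.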
Standard Index
Reverted Index
Reverted Index Statistics
• Term Frequency = retrieval score of the docid
• Document Length = sum of retrieval scores of all docids retrieved by a basis query
• Document Frequency = number of basis queries that the docid was retrieved by
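The three statistics can be read directly off a reverted index. The sketch below assumes a hypothetical in-memory layout (basis query mapped to a dictionary of docid and retrieval score) with made-up values; the function names are illustrative only.

```python
# Hypothetical reverted index: basis query -> {docid: retrieval score}.
reverted_index = {
    "gazelle": {415: 10, 32: 9, 63: 8},
    "cheetah": {415: 4, 7: 6},
    "giraffe": {56: 2},
}


def term_frequency(index, basis_query, docid):
    """TF of a docid inside a reverted document = its retrieval score."""
    return index[basis_query].get(docid, 0)


def document_length(index, basis_query):
    """Length of a reverted document = sum of the scores it contains."""
    return sum(index[basis_query].values())


def document_frequency(index, docid):
    """DF of a docid = number of basis queries that retrieved it."""
    return sum(1 for results in index.values() if docid in results)
```

For example, `document_frequency(reverted_index, 415)` counts the two basis queries ("gazelle" and "cheetah") that retrieved docid 415.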
Experiment: Relevance Feedback
1. Run initial query using PL2 (Terrier platform)
[poaching wildlife preserves]
2. Judge top k documents for relevance
3. Select and weight query expansion terms, using one of:
– KL Divergence
– Bo1
– PL2 retrieval on the Reverted Index
4. Expand using top 500 terms (strongest baseline @ 500)
5. Run expanded query using PL2
6. Evaluate
Reverted Index→Expansion
1. Original query = [poaching wildlife preserves]
2. Reverted query = [#415 #56 #42 #85]
3. Expanded query = [poaching^2.0 wildlife^1.24 preserves^1.0 poachers^0.57 tsavo^0.56 leakey^0.41 tusks^0.39 …]
term       original  retrieved  weight
poaching   1         1.0        2.0
poachers   0         0.57       0.57
tsavo      0         0.56       0.56
leakey     0         0.41       0.41
tusks      0         0.39       0.39
elephants  0         0.34       0.34
wildlife   1         0.24       1.24
kws        0         0.2        0.2
…          …         …          …
preserves  1         0          1.0
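In every row shown, the final weight equals the original-query indicator plus the score retrieved from the reverted index. A minimal sketch of that combination follows, assuming simple addition is the rule (it is inferred from the table, not stated on the slide); `expand_query` is a hypothetical name and the retrieved scores are copied from the rows above.

```python
def expand_query(original_terms, retrieved_scores):
    """Combine original terms (weight 1.0 each) with reverted-index scores."""
    weights = {term: 1.0 for term in original_terms}
    for term, score in retrieved_scores.items():
        # Assumed rule: final weight = original weight + retrieved score.
        weights[term] = weights.get(term, 0.0) + score
    return weights


original = ["poaching", "wildlife", "preserves"]
# Scores returned by running the reverted query against the reverted index,
# as listed in the "retrieved" column.
retrieved = {
    "poaching": 1.0, "poachers": 0.57, "tsavo": 0.56, "leakey": 0.41,
    "tusks": 0.39, "elephants": 0.34, "wildlife": 0.24, "kws": 0.2,
}

weights = expand_query(original, retrieved)
# Render in the term^weight notation used on the slide, heaviest first.
expanded = " ".join(
    f"{term}^{weight:.2f}"
    for term, weight in sorted(weights.items(), key=lambda kv: -kv[1])
)
```

An original term that the reverted index does not return, such as "preserves", simply keeps its weight of 1.0.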
[Charts: MAP and %Change; Residual MAP and %Change]
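For readers unfamiliar with the two measures, the sketch below computes average precision for one topic and a residual variant. The slides do not define Residual MAP; the code assumes the usual residual-collection convention, in which documents already judged during feedback are removed from both the ranking and the relevant set before scoring. The function names and the example data are hypothetical. MAP is the mean of these per-topic values.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant docids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def residual_average_precision(ranking, relevant, judged):
    """AP after dropping already-judged documents (assumed residual convention)."""
    residual_ranking = [d for d in ranking if d not in judged]
    residual_relevant = set(relevant) - set(judged)
    return average_precision(residual_ranking, residual_relevant)


# Made-up example: five ranked docids, three relevant, two judged in feedback.
ranking = [415, 32, 63, 7, 56]
relevant = {415, 63, 56}
judged = {415, 32}

ap = average_precision(ranking, relevant)
residual_ap = residual_average_precision(ranking, relevant, judged)
```

The residual variant avoids rewarding a feedback method merely for re-ranking documents the user has already seen.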
Efficiency
• Two components to query expansion
– Selection and Weighting
– Execution of Expanded Query
[Charts: Avg Selection Time; Avg Execution Time]
Why would execution be faster?
Bo1 Term          Bo1 Score  Reverted_PL2 Term  Reverted_PL2 Score
leakey            0.88       poaching           1.00
poaching          0.74       poachers           0.56
wildlife          0.73       tsavo              0.56
kenya             0.52       leakey             0.41
ivory             0.47       tusks              0.39
elephants         0.46       elephants          0.34
elephant          0.32       wildlife           0.24
deer              0.30       kws                0.20
poachers          0.28       kez                0.17
conservation      0.27       ivory              0.14
species           0.23       jealousies         0.14
tusks             0.19       elephant           0.14
african           0.19       conservationists   0.09
namibia           0.19       kenya              0.09
animals           0.17       fiefdom            0.08
africa            0.15       safaris            0.04
zimbabwe          0.15       conservationist    0.03
tsavo             0.14       egos               0.01
kenyan            0.13       kierie             0.00
conservationists  0.12       aphrodisiacs       0.00
Bo1 Term          Bo1 DF  Reverted_PL2 Term  Reverted_PL2 DF
africa            20390   wildlife           2891
african           10636   kenya              1163
conservation      4298    ivory              1014
animals           3928    elephant           743
species           3479    elephants          356
wildlife          2891    poaching           331
kenya             1163    conservationists   293
ivory             1014    egos               269
zimbabwe          966     kez                173
deer              748     fiefdom            129
elephant          743     conservationist    125
namibia           483     poachers           117
kenyan            436     safaris            57
elephants         356     jealousies         56
poaching          331     tusks              42
conservationists  293     leakey             22
poachers          117     tsavo              12
tusks             42      aphrodisiacs       12
leakey            22      kws                9
tsavo             12      kierie             2
Average DF        2617    Average DF         391
Bo1 Term      Bo1 DF  Reverted_PL2 Term  Reverted_PL2 DF
los           46748   transportation     15262
angeles       45147   freeway            3506
metro         39849   tunnel             2643
safety        22569   disasters          1822
fire          21257   subway             805
foot          13120   extinguished       452
traffic       12410   rtd                227
feet          12034   caved              193
hollywood     7677    shoring            158
heat          6004    roper              147
rail          5747    timbers            98
downtown      5390    shored             97
engineers     4308    pilgrimages        73
freeway       3506    asphyxiation       71
disasters     1822    smolder            29
firefighters  1489    busway             22
subway        805     grouting           21
rtd           227     smoldered          19
timbers       98      lutgen             10
busway        22      droped             2
Average DF    12511   Average DF         1283
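The DF tables suggest one answer to the execution-time question: the reverted index tends to select rarer terms, and a term's document frequency is the length of its posting list. The sketch below recomputes the Average DF row of the table directly above and compares the total number of postings an expanded query would touch. Treating summed DF as a cost proxy is an assumption made here for illustration; actual execution time depends on the engine.

```python
# Document frequencies copied from the table directly above.
bo1_df = [46748, 45147, 39849, 22569, 21257, 13120, 12410, 12034, 7677, 6004,
          5747, 5390, 4308, 3506, 1822, 1489, 805, 227, 98, 22]
reverted_df = [15262, 3506, 2643, 1822, 805, 452, 227, 193, 158, 147,
               98, 97, 73, 71, 29, 22, 21, 19, 10, 2]


def average(values):
    return sum(values) / len(values)


bo1_avg = average(bo1_df)
reverted_avg = average(reverted_df)

# Assumed cost proxy: total postings read = sum of DFs of the query terms.
postings_ratio = sum(bo1_df) / sum(reverted_df)
```

Rounded to whole numbers, the two averages match the Average DF row, and under this proxy the Bo1 expansion touches roughly an order of magnitude more postings for these twenty terms.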
Related Work
Inspiration:
“Retrievability: An Evaluation Measure for Higher Order Information Access Tasks” -- Azzopardi and Vinay, CIKM 2008
Azzopardi & Vinay take a document-centric approach, examining whether documents (n)ever appear among the top k results to any query
Related Work
Query-Document Duality has a long history
– S. E. Robertson. “Query-Document Symmetry and Dual Models.” Journal of Documentation, 50(3), 1994
– B. Billerbeck, F. Scholer, H. E. Williams, and J. Zobel. “Query Expansion Using Associated Queries.” CIKM 2003
– N. Craswell and M. Szummer. “Random Walks on the Click Graph.” SIGIR 2007
– Reverse Querying / alerting (various)
Future Extensions
Basis queries
– Query expression may be arbitrarily complex
– Ranking function may be arbitrarily complex (remember: ranking function is a part of the basis query)
Reverted queries
– Best Match: [#415 #56 #42 #85]
– Boolean: (#415 AND #56) OR (#42 AND #85)
– Other query operators:
[SYNONYM(#415 #56) #42 #85]
[ORDERED(#415 #56) #42 #85]
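As an illustration of the Boolean form, the sketch below evaluates (#415 AND #56) OR (#42 AND #85) against a handful of reverted documents. The reverted documents are made up, scores are ignored, and `matches` is a hypothetical helper; a real system would pass the expression to the search engine's own Boolean operators.

```python
# Hypothetical reverted documents: basis query -> set of retrieved docids.
reverted_docs = {
    "poaching": {415, 56, 42, 85},
    "wildlife": {415, 56, 7},
    "tusks": {42, 85},
    "deer": {415, 12},
}


def matches(docids):
    """Evaluate (#415 AND #56) OR (#42 AND #85) on one reverted document."""
    return (415 in docids and 56 in docids) or (42 in docids and 85 in docids)


# Basis queries whose reverted document satisfies the Boolean reverted query.
hits = sorted(query for query, docids in reverted_docs.items() if matches(docids))
```

Here "deer" is rejected because it retrieved #415 without #56 and neither #42 nor #85.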
Motivation – Three R’s
Retrievability
Reuse (Algorithmic)
Recall-Oriented Tasks
Questions?

Mais conteúdo relacionado

Semelhante a Reverted Indexing for Expansion and Feedback

Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Rakebul Hasan
 
Wrokflow programming and provenance query model
Wrokflow programming and provenance query model  Wrokflow programming and provenance query model
Wrokflow programming and provenance query model Rayhan Ferdous
 
Machine learning assisted citation screening for Systematic Reviews - Anjani ...
Machine learning assisted citation screening for Systematic Reviews - Anjani ...Machine learning assisted citation screening for Systematic Reviews - Anjani ...
Machine learning assisted citation screening for Systematic Reviews - Anjani ...Institute of Information Systems (HES-SO)
 
Machine Learning Assisted Citation Screening for Systematic Reviews
Machine Learning Assisted Citation Screening for Systematic ReviewsMachine Learning Assisted Citation Screening for Systematic Reviews
Machine Learning Assisted Citation Screening for Systematic ReviewsAnjani Dhrangadhariya
 
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationLeveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationNattiya Kanhabua
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...Aravind Sesagiri Raamkumar
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdfHabtamu100
 
MediaEval 2016 - IR Evaluation: Putting the User Back in the Loop
MediaEval 2016 - IR Evaluation: Putting the User Back in the LoopMediaEval 2016 - IR Evaluation: Putting the User Back in the Loop
MediaEval 2016 - IR Evaluation: Putting the User Back in the Loopmultimediaeval
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalSujit Pal
 
Information retrieval 6 ir models
Information retrieval 6 ir modelsInformation retrieval 6 ir models
Information retrieval 6 ir modelsVaibhav Khanna
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesTrey Grainger
 
Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...Kento Aoyama
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisParis Data Engineers !
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document ClassificationAlessandro Benedetti
 
Exploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, LucidworksExploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, LucidworksLucidworks
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptxKtonNguyn2
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
 
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)Dimitris Kontokostas
 

Semelhante a Reverted Indexing for Expansion and Feedback (20)

Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
Wrokflow programming and provenance query model
Wrokflow programming and provenance query model  Wrokflow programming and provenance query model
Wrokflow programming and provenance query model
 
Machine learning assisted citation screening for Systematic Reviews - Anjani ...
Machine learning assisted citation screening for Systematic Reviews - Anjani ...Machine learning assisted citation screening for Systematic Reviews - Anjani ...
Machine learning assisted citation screening for Systematic Reviews - Anjani ...
 
Machine Learning Assisted Citation Screening for Systematic Reviews
Machine Learning Assisted Citation Screening for Systematic ReviewsMachine Learning Assisted Citation Screening for Systematic Reviews
Machine Learning Assisted Citation Screening for Systematic Reviews
 
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result DiversificationLeveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
Leveraging Dynamic Query Subtopics for Time-aware Search Result Diversification
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...A task-based scientific paper recommender system for literature review and ma...
A task-based scientific paper recommender system for literature review and ma...
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
MediaEval 2016 - IR Evaluation: Putting the User Back in the Loop
MediaEval 2016 - IR Evaluation: Putting the User Back in the LoopMediaEval 2016 - IR Evaluation: Putting the User Back in the Loop
MediaEval 2016 - IR Evaluation: Putting the User Back in the Loop
 
Using Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based RetrievalUsing Graph and Transformer Embeddings for Vector Based Retrieval
Using Graph and Transformer Embeddings for Vector Based Retrieval
 
Information retrieval 6 ir models
Information retrieval 6 ir modelsInformation retrieval 6 ir models
Information retrieval 6 ir models
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
 
Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...Reproducibility of computational workflows is automated using continuous anal...
Reproducibility of computational workflows is automated using continuous anal...
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document Classification
 
Exploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, LucidworksExploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, Lucidworks
 
IA3_presentation.pptx
IA3_presentation.pptxIA3_presentation.pptx
IA3_presentation.pptx
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
 
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
RDFUnit - Test-Driven Linked Data quality Assessment (WWW2014)
 

Último

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Último (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Reverted Indexing for Expansion and Feedback

  • 1. Reverted Indexing for Feedback and Expansion Jeremy Pickens, Matthew Cooper, Gene Golovchinsky
  • 2. Reverted Indexing for Feedback and Expansion Jeremy Pickens Catalyst Repository Systems
  • 3. Query-Document Duality has long history • Using queries to label documents • Queries and documents as bipartite graph – Used for random walks – Used for partitioning • Reverse Querying
  • 4. Motivation – Three R’s Retrievability Reuse (Algorithmic) Recall-Oriented Tasks
  • 5. Our Key Contribution We treat query result sets as unstructured text “documents” -- and index them
  • 6. Outline • Reverted Documents • Reverted Indexing • Experimental Setup • Results – Effectiveness – Efficiency • Related Work • Future Extensions
  • 8. Basis Query (Reverted Document ID) Query Expression Ranking Algorithm giraffe BM25 cheetah BM25 gazelle BM25 gazelle Language Model gazelle PL2 (Divergence from Randomness) gazelle Y gazelle B gazelle G fast cheetah BM25 cheetah AND NOT gazelle Boolean Latitude+Longitude of Zanzibar Euclidean distance
  • 9. Reverted Document Body Results (docid) Results (score) Canonical URL and/or docid 1. Probability of Relevance 2. Cosine similarity 3. KL Divergence 4. Raw Rank 5. 1 or 0 (Boolean)
  • 10. rank docid score shift-scale Ahn&Moffat 1 #415 0.82 10.0 10 2 #32 0.73 8.92 9 3 #63 0.62 7.57 8 4 #7 0.49 5.95 6 5 #56 0.35 4.24 4 6 #12 0.14 1.72 2 7 #108 0.12 1.36 1 8 #115 0.09 1.09 1 9 #42 0.08 1.0 1 10 #85 0.08 1.0 1 Result Set→Document Body
  • 11. Result Set→Document Body docid Ahn&Moffat #415 10 #32 9 #63 8 #7 6 #56 4 #12 2 #108 1 #115 1 #42 1 #85 1 <text> 415 415 415 415 415 415 415 415 415 415 32 32 32 32 32 32 32 32 32 63 63 63 63 63 63 63 63 7 7 7 7 7 7 56 56 56 56 12 12 108 115 42 85 </text>
  • 13. Reverted Document <document> <docid> [gazelle : BM25] </docid> <text> 415 415 415 415 415 415 415 415 415 415 32 32 32 32 32 32 32 32 32 63 63 63 63 63 63 63 63 7 7 7 7 7 7 56 56 56 56 12 12 108 115 42 85 </text> </document>
  • 15. Outline • Reverted Documents • Reverted Indexing • Experimental Setup • Results – Effectiveness – Efficiency • Related Work • Future Extensions
  • 16. Reverted Indexing 1. Choose a set of basis queries 2. For each basis query: 1. Execute each query, producing results up to cutoff depth k 2. Use results to create a “reverted document” 3. Add the reverted document to the index How basis queries are chosen (in these experiments): All singleton terms (unigrams) with df ≥ 2. Ranking algorithm for all basis queries is PL2.
  • 19.
  • 20. Reverted Index Statistics Retrieval Score of docid Term Frequency Sum of Retrieval Scores of all docids retrieved by a Basis Query Document Length Number of Basis Queries that docid was retrieved by Document Frequency
  • 21. Outline • Reverted Documents • Reverted Indexing • Experimental Setup • Results – Effectiveness – Efficiency • Related Work • Future Extensions
  • 22. Experiment: Relevance Feedback 1. Run initial query using PL2 (Terrier platform) [poaching wildlife preserves] 2. Judge top k documents for relevance 3. 4. Expand using top 500 terms (strongest baseline @ 500) 5. Run expanded query using PL2 6. Evaluate Use KL Divergence to select and weight query expansion terms Use Bo1 to select and weight query expansion terms Use PL2 retrieval on the Reverted Index to select and weight query expansion terms
  • 23. Reverted Index→Expansion 1. Original query = [poaching wildlife preserves] 2. Reverted query = [#415 #56 #42 #85] 3. Expanded query = [poaching^2.0 wildlife^1.24 preserves^1.0 poachers^0.57 tsavo^0.56 leakey^0.41 tusks^0.39 …] term original retrieved weight poaching 1 1.0 2.0 poachers 0 0.57 0.57 tsavo 0 0.56 0.56 leakey 0 0.41 0.41 tusks 0 0.39 0.39 elephants 0 0.34 0.34 wildlife 1 0.24 1.24 kws 0 0.2 0.2 … … … … preserves 1 0 1.0
  • 24. Outline • Reverted Documents • Reverted Indexing • Experimental Setup • Results – Effectiveness – Efficiency • Related Work • Future Extensions
  • 25. MAP
  • 29. Efficiency • Two components to query expansion – Selection and Weighting – Execution of Expanded Query
  • 32. Why would execution be faster?
  • 33. Bo1 Reverted_PL2 Term Score Term Score leakey 0.88 poaching 1.00 poaching 0.74 poachers 0.56 wildlife 0.73 tsavo 0.56 kenya 0.52 leakey 0.41 ivory 0.47 tusks 0.39 elephants 0.46 elephants 0.34 elephant 0.32 wildlife 0.24 deer 0.30 kws 0.20 poachers 0.28 kez 0.17 conservation 0.27 ivory 0.14 species 0.23 jealousies 0.14 tusks 0.19 elephant 0.14 african 0.19 conservationists 0.09 namibia 0.19 kenya 0.09 animals 0.17 fiefdom 0.08 africa 0.15 safaris 0.04 zimbabwe 0.15 conservationist 0.03 tsavo 0.14 egos 0.01 kenyan 0.13 kierie 0.00 conservationists 0.12 aphrodisiacs 0.00
  • 34. Term document frequency (DF): Bo1 vs. Reverted_PL2

      Bo1                       Reverted_PL2
      Term              DF      Term              DF
      africa            20390   wildlife          2891
      african           10636   kenya             1163
      conservation      4298    ivory             1014
      animals           3928    elephant          743
      species           3479    elephants         356
      wildlife          2891    poaching          331
      kenya             1163    conservationists  293
      ivory             1014    egos              269
      zimbabwe          966     kez               173
      deer              748     fiefdom           129
      elephant          743     conservationist   125
      namibia           483     poachers          117
      kenyan            436     safaris           57
      elephants         356     jealousies        56
      poaching          331     tusks             42
      conservationists  293     leakey            22
      poachers          117     tsavo             12
      tusks             42      aphrodisiacs      12
      leakey            22      kws               9
      tsavo             12      kierie            2
      Average DF        2617    Average DF        391
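
The "Average DF" figures on slide 34 can be checked directly from the listed values. The sketch below recomputes them. Reading a lower average DF as shorter posting lists to traverse, and therefore as one plausible answer to the question on slide 32, is an interpretation rather than something the slide states.

```python
# Document frequencies copied from slide 34, in the order listed.
bo1_df = [
    20390, 10636, 4298, 3928, 3479, 2891, 1163, 1014, 966, 748,
    743, 483, 436, 356, 331, 293, 117, 42, 22, 12,
]
reverted_pl2_df = [
    2891, 1163, 1014, 743, 356, 331, 293, 269, 173, 129,
    125, 117, 57, 56, 42, 22, 12, 12, 9, 2,
]


def average_df(dfs):
    """Mean document frequency of a list of expansion terms."""
    return sum(dfs) / len(dfs)


bo1_avg = round(average_df(bo1_df))                 # 2617, matching the slide
reverted_avg = round(average_df(reverted_pl2_df))   # 391, matching the slide
```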
  • 35. Term document frequency (DF): Bo1 vs. Reverted_PL2 (second example)

      Bo1                       Reverted_PL2
      Term              DF      Term              DF
      los               46748   transportation    15262
      angeles           45147   freeway           3506
      metro             39849   tunnel            2643
      safety            22569   disasters         1822
      fire              21257   subway            805
      foot              13120   extinguished      452
      traffic           12410   rtd               227
      feet              12034   caved             193
      hollywood         7677    shoring           158
      heat              6004    roper             147
      rail              5747    timbers           98
      downtown          5390    shored            97
      engineers         4308    pilgrimages       73
      freeway           3506    asphyxiation      71
      disasters         1822    smolder           29
      firefighters      1489    busway            22
      subway            805     grouting          21
      rtd               227     smoldered         19
      timbers           98      lutgen            10
      busway            22      droped            2
      Average DF        12511   Average DF        1283
  • 36. Outline • Reverted Documents • Reverted Indexing • Experimental Setup • Results – Effectiveness – Efficiency • Related Work • Future Extensions
  • 37. Related Work Inspiration: “Retrievability: An Evaluation Measure for Higher Order Information Access Tasks”, Azzopardi and Vinay, CIKM 2008. Azzopardi & Vinay take a document-centric approach, examining whether documents (n)ever appear among the top k results to any query
  • 38. Related Work Query-Document Duality has long history – S. E. Robertson. “Query-Document Symmetry and Dual models.” Journal of Documentation, 50(3),1994 – B. Billerbeck, F. Scholer, H. E. Williams, and J. Zobel. Query Expansion Using Associated Queries. CIKM '03 – N. Craswell and M. Szummer. Random walks on the Query-Click Graph. SIGIR 2007 – Reverse Querying / alerting (various)
  • 39. Future Extensions Basis queries – Query expression may be arbitrarily complex – Ranking function may be arbitrarily complex (remember: ranking function is a part of the basis query) Reverted queries – Best Match: [#415 #56 #42 #85] – Boolean: (#415 AND #56) OR (#42 AND #85) – Other query operators: [SYNONYM(#415 #56) #42 #85] [ORDERED(#415 #56) #42 #85]
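
The Boolean reverted query on slide 39 can be illustrated with set operations over reverted postings. This is a sketch only: the `postings` mapping (docid → set of basis queries that retrieved it) is hypothetical toy data, not data from the experiments.

```python
# Hypothetical reverted postings: docid -> basis queries that retrieved it.
postings = {
    "#415": {"poaching", "poachers", "tsavo"},
    "#56": {"poaching", "tusks"},
    "#42": {"wildlife", "kws"},
    "#85": {"kws", "leakey"},
}


def lookup(docid):
    """Return the set of basis queries that retrieved docid (empty if none)."""
    return postings.get(docid, set())


# Boolean reverted query: (#415 AND #56) OR (#42 AND #85)
matches = (lookup("#415") & lookup("#56")) | (lookup("#42") & lookup("#85"))
```

Here AND maps to set intersection and OR to set union, so `matches` holds the basis queries that retrieved both documents of at least one pair.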
  • 40. Motivation – Three R’s Retrievability Reuse (Algorithmic) Recall-Oriented Tasks

Editor's Notes

  1. My main difference: TF (= original basis query retrieval score, i.e. it’s tied to the actual performance of the system) and IDF (= just how many basis queries a docid was retrieved by). Other notes:
     – Craswell: bipartite click-through graphs for random walks (manual selection, no automatic retrievability); models aggregate behavior using random walks from a single starting document (no notion of indexing the collection).
     – Billerbeck et al.: build pseudo-documents comprised of previous queries (text only); limited to user queries; truncates result sets; degenerate DL and IDF statistics; no relevance scores; standard text search and query expansion in these pseudo-documents; just under 25% of documents in the collection had zero associations.
     – “Reverse Querying” / alerting: given a document(s?), find the queries that match it. In implementations I’ve seen, this is just a Boolean proposition (matches/doesn’t match), either as a whole or in the top-k. Even if it’s in the top-k, it’s a Boolean presence/absence. No sense of “tf”, or of “idf”.
  2. Docids become “terms”. Score becomes “term frequency”.
  3. And that’s it. We’re done. In IR, we know how to go forward from here!
  4. Add each reverted document to the index, all the while calculating global and local statistics, e.g. tf and idf.
  5. Why we call this a “reverted” index.
  6. We use PL2 as the “forward” ranking algorithm because we determined a priori that it yielded the best MAP, and we want as many relevant docs in the top k as possible. Note that everything else is held constant except for the term expansion and weighting.
  8. Running docids as queries has many similarities to running terms as queries. But they’re not completely similar! (phrase operators, for example?)
  9. The three R’s:
     – (Retrievability): The basis queries that we retrieve have already shown themselves capable of retrieving (at least some) relevant documents at a high rank! We don’t need to build fancy probabilistic models to know what the “best” terms are, because the basis queries have already “recorded” it.
     – (Reuse): We did not have to invent any new relevance feedback models. We simply used PL2 (or BM25 or tf.idf or LM or DFRee) to do the retrieval of basis queries. Combining terms, synonym operators, etc. are all possible.
     – (Recall): Evaluated by applying it to relevance feedback.
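
Note 2 above ("Docids become terms, score becomes term frequency") can be sketched as follows, using the Ahn & Moffat quantized scores from the "Result Set→Document Body" slide. The function name `reverted_body` is hypothetical, and repeating each docid token is only one way to encode term frequency as text.

```python
def reverted_body(quantized_results):
    """Turn (docid, integer score) pairs into a reverted document body.

    Each docid is repeated as many times as its quantized score, so a
    standard text indexer sees the score as that token's term frequency.
    """
    tokens = []
    for docid, frequency in quantized_results:
        tokens.extend([docid] * frequency)
    return " ".join(tokens)


# Quantized result set for one basis query, taken from the earlier slide.
results = [
    ("415", 10), ("32", 9), ("63", 8), ("7", 6), ("56", 4),
    ("12", 2), ("108", 1), ("115", 1), ("42", 1), ("85", 1),
]

body = reverted_body(results)
```

The resulting string would form the text of the reverted document whose ID is the basis query, e.g. [gazelle : BM25].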