Pickens, J., Cooper, M., and Golovchinsky, G. Reverted Indexing for Expansion and Feedback. In Proc. CIKM 2010, Toronto, Canada, ACM Press. See http://fxpal.com/?p=abstract&abstractID=581
3. Query-Document Duality has a long history
• Using queries to label documents
• Queries and documents as bipartite graph
– Used for random walks
– Used for partitioning
• Reverse Querying
4. Motivation – Three R’s
• Retrievability
• Reuse (Algorithmic)
• Recall-Oriented Tasks
5. Our Key Contribution
We treat query result sets as unstructured text “documents” -- and index them
8. Basis Query (Reverted Document ID)
Query Expression                  Ranking Algorithm
giraffe                           BM25
cheetah                           BM25
gazelle                           BM25
gazelle                           Language Model
gazelle                           PL2 (Divergence from Randomness)
gazelle                           Y
gazelle                           B
gazelle                           G
fast cheetah                      BM25
cheetah AND NOT gazelle           Boolean
Latitude+Longitude of Zanzibar    Euclidean distance
16. Reverted Indexing
1. Choose a set of basis queries
2. For each basis query:
   a. Execute the query, producing results up to cutoff depth k
   b. Use the results to create a “reverted document”
   c. Add the reverted document to the index
How basis queries are chosen (in these experiments): all singleton terms (unigrams) with df ≥ 2. The ranking algorithm for all basis queries is PL2.
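The indexing loop above can be sketched as follows. The `search` function here is a hypothetical stand-in for whatever retrieval function is used (PL2 in these experiments); a toy term-count scorer over an in-memory corpus keeps the example self-contained.

```python
from collections import defaultdict

def search(query, corpus, k):
    """Toy ranking stand-in: score = raw count of the query term in each doc."""
    scores = {doc_id: text.split().count(query) for doc_id, text in corpus.items()}
    ranked = [(d, s) for d, s in scores.items() if s > 0]
    ranked.sort(key=lambda ds: ds[1], reverse=True)
    return ranked[:k]

def build_reverted_index(corpus, basis_queries, k=10):
    """For each basis query, the result list becomes a 'reverted document':
    docids act as terms, retrieval scores act as term frequencies."""
    reverted = defaultdict(dict)   # docid -> {basis_query: score}
    for q in basis_queries:
        for doc_id, score in search(q, corpus, k):
            reverted[doc_id][q] = score
    return reverted

corpus = {
    "d1": "cheetah cheetah gazelle",
    "d2": "gazelle giraffe",
    "d3": "giraffe giraffe giraffe",
}
basis = ["cheetah", "gazelle", "giraffe"]   # unigram basis queries
index = build_reverted_index(corpus, basis)
print(index["d1"])   # {'cheetah': 2, 'gazelle': 1}
```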
20. Reverted Index Statistics
• Term Frequency = retrieval score of a docid
• Document Length = sum of retrieval scores of all docids retrieved by a basis query
• Document Frequency = number of basis queries that a docid was retrieved by
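The correspondence above can be made concrete with a toy reverted index. Each reverted document is keyed by its basis query; its “terms” are docids and its “term frequencies” are the retrieval scores (the index contents here are illustrative, not from the paper):

```python
# docid -> score postings, one reverted document per basis query
reverted_docs = {
    "cheetah": {"d1": 2.0},
    "gazelle": {"d1": 1.0, "d2": 1.0},
    "giraffe": {"d2": 1.0, "d3": 3.0},
}

def tf(docid, basis_query):
    """Term frequency = retrieval score of docid for that basis query."""
    return reverted_docs[basis_query].get(docid, 0.0)

def doc_length(basis_query):
    """Document length = sum of retrieval scores of all docids retrieved."""
    return sum(reverted_docs[basis_query].values())

def df(docid):
    """Document frequency = number of basis queries that retrieved docid."""
    return sum(1 for scores in reverted_docs.values() if docid in scores)

print(tf("d1", "cheetah"))    # 2.0
print(doc_length("giraffe"))  # 4.0
print(df("d1"))               # 2
```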
22. Experiment: Relevance Feedback
1. Run initial query using PL2 (Terrier platform) [poaching wildlife preserves]
2. Judge top k documents for relevance
3. Select and weight query expansion terms, using one of:
   – KL Divergence
   – Bo1
   – PL2 retrieval on the Reverted Index
4. Expand using top 500 terms (strongest baseline @ 500)
5. Run expanded query using PL2
6. Evaluate
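The reverted-index variant of the term-selection step can be sketched as follows: the judged-relevant docids are run as a query against the reverted index, and the best-matching reverted documents (i.e. basis queries, i.e. terms) become the expansion terms. The toy index contents and docids are illustrative, and a plain sum of retrieval scores stands in for the PL2 scoring used in the experiments.

```python
# basis query -> {docid: retrieval score} (hypothetical toy data)
reverted_docs = {
    "poaching": {"d1": 3.1, "d4": 1.2},
    "wildlife": {"d1": 2.0, "d2": 2.5, "d9": 0.4},
    "preserve": {"d2": 1.8, "d7": 0.9},
    "giraffe":  {"d7": 2.2},
}

def expansion_terms(relevant_docids, n_terms=500):
    """Rank basis queries by how strongly they retrieve the relevant docids."""
    scored = []
    for term, postings in reverted_docs.items():
        weight = sum(postings.get(d, 0.0) for d in relevant_docids)
        if weight > 0:
            scored.append((term, weight))
    scored.sort(key=lambda tw: tw[1], reverse=True)
    return scored[:n_terms]

# Suppose d1 and d2 were judged relevant:
print(expansion_terms({"d1", "d2"}))
# [('wildlife', 4.5), ('poaching', 3.1), ('preserve', 1.8)]
```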
37. Related Work
Inspiration: “Retrievability: An Evaluation Measure for Higher Order Information Access Tasks” -- Azzopardi and Vinay, CIKM 2008
Azzopardi & Vinay take a document-centric approach, examining whether documents (n)ever appear among the top k results to any query
38. Related Work
Query-Document Duality has a long history:
– S. E. Robertson. “Query-Document Symmetry and Dual Models.” Journal of Documentation, 50(3), 1994
– B. Billerbeck, F. Scholer, H. E. Williams, and J. Zobel. “Query Expansion Using Associated Queries.” CIKM 2003
– N. Craswell and M. Szummer. “Random Walks on the Query-Click Graph.” SIGIR 2007
– Reverse Querying / alerting (various)
39. Future Extensions
Basis queries
– Query expression may be arbitrarily complex
– Ranking function may be arbitrarily complex (remember: the ranking function is part of the basis query)
Reverted queries
– Best Match: [#415 #56 #42 #85]
– Boolean: (#415 AND #56) OR (#42 AND #85)
– Other query operators: [SYNONYM(#415 #56) #42 #85], [ORDERED(#415 #56) #42 #85]
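A Boolean reverted query can be evaluated exactly like a Boolean text query, just with docids as the “terms” and basis queries as the results. The sketch below assumes set-valued reverted documents; the index contents and docids are hypothetical.

```python
# basis query -> set of docids in its reverted document (toy data)
reverted_docs = {
    "cheetah": {"#415", "#56"},
    "gazelle": {"#415", "#42"},
    "giraffe": {"#42", "#85"},
}

def matches(docids_needed):
    """Basis queries whose reverted document contains ALL the given docids."""
    return {q for q, ds in reverted_docs.items() if docids_needed <= ds}

# Evaluate (#415 AND #56) OR (#42 AND #85): each AND clause is a subset
# test, and OR is a set union over the matching basis queries.
result = matches({"#415", "#56"}) | matches({"#42", "#85"})
print(sorted(result))   # ['cheetah', 'giraffe']
```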
40. Motivation – Three R’s
• Retrievability
• Reuse (Algorithmic)
• Recall-Oriented Tasks
My main difference: TF (= the original basis-query retrieval score, i.e. it’s tied to the actual performance of the system) and IDF (= simply how many basis queries a docid was retrieved by).
Other notes:
Craswell:
– Bipartite click-through graphs for random walks (manual selection, no automatic retrievability)
– Models aggregate behavior using random walks from a single starting document (no notion of indexing the collection)
Billerbeck et al.:
– Build pseudo-documents comprised of previous queries (text only)
– Limited to user queries
– Truncates result sets; degenerate DL and IDF statistics
– No relevance scores
– Standard text search and query expansion over these pseudo-documents
– Just under 25% of documents in the collection had zero associations
“Reverse Querying” / alerting:
– Given one or more documents, find the queries that match them
– In the implementations I’ve seen, this is just a Boolean proposition (matches/doesn’t match), either as a whole or in the top k
– Even if it’s in the top k, it’s a Boolean presence/absence
– No sense of “tf” or of “idf”
Docids become “terms”. Score becomes “term frequency”.
And that’s it. We’re done. In IR, we know how to go forward from here!
Add each reverted document to the index, all the while calculating global and local statistics, e.g. tf and idf.
This is why we call it a “reverted” index.
PL2 as the “forward” ranking algorithm, because we determined a priori that it yielded the best MAP… and we want as many relevant docs in the top k as possible.
Note that everything else is held constant, except for the term expansion and weighting.
There are many similarities between running docids as queries and running terms as queries, but they’re not completely the same (phrase operators, for example).
(Retrievability): The basis queries that we retrieve have already shown themselves capable of retrieving (at least some) relevant documents at high rank. We don’t need to build fancy probabilistic models to know what the “best” terms are, because the basis queries have already “recorded” it.
…
(Reuse): We did not have to invent any new relevance feedback models. We simply used PL2 (or BM25 or tf.idf or LM or DFRee) to do the retrieval of basis queries. Combining terms, synonym operators, etc. are all possible.
…
(Recall): Evaluated by applying it to relevance feedback.