Slides for the paper "Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval" by Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux presented at SIGIR2012
1. Combining Inverted Indices and
Structured Search for
Ad-hoc Object Retrieval
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab - University of Fribourg - Switzerland
{firstname.lastname}@unifr.ch
SIGIR2012 - Monday, August 13th 2012
2.
Motivation
• Many search engine queries are about entities.
• Increasingly large amounts of entity data online.
• Often represented as huge graphs
• e.g., the LOD cloud, the Google Knowledge Graph, the Facebook social graph.
• Globally unique entity identifiers (e.g., URIs).
• Hard to discover and/or memorize.
3.
Ad-hoc Object Retrieval
(informal definition)
• “Given the description of an entity, give me back its identifier”
• Description can be keywords (e.g., “Harry Potter”).
• More than one identifier per entity (e.g., dbpedia +
freebase).
• How to evaluate returned results?
4. Ad-hoc Object Retrieval
(formal definition by Pound et al.)
• Input: unstructured query q
and data graph G.
• Output: ranked list of
resource identifiers (URIs)
from G.
• Evaluation: results (URIs)
scored by a judge with
access to all the information
contained in or linked to the
resource.
• Standard collections exist.
1. http://ex.plode.us/tag/harry+potter
2. http://www.vox.com/explore/interests/harry%20potter
3. http://www.flickr.com/groups/harrypotterandthedeathlyhallows/
4. http://harrypotter.wizards.pro/
http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows
http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk
http://harrypotter.wizards.pro/
http://ebiquity.umbc.edu/person/html/Harry/Chen/
http://dbpedia.org/resource/Ceramist
5.
Overview of Our Solution
• Inverted indices on the LOD Cloud... and an RDF store containing the data.
• Simple NLP techniques, autocompletion, pseudo-relevance feedback.
• Ranking with BM25 and BM25F.
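As a reference for the ranking step, BM25 scores a document for a query as a sum of idf-weighted, length-normalized term contributions. A minimal sketch (the toy corpus and the default parameters k1=1.2, b=0.75 are illustrative, not taken from the paper):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Minimal BM25: score one document (a list of tokens) against a
    keyword query, using the whole corpus for idf and length stats."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # a term unseen in the corpus contributes nothing
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

# Toy corpus: token lists standing in for indexed entity descriptions.
corpus = [["harry", "potter", "film"], ["ceramist"], ["harry", "chen"]]
print(bm25_score(["harry", "potter"], corpus[0], corpus))
```

BM25F generalizes this by accumulating a weighted term frequency across fields (here URI, Label, Attributes) before applying the same saturation.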
6.
A Simple Example
Query: “SIGIR”
How to build the II?
II + ranking function(s):
1. http://dbpedia.org/…/SIGIR
2. http://dbpedia.org/…/IRAQ
3. …
NLP techniques, query auto-completion, pseudo-relevance feedback.
Graph traversals: which properties should we follow? How to rank new results?
Final ranking function:
1. http://dbpedia.org/…/SIGIR
2. http://freebase.com/…/sigir
3. http://dbpedia.org/…/IRAQ
…
7.
Outline
1. Inverted Indices
2. Graph Based Entity Search
1. Object Properties vs Datatype Properties
2. Properties to Follow
3. Experimental Results
1. Experimental Setting
2. IR Techniques: Experimental Results
3. Evaluation of the Hybrid Approaches
4. Overhead of the Graph Traversal
8.
1. Inverted Indices (IIs)
• Simple inverted index:
• index all literals attached to each node in the input graph.
• “movie” → http://…types/film
• Structured inverted index with three fields:
• URI - tokenized URIs identifying entities.
• Label - manually selected datatype properties giving textual descriptions of the entity (e.g., label, title, name, full-name, …).
• Attributes - all other literals.
BM25(F), query auto-completion, query extension, pseudo-relevance feedback.
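The three-field structured index above can be sketched with plain posting-list dictionaries; the tokenizer and the sample entity are illustrative assumptions (scoring on top of these lists would use BM25(F) as described):

```python
import re
from collections import defaultdict

def tokenize(text):
    """Crude tokenizer: lowercase, split on non-alphanumeric characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

# One posting-list map per field: term -> set of entity URIs.
index = {field: defaultdict(set) for field in ("uri", "label", "attributes")}

def add_entity(uri, label_literals, other_literals):
    """Index an entity under the three fields of the structured II."""
    for tok in tokenize(uri):
        index["uri"][tok].add(uri)
    for lit in label_literals:
        for tok in tokenize(lit):
            index["label"][tok].add(uri)
    for lit in other_literals:
        for tok in tokenize(lit):
            index["attributes"][tok].add(uri)

add_entity(
    "http://dbpedia.org/resource/SIGIR",
    ["SIGIR"],
    ["Special Interest Group on Information Retrieval"],
)
print(sorted(index["uri"]["sigir"]))
```

The simple index corresponds to collapsing all three maps into one.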
9.
2. Graph-Based Entity Search
• Take the top-N docs from the IR results.
• Follow links/properties p1, p2, …, p_m and get new URIs.
• Filter new results by text similarity wrt the user query: keep e if sim(e, q) > τ.
• Assign scores to the new URIs and merge them into re-ranked results.
Scoring functions:
• count sim > τ
• avg sim > τ
• sum sim
• avg sim
• sum BM25 - ε
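A minimal sketch of the traversal step, under illustrative assumptions (a toy adjacency-list graph, Jaccard token overlap as sim, and the "count sim > τ" scoring variant; the URIs and labels are made up):

```python
def sim(entity_text, query):
    """Illustrative text similarity: Jaccard overlap of token sets."""
    a, b = set(entity_text.lower().split()), set(query.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def graph_rerank(ir_results, graph, labels, query, top_n=3, tau=0.3):
    """Follow properties from the top-N IR results, keep neighbours whose
    label is similar enough to the query, and score each new URI by how
    many seed entities led to it (the 'count sim > tau' variant)."""
    seeds = [uri for uri, _score in ir_results[:top_n]]
    counts = {}
    for seed in seeds:
        for _prop, neighbour in graph.get(seed, []):
            if sim(labels.get(neighbour, ""), query) > tau:
                counts[neighbour] = counts.get(neighbour, 0) + 1
    # Rank the new URIs by how often they were reached.
    return sorted(counts.items(), key=lambda kv: -kv[1])

ir = [("db:SIGIR", 2.0), ("db:IRAQ", 1.5)]
graph = {
    "db:SIGIR": [("owl:sameAs", "fb:sigir")],
    "db:IRAQ": [("wikilink", "fb:sigir"), ("wikilink", "db:Baghdad")],
}
labels = {"fb:sigir": "SIGIR conference", "db:Baghdad": "Baghdad"}
print(graph_rerank(ir, graph, labels, "SIGIR", top_n=2))
```

The other scoring variants listed above would replace the counter with an average or sum of the similarity (or BM25) scores.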
10.
2.1. Object Properties vs Datatype Properties
• Object properties:
• connect different entities
• let us explore the whole graph
• Datatype properties:
• give additional info about entities
• explore just the neighborhood of a node
11.
2.2. Properties to Follow
• The RDF graph is queried with SPARQL.
• Scope-1 queries vs Scope-2 queries.
• The set of predicates to follow is selected using:
• common sense (e.g., sameAs)
• statistics from the data
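As an illustration of the scope-1 vs scope-2 distinction (the predicate whitelist and the exact query shapes are assumptions, not taken from the paper; SPARQL 1.1 property-path alternation expresses the selected predicates), two helpers that build the query strings:

```python
# Illustrative predicate whitelist; the paper selected predicates by
# common sense (e.g., sameAs) and by statistics over the data.
PREDICATES = ["owl:sameAs", "rdfs:seeAlso"]

def scope1_query(seed_uri, predicates):
    """Scope-1: entities one hop away via any selected predicate."""
    alts = "|".join(predicates)
    return f"SELECT ?o WHERE {{ <{seed_uri}> ({alts}) ?o . }}"

def scope2_query(seed_uri, predicates):
    """Scope-2: entities two hops away along the selected predicates."""
    alts = "|".join(predicates)
    return f"SELECT ?o WHERE {{ <{seed_uri}> ({alts})/({alts}) ?o . }}"

print(scope1_query("http://dbpedia.org/resource/SIGIR", PREDICATES))
```

Scope-2 queries reach more candidates at the price of precision, which is the trade-off the experiments compare.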
14.
3.1. Experimental Setting
• SemSearch 2010 and 2011 test sets: 92 and 50 queries, respectively.
• Billion Triple Challenge 2009 (BTC2009) dataset: 1.3 billion RDF triples crawled from the LOD cloud.
• Evaluation of systems with depth-10 pooling by means of crowdsourcing.
• Measures taken into consideration: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), early Precision (P@10).
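For reference, the per-query versions of the three measures can be computed as follows (binary relevance is assumed for AP and P@10, graded relevance for NDCG; the example judgments are made up):

```python
import math

def average_precision(rel_flags, total_relevant):
    """rel_flags: 0/1 relevance of each returned result, in rank order."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / total_relevant if total_relevant else 0.0

def precision_at(rel_flags, k=10):
    """Early precision: fraction of relevant results in the top k."""
    return sum(rel_flags[:k]) / k

def ndcg(gains, k=10):
    """gains: graded relevance of each returned result, in rank order."""
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

print(average_precision([1, 0, 1], total_relevant=2))
```

MAP is then the mean of `average_precision` over all queries in the test set.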
15.
Completing Relevance Judgments by Crowdsourcing
• We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk.
• To be fair, we used the same design and settings that were used for the AOR task of SemSearch.
18.
3.4. Overhead of the Graph Traversal
• Time in milliseconds needed for each part of the hybrid approaches.
• Measurements taken on a single machine with a cold cache.
• Surprisingly small overhead (17% for the best results).
19.
Conclusions
• AOR = “Given the description of an entity, give me back its identifier”.
• Disappointing results using simple IR techniques for the AOR task.
• Hybrid system for AOR: classic IR techniques combined with a structured database storing graph data.
• Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over the BM25 baseline).
• For the best configuration found, the overhead caused by the graph traversal is limited (17% more than running the chosen baseline).
20.
Thank you for your attention
• You can find the new relevance judgments at
http://diuf.unifr.ch/xi/HybridAOR.
• More info at www.exascale.info.
• In the following days you’ll find our paper, this presentation,
and the new crowdsourced relevance judgements at
www.exascale.info/AOR.
Editor's Notes
A lot of search engine queries are about entities (more than half); there is the task...
Tell that literals are strings attached to some node.
Just the only scoring function.
Tell what sameAs is.
The data is a graph; the inverted index gives us an entry point and then we walk it.
TREC-like collection/test set, depth-10 pooling: everyone here knows it!
Say that the simple index is “or”; UL, LA, ULA are “and”. Say disappointment with the first result with BM25: we tried to do just the II but it didn't work, and then we decided to go for the graph… NO GOOGLE.
Compare just s_1 with s_2 (lower recall but higher precision).
s2_3 doesn't follow wikilinks. Indices and database were resident in the machine. We didn't focus on efficiency.