Combining Inverted Indices and
Structured Search for
Ad-hoc Object Retrieval
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab - University of Fribourg - Switzerland
{firstname.lastname}@unifr.ch
SIGIR2012 - Monday, August 13th 2012
Motivation
• Many search engine queries are about entities.
• An increasingly large amount of entity data is available online.
• It is often represented as huge graphs
• e.g., the LOD cloud, the Google Knowledge Graph, the Facebook social graph.
• Entities have globally unique identifiers (e.g., URIs).
• These are hard to discover and/or memorize.
Ad-hoc Object Retrieval
(informal definition)
• “Given the description of an entity, give me back its identifier”
• Description can be keywords (e.g., “Harry Potter”).
• An entity can have more than one identifier (e.g., DBpedia + Freebase).
• How to evaluate returned results?
Ad-hoc Object Retrieval
(formal definition by Pound et al.)
• Input: unstructured query q and data graph G.
• Output: ranked list of resource identifiers (URIs) from G.
• Evaluation: results (URIs) scored by a judge with access to all the information contained in or linked to the resource.
• Standard collections exist.
[Figure: example result URIs]
http://ex.plode.us/tag/harry+potter
http://www.vox.com/explore/interests/harry%20potter
http://www.flickr.com/groups/harrypotterandthedeathlyhallows/
http://harrypotter.wizards.pro/

http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows
http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk
http://harrypotter.wizards.pro/
http://ebiquity.umbc.edu/person/html/Harry/Chen/
http://dbpedia.org/resource/Ceramist
Overview of Our Solution
• Inverted indices on the LOD cloud...
• ...and an RDF store containing the data.
• Simple NLP techniques, auto-completion, pseudo-relevance feedback.
• BM25, BM25F.
A Simple Example
[Figure: the query “SIGIR” passes through NLP techniques, query auto-completion, and pseudo-relevance feedback, then through the II + ranking function(s), then through graph traversals and the final ranking function.]
After the II + ranking function(s):
1. http://dbpedia.org/…/SIGIR
2. http://dbpedia.org/…/IRAQ
3. …
After graph traversals and the final ranking function:
1. http://dbpedia.org/…/SIGIR
2. http://freebase.com/…/sigir
3. http://dbpedia.org/…/IRAQ
…
Questions raised:
• How to build the II?
• Which properties should we follow?
• How to rank new results?
Outline
1. Inverted Indices
2. Graph Based Entity Search
1. Object Properties vs Datatype Properties
2. Properties to Follow
3. Experimental Results
1. Experimental Setting
2. IR Techniques: Experimental Results
3. Evaluation of the Hybrid Approaches
4. Overhead of the Graph Traversal
1. Inverted Indices (IIs)
• Simple inverted index:
• index all literals attached to each node in the input graph.
• “movie” → http://…types/film
• Structured inverted index with three fields:
• URI - tokenized URIs identifying entities.
• Label - manually selected datatype properties that point to textual descriptions of the entity (e.g., label, title, name, full-name, …).
• Attributes - all other literals.
BM25(F), query auto-completion, query extension, relevance
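As a rough illustration of the simple inverted index, the sketch below indexes every literal attached to an entity and ranks URIs with a common variant of the Okapi BM25 formula. The toy data, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions made for this example and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict

# Toy data (invented): entity URI -> literals attached to that node.
LITERALS = {
    "http://example.org/film/1": ["Harry Potter and the Deathly Hallows", "fantasy movie"],
    "http://example.org/person/2": ["Harry Chen", "researcher"],
    "http://example.org/types/film": ["movie"],
}

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return text.lower().split()

def build_index(literals):
    """Return (token -> {uri: term frequency}, uri -> document length)."""
    index = defaultdict(Counter)
    lengths = {}
    for uri, values in literals.items():
        tokens = [tok for value in values for tok in tokenize(value)]
        lengths[uri] = len(tokens)
        for tok in tokens:
            index[tok][uri] += 1
    return index, lengths

def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Rank URIs by BM25; any matching query term contributes (OR semantics)."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for uri, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[uri] / avg_len)
            scores[uri] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index, lengths = build_index(LITERALS)
ranked = bm25("harry potter", index, lengths)
```

A structured index would keep one such posting map per field (URI, Label, Attributes) and combine them with per-field weights, which is the idea behind BM25F.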
2. Graph-Based Entity Search
[Figure: the top-N IR results are expanded by following properties p1, p2, …, p_m to new URIs; each new entity e is kept if sim(e, q) > τ, assigned a score (e.g., 0.284, 1.428, 0.556), and merged into the re-ranked results.]
• Take top-N docs.
• Follow links/properties and get new URIs.
• Filter new results by text similarity w.r.t. the user query.
Scoring functions:
count sim > τ,
avg sim > τ,
Sum sim,
Avg sim,
Sum BM25 - ε
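The steps above can be sketched as follows. This is one possible reading, not the authors' implementation: the Jaccard token overlap is a stand-in for whatever text similarity was actually used, and "Sum BM25 - ε" is interpreted here as the sum of the BM25 scores of the top-N entry points that lead to a new entity, minus a small ε. The function and parameter names (`hybrid_search`, `follow`, `eps`) are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity; an assumed stand-in for sim(e, q)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def hybrid_search(query, ir_results, graph, labels, follow, n=3, tau=0.0, eps=1e-3):
    """ir_results: [(uri, bm25_score), ...], best first.
    graph: {uri: [(predicate, neighbour_uri), ...]}; labels: {uri: text}."""
    scores = dict(ir_results)
    found = {}
    for uri, score in ir_results[:n]:              # take the top-N docs
        for predicate, neighbour in graph.get(uri, []):
            if predicate not in follow:            # follow selected properties only
                continue
            if neighbour in scores:                # already returned by the index
                continue
            if jaccard(labels.get(neighbour, ""), query) > tau:  # sim(e, q) > tau?
                found[neighbour] = found.get(neighbour, 0.0) + score
    for uri, total in found.items():               # assumed "Sum BM25 - eps" scoring
        scores[uri] = total - eps
    # Merge index results and new URIs into one re-ranked list.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with invented identifiers.
ir = [("A", 2.0), ("B", 1.0)]
graph = {"A": [("sameAs", "C"), ("other", "D")], "B": [("sameAs", "C")]}
labels = {"C": "sigir conference", "D": "sigir"}
merged = hybrid_search("sigir", ir, graph, labels, follow={"sameAs"})
```

In this toy run, `C` is reached from both `A` and `B`, so it accumulates both scores, while `D` is never visited because its predicate is not in the followed set.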
2.1. Object Properties vs Datatype Properties
• Object properties:
• connect different entities
• allow exploring the whole graph
• Datatype properties:
• give additional info about entities
• allow exploring just the neighborhood of a node
2.2. Properties to Follow
• RDF graph queried with SPARQL queries.
• Scope 1 queries vs Scope 2 queries.
• Set of predicates to follow selected using:
• Common sense (e.g., sameAs)
• Statistics from the data
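To make the SPARQL side more concrete, here is a minimal sketch that builds the two kinds of query as strings. It assumes that a scope 1 query looks one hop away from the entry point and a scope 2 query looks two hops away; the slides do not spell this out. The helper names and the example entity URI are hypothetical, and the queries use the SPARQL 1.1 `VALUES` clause to restrict the predicates.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def _values(predicates):
    # Render predicate IRIs for a SPARQL 1.1 VALUES clause.
    return " ".join(f"<{p}>" for p in predicates)

def scope1_query(entity_uri, predicates):
    """Assumed scope 1: entities directly linked to the entry point."""
    return (
        "SELECT DISTINCT ?o WHERE { "
        f"VALUES ?p {{ {_values(predicates)} }} "
        f"<{entity_uri}> ?p ?o . "
        "FILTER(isIRI(?o)) }"
    )

def scope2_query(entity_uri, predicates):
    """Assumed scope 2: entities two hops away from the entry point."""
    vals = _values(predicates)
    return (
        "SELECT DISTINCT ?o2 WHERE { "
        f"VALUES ?p1 {{ {vals} }} VALUES ?p2 {{ {vals} }} "
        f"<{entity_uri}> ?p1 ?o1 . ?o1 ?p2 ?o2 . "
        "FILTER(isIRI(?o2)) }"
    )

# Hypothetical entry point returned by the inverted index.
entry = "http://example.org/resource/SIGIR"
q1 = scope1_query(entry, [OWL_SAME_AS])
q2 = scope2_query(entry, [OWL_SAME_AS])
```

The predicate list is where the selection described above would plug in: common-sense choices such as `owl:sameAs`, plus predicates picked from statistics on the data.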
Properties to Follow: Two Examples
[Figure; label: entry point given by the II]
3. Experimental Results
3.1 Experimental Setting
• SemSearch 2010 and 2011 test sets:
• Billion Triple Challenge 2009 (BTC2009) dataset
• 1.3 billion RDF triples crawled from the LOD cloud.
• 92 and 50 queries, respectively.
• Evaluation of systems with depth-10 pooling by means of crowdsourcing.
• Measures taken into consideration: Mean Average Precision (MAP),
Normalized Discounted Cumulative Gain (NDCG), early Precision
(P10)
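For reference, the three measures can be computed as in the sketch below, using their textbook definitions. The function names are chosen for this example, and the NDCG helper makes a simplifying assumption: the ideal ranking is derived only from the gains in the list it is given, so it is exact only when that list contains every judged result.

```python
import math

def precision_at_k(relevant_flags, k=10):
    """Fraction of the top-k results that are relevant (P10 when k=10)."""
    return sum(relevant_flags[:k]) / k

def average_precision(relevant_flags, n_relevant):
    """Mean of precision values at each rank where a relevant result appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / n_relevant if n_relevant else 0.0

def mean_average_precision(runs):
    """runs: [(relevant_flags, n_relevant), ...], one entry per query."""
    return sum(average_precision(f, n) for f, n in runs) / len(runs)

def ndcg(gains, k=10):
    """NDCG@k with graded gains; ideal ordering taken from `gains` itself."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0

# Toy judgments (invented): 1 = relevant, 0 = not relevant.
flags = [1, 0, 1, 0]
ap = average_precision(flags, n_relevant=2)
```

MAP averages the per-query average precision over all queries in a test set, which is why the query counts above matter when comparing runs.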
Completing Relevance by
Crowdsourcing Judgements
• We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk.
• To be fair, we used the same design and settings that were used for the AOR task of SemSearch.
3.2. IR Techniques: Experimental Results
[Results figure not preserved; annotation: “Our Baseline.”]
3.3. Evaluation of Hybrid Approaches
N = 3, τ = 0, score = Sum BM25 - ε
3.4. Overhead of the Graph Traversal
• Time in milliseconds needed for each part of the hybrid approaches.
• Measurements taken on a single machine with cold cache.
Surprisingly small overhead (17% for best results).
Conclusions
• AOR = “Given the description of an entity, give me back its identifier”
• Disappointing results using simple IR techniques for the AOR task.
• Hybrid system for AOR:
• combining classic IR techniques + a structured database storing graph data.
• Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over the BM25 baseline).
• For the best working configuration found, the overhead caused by the graph traversal part is limited (17% more than running the chosen baseline).
Thank you for your attention
• You can find the new relevance judgments at
http://diuf.unifr.ch/xi/HybridAOR.
• More info at www.exascale.info.
• In the following days you’ll find our paper, this presentation,
and the new crowdsourced relevance judgements at
www.exascale.info/AOR.


Editor's Notes

  1. A lot of search engine queries are about entities (more than half); there is the task...
  2. Tell that literals are strings attached to some node.
  3. Just the only scoring function.
  4. Tell what sameAs is.
  5. The data is a graph; the inverted index gives us an entry point and then we walk
  6. TREC-like collection/test set, depth-10 pooling: everyone here knows it!
  7. Say that the simple index is “or”; UL, LA, ULA is “and”. Say disappointment with the first results with BM25: we tried to do just the II but it didn’t work, and then we decided to go for the graph… NO GOOGLE
  8. Compare JUST s_1 with s_2 (lower recall but higher precision).
  9. s2_3 doesn’t follow wikilinks. Indices and database were resident on the machine. We didn’t focus on efficiency.