An increasing amount of valuable semi-structured data has become available online. In this talk, we give an overview of the state of the art in entity ranking over structured data ("linked data").
3. Motivation
An increasing amount of valuable semi-structured
data has become available online, e.g.
RDF graphs: Linking Open Data (LOD) cloud
Web pages enhanced with microformats, RDFa
etc.: CommonCrawl, Web Data Commons
Google: Freebase Annotations of the ClueWeb
Corpora
More than half of the queries in real query logs
have an entity-centric user intent
Examples from industry: Google Knowledge Graph,
Facebook Graph Search, Yandex Islands ⇒
3 / 60
9. In this talk, we focus on entity ranking over RDF
graphs given a keyword search query
10. Key Issues in Entity Ranking
Ambiguity in names
Related entities from heterogeneous
data sources
Complex queries with clarifying terms
11. Key Issues in Entity Ranking
Ambiguity in names
Given the query university of michigan, candidate
matches include University of Michigan, Ann Arbor,
but also Central Michigan University, Michigan
Technological University, and Michigan State
University
12. Key Issues in Entity Ranking
Related entities from heterogeneous data sources
Given the query harry potter movie,
semantic link information can effectively enhance the
term context
13. Key Issues in Entity Ranking
Complex queries with clarifying terms
Given a query shobana
masala, the user intent is
likely about Shobana
Chandrakumar, an Indian
actress starring in movies of
the Masala genre
14. Ad-hoc Object Retrieval in the Web of Data
Jeffrey Pound, Peter Mika, Hugo Zaragoza
WWW 2010
15. Query Categories
Entity query (∼ 40%∗), e.g. 1978 cj5
jeep
Type query† (∼ 12%), e.g. doctors in
barcelona
Attribute query (∼ 5%), e.g. zip code
atlanta
Other query (∼ 36%)
however, ∼ 14% of them contain a context
entity or type
∗ estimated on real query logs from Yahoo!
† a.k.a. list search query
16. Repeatable and Reliable
Search System Evaluation
using Crowdsourcing
Roi Blanco, Harry Halpin, Daniel M. Herzig,
Peter Mika, Jeffrey Pound, Henry S. Thompson,
Thanh D. Tran
SIGIR 2011
17. Data Collection
Billion Triples Challenge 2009 RDF data set
The size of the uncompressed data is 247 GB;
1.4B triples describing 114 million objects
It was compiled by combining crawls from
multiple RDF search engines
21. Query Set Preparation
1. Emulate top queries
Given a Microsoft Live Search log containing
queries repeated by at least 10 different users
Sample 50 queries, prefiltered with a NER and
a gazetteer
2. Emulate long-tail queries
Given the Yahoo! Search Query Log Tiny Sample
v1.0 – 4,500 queries
Sample and manually filter out ambiguous
queries ⇒ 42 queries
⇒ a list of 92 queries in total
22. Crowdsourcing Judgements
A purpose-built rendering tool to present
the search results
The evaluation (MT1) was conducted and then
repeated (MT2) after 6 months
Using Amazon Mechanical Turk HITs
Each HIT consists of 12 query-result pairs:
10 real ones and 2 from a "gold standard"
annotated by experts
64 workers for MT1 and 69 workers for MT2
25. Targeting Evaluation Measures I
All the measures are usually computed on the top-10 search
results (k = 10)
1. P@k (precision at k):
P@k(π, l) = (1/k) · Σ_{t=1}^{k} I{l_π(t) = 1}
2. MAP (mean average precision):
AP(π, l) = (1/m₁) · Σ_{k=1}^{m} P@k(π, l) · I{l_π(k) = 1},
where m is the number of results and m₁ the number of
relevant ones
MAP = mean of AP over all queries
26. Targeting Evaluation Measures II
3. NDCG: normalized discounted cumulative gain
DCG@k(π, l) = Σ_{j=1}^{k} G(l_π(j)) · η(j),
where G(·), the rating of a document, is usually
G(z) = 2^z − 1, η(j) = 1/log(j + 1), and l_π(j) ∈ {0, 1, 2}
NDCG@k(π, l) = (1/Z_k) · DCG@k(π, l)
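As a minimal sketch, the three measures above can be computed directly from a ranked list of relevance labels (binary labels for P@k/AP, graded labels in {0, 1, 2} for NDCG; the log base in η(j) is taken as 2, which cancels in the NDCG ratio):

```python
import math

def precision_at_k(rels, k):
    """P@k: fraction of relevant results among the top k (rels = 0/1 labels in rank order)."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP: mean of P@k over the ranks k at which a relevant result appears."""
    hits = [precision_at_k(rels, k) for k in range(1, len(rels) + 1) if rels[k - 1] == 1]
    return sum(hits) / len(hits) if hits else 0.0

def dcg_at_k(ratings, k):
    """DCG@k with gain G(z) = 2^z - 1 and discount eta(j) = 1/log2(j + 1)."""
    return sum((2 ** z - 1) / math.log2(j + 2) for j, z in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k):
    """NDCG@k: DCG normalized by Z_k, the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0
```

MAP is then simply the mean of `average_precision` over all queries in the benchmark.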
27. Analysis of Results
Reliability
Metric   Difference
MAP      1.8%
NDCG     3.5%
P@10     12.8%
In this setting, experts rate more results as
negative than the workers do
P@10 is more fragile than MAP and NDCG
30. Entity Search Track Submission by
Yahoo! Research Barcelona
Roi Blanco, Peter Mika, Hugo Zaragoza
SSW at WWW 2010
31. YSC 2010 Winner Approach
Only RDF S-P-O triples with literal objects are considered
Triples are filtered by predicates from a predefined
list of 300 predicates
Triples about the same subject are grouped into a
pseudo document with multiple fields
The BM25F ranking formula is applied (the weighting
scheme wc is handcrafted):
BM25F = Σ_{t ∈ q∩d} tf(t, d) / (k1 + b · tf(t, d)) · idf(t),
where tf(t, d) = Σ_{c ∈ d} wc · tfc(t, d)
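A minimal sketch of the fielded scoring above, on a toy two-entity corpus (the field names, field weights, and k1, b values are illustrative assumptions, not the handcrafted scheme from the submission):

```python
import math

# Toy fielded corpus: each entity is a pseudo document with named fields.
FIELD_WEIGHTS = {"name": 3.0, "attributes": 1.0}  # handcrafted w_c (illustrative)
K1, B = 1.2, 0.75  # assumed parameter values

docs = [
    {"name": "barack obama", "attributes": "president usa"},
    {"name": "michelle obama", "attributes": "first lady"},
]

def weighted_tf(term, doc):
    """tf(t, d) = sum over fields c of w_c * tf_c(t, d)."""
    return sum(w * doc.get(c, "").split().count(term) for c, w in FIELD_WEIGHTS.items())

def idf(term):
    """Smoothed inverse document frequency over the pseudo documents."""
    df = sum(1 for d in docs if weighted_tf(term, d) > 0)
    return math.log((len(docs) + 1) / (df + 1))

def bm25f(query, doc):
    """Simplified BM25F as on the slide: sum_t tf/(k1 + b*tf) * idf(t)."""
    score = 0.0
    for t in query.split():
        tf = weighted_tf(t, doc)
        if tf > 0:
            score += tf / (K1 + B * tf) * idf(t)
    return score
```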
32. Sindice BM25MF at SemSearch 2011
Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati,
Diego Ceccarelli, Giovanni Tummarello
SSW at WWW 2011
33. YSC 2011 Winner Approach I
URI resolution for triple objects
Extended BM25F approach with additional
normalization of term frequencies per
predicate type:
The weighting scheme is handcrafted
The proportion of query terms in entity
literals
39. Approach to entity representation II
a) Unstructured Entity Model; b) Structured Entity Model (figure omitted)
40. Main Findings
Two generative language models (LMs) for
the task:
Unstructured Entity Model
Structured Entity Model
The evaluation on the YSC data shows that
the representation of relations as a mixture
of predicate type LMs can contribute
significantly to overall performance
41. LM Retrieval Framework
P(e|q) = P(q|e)P(e) / P(q), which is rank-equivalent to P(q|e)P(e),
where P(e|q) is the probability of entity e being relevant given query q
Further Assumptions
(i) P(e) is uniform; (ii) query terms are i.i.d.
Let θe be the entity model that predicts how likely the
entity would produce a given term t; then
the query likelihood is
P(q|θe) = Π_{t ∈ q} P(t|θe)^{tf(t,q)}
42. Unstructured Entity Model
Idea
Collapse all text values of properties associated
with the entity into a single document and apply
standard IR techniques
The entity model is a Dirichlet-smoothed
multinomial distribution:
P(t|θe) = (tf(t, e) + µ · P(t|θc)) / (|e| + µ)
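A minimal sketch of this model on a toy two-entity collection (the entity texts and the value of µ are illustrative assumptions); it also applies the query-likelihood formula from the retrieval framework:

```python
import math
from collections import Counter

# Toy collection: each entity's property values collapsed into one bag of words.
entities = {
    "barack_obama": "barack obama president united states",
    "michigan": "university of michigan ann arbor",
}
MU = 100.0  # assumed Dirichlet prior

collection = Counter(w for text in entities.values() for w in text.split())
coll_len = sum(collection.values())

def p_term(term, entity_text):
    """Dirichlet-smoothed P(t|theta_e) = (tf(t,e) + mu*P(t|theta_c)) / (|e| + mu)."""
    words = entity_text.split()
    p_c = collection[term] / coll_len
    return (words.count(term) + MU * p_c) / (len(words) + MU)

def log_query_likelihood(query, entity_text):
    """log P(q|theta_e); terms unseen in the whole collection are skipped to avoid log(0)."""
    return sum(math.log(p_term(t, entity_text)) for t in query.split() if collection[t] > 0)
```

Entities are then ranked by `log_query_likelihood` for a given query.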
43. Structured Entity Model
Folding Predicates
Group RDF triples by the following predicate types pt :
Name, e.g. literal values of foaf:name, rdfs:label
Attributes, i.e. remaining datatype properties
OutRelations: resolving the "object" (O) URIs of S-P-O
triples and taking their names
InRelations: resolving the "subject" (S) URIs of S-P-O
triples and taking their names
44. Structured Entity Model
Mixture of Language Models
Each predicate type group has its own LM P(t|θe^pt):
P(t|θe^pt) = (tf(t, pt, e) + µ_pt · P(t|θc^pt)) / (|pt, e| + µ_pt)
Then, the entity model is a linear mixture of the
predicate type LMs:
P(t|θe) = Σ_{pt} P(t|θe^pt) · P(pt)
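The folding and mixing above can be sketched as follows; the field contents, mixture weights P(pt), per-field priors µ_pt, and the uniform background standing in for P(t|θc^pt) are all illustrative assumptions:

```python
# Toy entity with per-predicate-type fields (names and contents illustrative).
entity = {
    "name": "barack obama",
    "attributes": "president of the united states",
    "outrelations": "michelle obama white house",
}
MIX = {"name": 0.5, "attributes": 0.3, "outrelations": 0.2}  # P(pt), sums to 1
MU = {"name": 10.0, "attributes": 10.0, "outrelations": 10.0}  # per-field priors

VOCAB = 1000  # uniform background model standing in for P(t|theta_c^pt)

def p_background(term):
    return 1.0 / VOCAB

def p_field(term, pt):
    """Per-field Dirichlet-smoothed LM P(t|theta_e^pt)."""
    words = entity[pt].split()
    return (words.count(term) + MU[pt] * p_background(term)) / (len(words) + MU[pt])

def p_term(term):
    """P(t|theta_e) = sum over pt of P(t|theta_e^pt) * P(pt)."""
    return sum(p_field(term, pt) * MIX[pt] for pt in MIX)
```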
51. Structured Inverted Index
Consider the following property values as fields:
URI: tokens from entity URI, e.g. http:
//dbpedia.org/page/Barack_Obama
⇒ ’barack’, ’obama’ etc.
Labels: values of a list of manually selected
datatype properties
Attributes: other properties
BM25F is used as a ranking function
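The URI field construction above can be sketched with a hypothetical helper that splits the trailing path segment of an entity URI on non-word characters and underscores:

```python
import re

def uri_tokens(uri):
    """'http://dbpedia.org/page/Barack_Obama' -> ['barack', 'obama']"""
    last = uri.rstrip("/").rsplit("/", 1)[-1]
    return [t.lower() for t in re.split(r"[\W_]+", last) if t]
```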
52. Graph-based Entity Search
1. Given a query q, obtain a list of entities
Retr = {e1, e2, . . . , en} ranked by their BM25F
scores
2. Use the top-N elements as seeds for graph traversal
3. To get StructRetr = {e'1, . . . , e'm}, exploit
promising LOD properties‡ as well as Jaro-Winkler
string similarity scores JW(q, e') > τ
4. Combine the two rankings:
finalScore(q, e') = λ · BM25F(q, e') + (1 − λ) · JW(q, e')
‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect
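Step 4's score combination can be sketched with a from-scratch Jaro-Winkler similarity (no external library assumed; the value of λ and the prefix scaling factor p = 0.1 are illustrative):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(n1, n2) // 2 - 1
    m1, m2 = [False] * n1, [False] * n2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear out of order, halved.
    t, j = 0, 0
    for i in range(n1):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)

def final_score(bm25f_score, query, entity_name, lam=0.7):
    """finalScore(q, e') = lambda * BM25F(q, e') + (1 - lambda) * JW(q, e')."""
    return lam * bm25f_score + (1 - lam) * jaro_winkler(query, entity_name)
```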
53. Evaluation
The graph-based approach (S1_1) outperforms BM25
scoring, with a 25% improvement in MAP on the 2010 data set
No significant improvement over the baseline on the 2011 data
set
This may be explained by the scarcity of the exploited predicates
(owl:sameAs volume < 0.7%)
54. Improving Entity Search over Linked Data
by Modeling Latent Semantics
Nikita Zhiltsov, Eugene Agichtein
CIKM 2013
55. Key Contributions
A tensor-factorization-based approach to incorporating
semantic link information into the ranking model
Outperforms the state-of-the-art baseline in
NDCG/MAP/P@10
A thorough evaluation of the proposed techniques
by acquiring thousands of manual labels to augment
the YSC benchmark data set
⇒ more details in the next talk
57. Negative Results
Ideas from standard IR that do not work out:
Wordnet-based query expansion [Tonon et al.,
SIGIR 2012]
Pseudo-relevance feedback [Tonon et al., SIGIR
2012]
Query suggestions of a commercial search engine
[Tonon et al., SIGIR 2012]
Direct application of centrality measures, such as
PageRank and HITS [Campinas et al., SSW WWW
2010; Dali et al., 2012]
59. Wrap up
Entity search over RDF graphs a.k.a. ad-hoc object
retrieval has emerged as a new task in IR
There is a robust and consistent evaluation
methodology for it
State-of-the-art approaches mostly revolve around
applications of well-known IR methods
Lack of approaches for leveraging semantic links
Lots of data: scalability really matters