Collection Ranking and Selection for Federated Entity Search
1. Collection Ranking & Selection
for Federated Entity Search
Krisztian Balog, Robert Neumayer, and Kjetil Nørvåg
Norwegian University of Science and Technology
19th International Symposium on String Processing and Information Retrieval (SPIRE 2012)
Cartagena de Indias, Colombia, October 2012
2. Motivation
- Entities are ubiquitous
Many information needs revolve around entities
(people, products, organisations, places, etc.)
- Growing amount of structured data
Again, organised around entities
- Entities are often searched by their name
If we know (heard of) it, we just ask for it by the name
3. Top searches of 2011*
Web search Mobile search
1. iPhone 1. iPhone 5
2. Casey Anthony 2. Powerball
3. Kim Kardashian 3. MLB
4. Katy Perry 4. Scrabble cheat
5. Jennifer Lopez 5. Casey Anthony
6. Lindsay Lohan 6. Hurricane Irene 2011
7. American Idol 7. Kim Kardashian
8. Jennifer Aniston 8. Translator
9. Japan earthquake 9. Amy Winehouse
10. Osama bin Laden 10. May 21, 2011 Rapture
* http://yearinreview.yahoo.com/
4. Top searches of 2011*
Web search Mobile search
1. iPhone 1. iPhone 5
2. Casey Anthony 2. Powerball
“users have learned that search engine relevance
3. Kim Kardashian 3. MLB
decreases with longer queries and have grown
4. Katy Perry 4. Scrabble cheat
accustomed to reducing their query Anthony
5. Jennifer Lopez 5. Casey (at least
initially) to the name of an entity”
6. Lindsay Lohan 6. Hurricane Irene 2011
7. American Idol 7. Kim Kardashian
8. Jennifer Aniston
[Blanco et al., 2011]
8. Translator
9. Japan earthquake 9. Amy Winehouse
10. Osama bin Laden 10. May 21, 2011 Rapture
* http://yearinreview.yahoo.com/
5. The Web of Data
# Linking Open Data datasets
300
200
100
0
2007 2008 2009 2010 2011
6. LOD in 2007
Fresh- ECS
South- SIOC
meat
ampton
BBC
Later + NEW!
Sem- NEW!
TOTP Web-
SW
Central Onto- Conference
world Corpus
Music- Magna- FOAF
brainz tune
Open-
Guides
Revyu
RDF Book
Jamendo Geo- Mashup
names
DBpedia
DBLP
Berlin
US
World NEW!
Census
Fact-
Data book NEW! lingvoj
Euro- flickr
stat wrappr
Wiki- Open DBLP
company Cyc Hannover
Gov-
Track Project
W3C Guten-
WordNet berg
7. LOD in 2011
Linked
LOV User Slideshare tags2con
Audio
Feedback 2RDF delicious
Moseley Scrobbler Bricklink Sussex
Folk (DBTune) Reading St.
GTAA
Magna- Lists Andrews
Klapp-
tune stuhl- Resource NTU
DB club Lists Resource
Tropes Lotico Semantic yovisto
John Music Man- Lists
Music Tweet chester
Hellenic Peel Brainz NDL
(DBTune) (Data Brainz Reading
subjects
FBD (zitgist) Lists Open
EUTC Incubator) Linked
Hellenic Library Open t4gm
Produc- Crunch-
PD Surge RDF info
tions
Discogs base Library
Radio Ontos Source Code
Crime ohloh Plymouth (Talis)
(Data News LEM
Ecosystem Reading RAMEAU
Reports business Incubator)
Crime data.gov. Portal Linked Data Lists SH
UK Music Jamendo
(En- uk
Brainz (DBtune) LinkedL
Ox AKTing) FanHubz gnoss ntnusc
(DBTune) SSW CCN
Points Thesau-
Last.FM Poké- Thesaur
Popula- artists pédia Didactal us rus W LIBRIS
tion (En- (DBTune) Last.FM ia theses. LCSH Rådata
reegle research patents MARC
AKTing) (rdfize) my fr nå!
data.gov. data.go Codes
Ren.
NHS uk v.uk Good- Experi-
Classical List
Energy (En- win flickr ment
(DB Pokedex Norwe-
Genera- AKTing) Mortality BBC Family wrappr Sudoc PSH
Tune) gian
(En-
tors Program MeSH
AKTing) semantic
mes BBC IdRef GND
CO2 educatio OpenEI web.org SW
Energy Sudoc ndlna
Emission n.data.g Music Dog VIAF
EEA (En- Chronic- Linked
(En- ov.uk Portu- Food UB
AKTing) ling Event MDB
AKTing) guese Mann- Europeana
BBC America Media
DBpedia Calames heim
Ord- Recht- Wildlife Deutsche
Open Revyu DDC
Openly spraak. Finder Bio- lobid
Election nance
legislation Local nl RDF graphie
Resources NSZL Swedish
Data Survey Tele- data Ulm
EU New Book
Project data.gov.uk graphis bnf.fr Catalog Open
Insti- York
URI Open Mashup Cultural
tutions Times Greek P20
UK Post- Burner Calais Heritage
codes DBpedia ECS Wiki
statistics lobid
GovWILD data.gov. Taxon iServe South- Organi-
LOIUS BNB
Brazilian
uk Concept ECS ampton sations
Geo World OS BibBase STW GESIS
Poli- ESD South- ECS
Names Fact- (RKB
ticians stan- reference ampton
data.gov.uk book Freebase Explorer) Budapest
dards data.gov. NASA EPrints
uk intervals Project OAI
Lichfield transport (Data DBpedia data
Guten- Pisa
Spen- data.gov. Incu- dcs RESEX Scholaro-
ISTAT ding bator) Fishes berg DBLP DBLP
uk Geo
meter
Immi- Scotland of Texas (FU (L3S)
Pupils & Uberblic DBLP
gration Species Berlin) IRIT
Exams Euro- dbpedia data- (RKB
London TCM ACM
stat lite open- Explorer) NVD
Gazette (FUB) Gene IBM
Traffic Geo ac-uk
Scotland TWC LOGD Eurostat Daily DIT
Linked UN/
Data UMBEL Med ERA
Data LOCODE DEPLOY
Gov.ie CORDIS YAGO New-
lingvoj Disea-
(RKB some SIDER RAE2001 castle LOCAH
CORDIS Explorer) Linked Eurécom
Eurostat Drug CiteSeer Roma
(FUB) Sensor Data
GovTrack (Ontology (Kno.e.sis) Open Bank Pfam Course-
Central) riese Enipedia
Cyc Lexvo LinkedCT ware
Linked PDB
UniProt VIVO
EURES EDGAR dotAC
US SEC Indiana ePrints IEEE
(Ontology totl.net
(rdfabout)
Central) WordNet RISKS
(VUA) Taxono UniProt
US Census EUNIS Twarql HGNC
Semantic Cornetto (Bio2RDF)
(rdfabout) my VIVO
FTS XBRL PRO- ProDom STITCH Cornell LAAS
SITE KISTI NSF
Scotland
Geo- GeoWord LODE
graphy Net WordNet WordNet JISC
(W3C) (RKB
Climbing
Linked Affy- KEGG
SMC Explorer) SISVU Pub VIVO UF
Piedmont GeoData metrix Drug
ECCO-
Finnish Journals PubMed Gene SGD Chem
Munici-
Accomo- El AGROV Ontology TCP Media
dations Alpine bible
palities Viajero OC
Ski ontology
Tourism KEGG
Ocean
Austria
Enzyme PBAC Geographic
Metoffice GEMET ChEMBL
Italian Drilling OMIM KEGG
Weather Open
public Codices AEMET Linked MGI Pathway
schools Forecasts
Data
Open InterPro GeneID Publications
EARTh Thesau- KEGG
Turismo
rus Colors Reaction
de
Zaragoza Product Smart KEGG
User-generated content
Weather DB Link Medi Glycan
Janus Stations Product Care KEGG
AMP UniParc UniRef UniSTS Government
Types Italian
Homolo Com-
Yahoo! Airports Museums pound
Ontology Google
Gene
Geo Art
Planet National
wrapper
Chem2 Cross-domain
Radio- Bio2RDF
activity UniPath
JP Sears Open Linked OGOLOD way
Life sciences
Corpo- Amster- Reactome
dam medu- Open
rates Numbers
Museum cator
As of September 2011
8. Ad-hoc entity search
At the 2010/11 Semantic Search Challenge
- Task
Given a keyword query, targeting a particular entity,
provide a ranked list of relevant entities (i.e., URIs)
- Queries
Sampled from web search engine logs (142 in total)
- Data collection
Billion Triple Challenge 2009 (BTC) dataset
- Relevance judgments
On a 3-point scale, collected using crowdsourcing
9. In this talk
- Address the ad-hoc entity retrieval task in a
distributed setting
- The Web of Data is inherently distributed
- Some data sources may not be crawleable at all
- Specifically, our focus is on the collection ranking
and collection selection steps
10. Federated search
A typical broker-based architecture
1 Collection ranking Central broker
Summary A
Summary B
Summary C
Q 1 A
C
B
11. Federated search
A typical broker-based architecture
1 Collection ranking Central broker
2 Collection selection
Summary A
Summary B
Summary C Collection A
Q
Q 1 A
Collection B
C
2
B
Q
Collection C
12. Federated search
A typical broker-based architecture
1 Collection ranking Central broker
2 Collection selection
3 Result merging Summary A
Summary B
Summary C Collection A
Q
Q 1 A
Collection B
C
2
B
Q
3 Collection C
13. Next: baseline models
for collection ranking and selection
1 Collection ranking Central broker
2 Collection selection
3 Result merging Summary A
Summary B
Summary C Collection A
Q
Q 1 A
Collection B
C
2
B
Q
3 Collection C
14. Collection ranking
Collection-centric (CC)
- Lexicon-based method
Treat and score each collection as if it was a single,
large document
Collection A
Q Collection C
Collection B
... ...
Y
P (c|q) / P (c) · P (t|✓c )
t2q
15. Collection ranking
Entity-centric (EC)
- Document-surrogate method
Model and query individual documents (entities) and
aggregate their relevance scores
Collection A
Collection C
Q
Collection B
...
...
X
P (c|q) / P (e|q)
e2c,r(e,q)<
17. Our method: AENN
“All that an Entity Needs is a Name”
- Exploit that entities are searched by their name
- The central broker maintains a complete dictionary
of entity names (and corresponding identifiers)
- Utilise this information in the collection selection step
to dynamically adjust the #collections selected
18. AENN collection ranking
- Key observation
Different methods—collection-centric (CC) vs. entity-
centric (EC)—work best for different queries
- Idea
Combination should give better results than any of the
two methods alone
AENN(c, q) = (1 ) · CC(c, q) + · EC(c, q)
19. AENN collection ranking
Example
CC EC AENN
A 0.35 A 0.65 A 0.5
D 0.3 B 0.3 B 0.2
C 0.2 C 0.05 D 0.15
E 0.15 C 0.125
B 0.1 E 0.075
F 0.05 F 0.025
20. AENN collection selection
- Key observation
CC has higher recall, while EC has better precision
- Idea
Use the collection rankings generated by EC and/or
CC to dynamically adjust the set of collections selected
- Precision-oriented selection
- Recall-oriented selection
- Balanced selection
21. AENN collection selection
Precision-oriented selection (AENN(p))
- Only select collections returned by EC
CC EC AENN AENN(p)
A 0.35 A 0.65 A 0.5 A 0.5
D 0.3 B 0.3 B 0.2 B 0.2
C 0.2 C 0.05 D 0.15 C 0.125
E 0.15 C 0.125
B 0.1 E 0.075
F 0.05 F 0.025
22. AENN collection selection
Recall-oriented selection (AENN(r))
- Include collections from CC until all from EC are
covered. This defines the cutoff point for AENN
CC EC AENN AENN(r)
A 0.35 A 0.65 A 0.5 A 0.5
D 0.3 B 0.3 B 0.2 B 0.2
C 0.2 C 0.05 D 0.15 D 0.15
E 0.15 C 0.125 C 0.125
B 0.1 E 0.075 E 0.075
F 0.05 F 0.025
23. AENN collection selection
Balanced selection (AENN(b))
- Include collections from AENN until all from EC
are covered
CC EC AENN AENN(b)
A 0.35 A 0.65 A 0.5 A 0.5
D 0.3 B 0.3 B 0.2 B 0.2
C 0.2 C 0.05 D 0.15 D 0.15
E 0.15 C 0.125 C 0.125
B 0.1 E 0.075
F 0.05 F 0.025
24. AENN collection selection
Comparison of approaches
AENN(p) AENN(r) AENN(b)
A 0.5 A 0.5 A 0.5
B 0.2 B 0.2 B 0.2
C 0.125 D 0.15 D 0.15
C 0.125 C 0.125
E 0.075
25. Experimental setup
Based on the 2010/11 Semantic Search Challenge
- Distributed environment
Top 100 largest second-level domains from BTC
- Three sets with different handling of DBpedia
- Relevance
Considered the #relevant entities from each collection
- Metrics
- Collection ranking: Standard IR metrics (MAP, MRR, nDCG)
- Collection selection: Analogues of precision and recall, plus
the avg. #coll. selected
30. Summary
- Addressed the task of ad-hoc entity retrieval in a
distributed setting
- Introduced AENN, a novel collection ranking and
selection method based on a lean name-based
entity representation
- Showed experimentally that AENN can outperform
standard baselines that consider all entity content
- Further, AENN can be geared towards high precision,
high recall, or a balanced setting
This is how the Linking Open Data cloud looked like sometime in 2007\n
And this is how it looks like more-or-less now.\n
Main lessons learned: fielded variants of standard document retrieval methods work well on this task\nAll approaches, however, assumed that a centralised index encompassing all entities is available\n
\n
As I said earlier, existing approaches build document-based representations of entities. Therefore, federated search techniques for document retrieval can easily be applied.\n
\n
\n
\n
\n
\n
\n
Acronym comes from the very insightful observation that ...\n\nPrevious methods rely on sampling\n\n\n
Note that the central broker contains only the names of entities\n
Ok, let&#x2019;s familiarise ourselves with these acronyms, because we&#x2019;ll be using them a lot in the following slides.\n
\n
Recall that EC operates on a local ranking of entities; we expect this to be the final list of results\nConsequently, only collection that contribute to this list will be selected.\n
We start at the top of the CC ranking and keep including collections until we have all from EC covered. \n
We start at the top of the CC ranking and keep including collections until we have all from EC covered. \n
\n
No standard test collection exists for this task.\n
\n
\n
\n
CC-N: lower bound\nEC-C: upper bound (essentially, a centralised index of all entities with full content)\n