Mapping Keywords to Linked Data Resources for Automatic Query Expansion
Isabelle Augenstein (1)
Anna Lisa Gentile (1)
Barry Norton (2)
Ziqi Zhang (1)
Fabio Ciravegna (1)
(1) Department of Computer Science, University of Sheffield, UK
(2) Ontotext, UK
{i.augenstein,a.l.gentile,z.zhang,f.ciravegna}@dcs.shef.ac.uk,
barry.norton@ontotext.com
May 26, 2013
Augenstein, Gentile, Norton, Zhang, Ciravegna Query Expansion May 26, 2013 1 / 21

Motivation
In order to consume Linked Data, end users need to
be familiar with RDF’s data model and the query language SPARQL
have knowledge of datasets and their contents
Keyword search is a means to overcome these barriers
Map keywords to RDF resources
“film” → dbpedia-owl:Film

Motivation
Challenges for keyword search:
Spelling mistakes, e.g. “flim”
Lexical derivations, e.g. “films”
Synonyms, e.g. “movie”
How to address these challenges? → query expansion
What is query expansion?
Process of expanding a seed query with additional query terms
to improve recall
“film”, “flim”, “films”, “movie” → dbpedia-owl:Film

State of the art: How to find mappings
String similarity (Sindice [4])
“flim”, “films” → dbpedia-owl:Film
Domain-independent approach
Problem: Only does expansion for spelling mistakes and
lexically similar words
Dictionary-based methods (WordNet [3], Wikipedia [2])
“movie” → dbpedia-owl:Film
Approach finds synonyms
Problem: Dictionaries contain limited, domain-specific
vocabulary; WordNet has very few named entities

State of the art: How to rank resources
String similarity
Rank by decreasing string similarity (SearchWebDB [5])
Rank by decreasing tf-idf (Falcons Object Search [1])
Dictionary-based methods (WordNet [3], Wikipedia [2])
Rank by decreasing semantic similarity (ESA [2])
Combine frequency and confidence (PowerAqua [3])

Our approach
How to find mappings
In order to have a highly adaptable, domain-independent
approach, use knowledge contained within the dataset
Benefit from properties between resources in Linked Data,
which are potentially useful for finding semantically similar
keywords
How to rank resources
Follow the intuition of state-of-the-art methods
Combine tf-idf and confidence of our measure

Method: Overview
Scenario: User wants to find representative Linked Data
resources for keyword w
Example: “movie” → dbpedia-owl:Film
Dataset: Set of triples, each consisting of a ‘subject’, a ‘predicate’ and
an ‘object’
Current application: Find classes and properties

Method: Overview
Step 1: For each keyword w, learn an expanded set of
keywords Ew and a ranking of Ew to find semantically similar
words in the target vocabulary
Example: “movie” → “movie”, “film”, “films”
Step 2: Identify concepts through labelling properties
“movie” → ∅
“film” → dbpedia-owl:Film
“films” → ∅
Step 3: Rank resulting concepts and return concept with
highest rank
“movie” → dbpedia-owl:Film
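
The three steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the expansion scores, the lookup tables and the function names (`expand`, `identify`, `map_keyword`) are hypothetical, and the real expansion is learned from the dataset rather than hard-coded.

```python
# Toy data: scores and tables are illustrative assumptions, not values from the slides.
EXPANSIONS = {"movie": [("movie", 1.0), ("film", 0.8), ("films", 0.5)]}
LABELS = {"film": "dbpedia-owl:Film"}  # label -> concept in the target vocabulary


def expand(keyword):
    """Step 1: return (expanded keyword, score) pairs for the keyword."""
    return EXPANSIONS.get(keyword, [(keyword, 1.0)])


def identify(term):
    """Step 2: return the concept whose label matches the term, or None."""
    return LABELS.get(term.lower())


def map_keyword(keyword):
    """Step 3: score the concepts that were found and return the best one."""
    scored = {}
    for term, score in expand(keyword):
        concept = identify(term)
        if concept is not None:
            # Keep the best score seen for each concept (an assumption).
            scored[concept] = max(scored.get(concept, 0.0), score)
    if not scored:
        return None
    return max(scored, key=scored.get)


print(map_keyword("movie"))
```

In this toy setup only “film” has a matching label, so “movie” is mapped to dbpedia-owl:Film, mirroring the example on the slide.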

Method: Overview
Set of labelling properties to identify concepts
rdfs:label
foaf:name
dc:title
skos:prefLabel
skos:altLabel
fb:type.object.name
Table: Labelling properties
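
A minimal sketch of how these labelling properties might be used to look up resources for a keyword. The in-memory triple list, the case-insensitive exact match and the helper name `resources_for_label` are assumptions made for illustration; the slides do not specify the matching rule.

```python
# Labelling properties as listed in the table above.
LABELLING_PROPERTIES = {
    "rdfs:label", "foaf:name", "dc:title",
    "skos:prefLabel", "skos:altLabel", "fb:type.object.name",
}


def resources_for_label(triples, keyword):
    """Return subjects whose label (via a labelling property) equals the keyword.

    Matching is case-insensitive and exact, which is an assumption.
    """
    wanted = keyword.strip().lower()
    found = []
    for subject, predicate, obj in triples:
        if (predicate in LABELLING_PROPERTIES
                and obj.strip().lower() == wanted
                and subject not in found):
            found.append(subject)
    return found


# Hypothetical triples (subject, predicate, literal object).
TRIPLES = [
    ("dbpedia-owl:Film", "rdfs:label", "Film"),
    ("fb:Movie", "fb:type.object.name", "Movie"),
    ("dbpedia-owl:Film", "rdfs:comment", "film"),  # not a labelling property
]

print(resources_for_label(TRIPLES, "film"))
print(resources_for_label(TRIPLES, "films"))
```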

Method: Candidate Identification
Step 1: For each keyword w, learn an expanded set of keywords Ew
and a ranking of Ew
Step 1.1: Which properties are useful in expressing semantic
similarity?
Get resource r for keyword using labelling properties
“movie” → fb:Movie
Get all triples where r is a subject
fb:Movie wn:containsWordSense dbpedia-owl:show
fb:Movie dc:description “movie director”
fb:Movie commontag:label “Film”
For each of the objects of the triples, find the labels and
tokenise them
“movie director” → “movie”, “director”
Find resources for them in the target vocabulary
“show” → dbpedia-owl:show
“Film” → dbpedia-owl:Film
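
Step 1.1 can be sketched over the example triples above. The helper names (`tokenise`, `find_candidates`), the regular-expression tokeniser and the dictionary-based lookups are assumptions for illustration; the actual system works against a much larger dataset.

```python
import re


def tokenise(text):
    """Split a label into lower-case word tokens (a simple assumed tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def find_candidates(resource, triples, labels, target_vocab):
    """Return {property: set of target-vocabulary resources reached through it}."""
    candidates = {}
    for subject, predicate, obj in triples:
        if subject != resource:
            continue
        # Resource objects are replaced by their label; literals are used as-is.
        text = labels.get(obj, obj)
        for token in tokenise(text):
            target = target_vocab.get(token)
            if target is not None:
                candidates.setdefault(predicate, set()).add(target)
    return candidates


# Triples taken from the slide's example.
TRIPLES = [
    ("fb:Movie", "wn:containsWordSense", "dbpedia-owl:show"),
    ("fb:Movie", "dc:description", "movie director"),
    ("fb:Movie", "commontag:label", "Film"),
]
LABELS = {"dbpedia-owl:show": "show"}  # label of the resource-valued object
TARGET_VOCAB = {"show": "dbpedia-owl:show", "film": "dbpedia-owl:Film"}

result = find_candidates("fb:Movie", TRIPLES, LABELS, TARGET_VOCAB)
print(result)
```

Here dc:description contributes no candidate because neither “movie” nor “director” has a label match in the toy target vocabulary.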

Method: Training
Step 1.2: How to learn a ranking for these properties?
Select a list of keywords, run step 1.1 for each of them
Manually choose the best resources among the candidates
“show”, “Film” → “Film”
Produce a precision measure for every property used to find
candidates
prec(p) = \frac{\sum_{w \in \vec{W}_{train}} hits(p, w)}{\sum_{w \in \vec{W}_{train}} candidates(p, w)}
Define threshold 0 ≤ θ ≤ 1 and use prec to select an ordered
(ranked) subset of properties that encode semantic relatedness
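
A minimal sketch of the training step, assuming prec(p) is the number of correct candidates found through property p divided by all candidates found through p, summed over the training keywords. The data structures, the function names and the inclusive threshold (prec ≥ θ) are assumptions.

```python
def property_precision(candidates, gold):
    """Compute prec(p) = sum_w hits(p, w) / sum_w candidates(p, w).

    candidates: {keyword: {property: set of candidate resources}}
    gold:       {keyword: set of manually chosen correct resources}
    """
    hits, total = {}, {}
    for keyword, by_property in candidates.items():
        correct = gold.get(keyword, set())
        for prop, resources in by_property.items():
            total[prop] = total.get(prop, 0) + len(resources)
            hits[prop] = hits.get(prop, 0) + len(resources & correct)
    return {p: hits[p] / total[p] for p in total if total[p] > 0}


def select_properties(prec, theta):
    """Keep properties with precision >= theta, ranked by decreasing precision."""
    kept = [(p, value) for p, value in prec.items() if value >= theta]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical training data for a single keyword.
CANDIDATES = {
    "movie": {
        "commontag:label": {"dbpedia-owl:Film"},
        "wn:containsWordSense": {"dbpedia-owl:show"},
        "dc:description": {"dbpedia-owl:director", "dbpedia-owl:Film"},
    }
}
GOLD = {"movie": {"dbpedia-owl:Film"}}

prec = property_precision(CANDIDATES, GOLD)
print(select_properties(prec, theta=0.5))
```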

Method: Test
For the test set of keywords \vec{W}_{test}, apply the algorithm to obtain
candidates
Use only the selected subset of properties, rather than all properties,
to find candidates
Combine, as a numerical product, the precision of the property p
used to find a candidate r and a tf-idf score for r
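
The scoring can be sketched as follows. The two precision values are taken from the training table in these slides; the tf-idf scores are made up, and taking the maximum when a resource is reached through several properties is an assumption, since the slides do not say how such cases are merged.

```python
def rank_candidates(found, prec, tfidf):
    """Rank candidate resources by prec(p) * tfidf(r).

    found: iterable of (resource, property used to find it)
    prec:  {property: precision} for the selected subset of properties
    tfidf: {resource: tf-idf score}
    """
    scores = {}
    for resource, prop in found:
        if prop not in prec:
            continue  # property was cut off by the threshold
        score = prec[prop] * tfidf.get(resource, 0.0)
        # Assumption: keep the best score per resource.
        scores[resource] = max(scores.get(resource, 0.0), score)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


PREC = {"commontag:label": 0.133, "wn20schema:containsWordSense": 0.138}
TFIDF = {"dbpedia-owl:Film": 0.9, "dbpedia-owl:show": 0.4}  # hypothetical scores
FOUND = [
    ("dbpedia-owl:show", "wn20schema:containsWordSense"),
    ("dbpedia-owl:Film", "commontag:label"),
    ("dbpedia-owl:Film", "ex:unselectedProperty"),  # hypothetical, not in PREC
]

ranked = rank_candidates(FOUND, PREC, TFIDF)
print(ranked[0][0])
```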

Evaluation: Gold standard and Metric
Used DBpedia ontology as the target vocabulary and Sindice
cache as dataset for computing expanded set of keywords Ew
Gold standard from Freitas et al. [2], contains 178 keywords of
which 134 have a representation in the DBpedia ontology
Corrected minor errors in the gold standard and re-evaluated the
approach by Freitas et al. [2]
Actual figures don’t change significantly
Use Mean Reciprocal Rank (MRR) which measures the quality
of the ranking by calculating the inverse rank of the best result
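
A short sketch of how MRR can be computed. The rankings and gold answers below are hypothetical; scoring a query with no correct result as 0 is the usual convention and is assumed here.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the first correct result over all queries.

    rankings: {query: ordered list of returned resources}
    gold:     {query: set of correct resources}
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        correct = gold.get(query, set())
        for position, resource in enumerate(ranked, start=1):
            if resource in correct:
                total += 1.0 / position
                break  # only the best-ranked correct result counts
    return total / len(rankings)


# Hypothetical example: correct at rank 2, correct at rank 1, and no correct result.
RANKINGS = {
    "movie": ["dbpedia-owl:show", "dbpedia-owl:Film"],
    "wife": ["dbpedia-owl:spouse"],
    "honda": ["dbpedia-owl:season"],
}
GOLD = {
    "movie": {"dbpedia-owl:Film"},
    "wife": {"dbpedia-owl:spouse"},
    "honda": {"dbpedia-owl:Automobile"},
}

mrr = mean_reciprocal_rank(RANKINGS, GOLD)
print(mrr)
```

For this toy input the reciprocal ranks are 1/2, 1 and 0, so the mean is 0.5.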

Evaluation: Approach
Manually create training set consisting of 40 keywords for
supervised training phase
Result of training phase: 194 candidate properties
Use a precision threshold of 0.045 to cut off the candidate
property list. This resulted in 23 properties.

Evaluation: Result of Training
Property                           Precision
foaf:name                          0.267
fb-common:topic                    0.189
rdfs:label                         0.187
dc:subject                         0.182
dc:title                           0.169
sindice:label                      0.168
rdfs:suBClassOf                    0.168
skos:prefLabel                     0.157
fb-type:object                     0.143
wn20schema:derivationallyRelated   0.143
wn20schema:containsWordSense       0.138
commontag:label                    0.133
owl:sameAs                         0.121
wn20schema:gloss                   0.1
opencyc:seeAlsoURI                 0.093
rdfs:seeAlso                       0.082
dbpedia-owl:abstract               0.0774
rdfs:comment                       0.0676
rdfs:range                         0.0667
rdfs:subClassOf                    0.0656
fb:documented object               0.0619
dbpedia-owl:wikiPageWikiLink       0.0487
dc:description                     0.0471
Table: Top 23 properties used and their precision

Evaluation: Example results of Test
Query: [spacecraft]     Query: [engine]               Query: [factory]
dbpedia-owl:Spacecraft  dbpedia-owl:engine            dbpedia-owl:manufacturer
dbpedia-owl:spacecraft  dbpedia-owl:gameEngine        dbpedia-owl:plant
dbpedia-owl:satellite   dbpedia-owl:Artwork           dbpedia-owl:Canal
dbpedia-owl:missions    dbpedia-owl:AutomobileEngine  dbpedia-owl:engine
dbpedia-owl:launches    dbpedia-owl:Locomotive        dbpedia-owl:class
dbpedia-owl:closed      dbpedia-owl:fuel              dbpedia-owl:Album
dbpedia-owl:vehicle     dbpedia-owl:added             dbpedia-owl:product
dbpedia-owl:Rocket      dbpedia-owl:Musical           dbpedia-owl:assembly

Query: [bass]           Query: [wife]                 Query: [honda]
dbpedia-owl:Fish        dbpedia-owl:spouse            dbpedia-owl:manufacturer
dbpedia-owl:Instrument  dbpedia-owl:Criminal          dbpedia-owl:discovered
dbpedia-owl:instrument  dbpedia-owl:person            dbpedia-owl:Asteroid
dbpedia-owl:voice       dbpedia-owl:status            dbpedia-owl:engine
dbpedia-owl:partner     dbpedia-owl:education         dbpedia-owl:season
dbpedia-owl:note        dbpedia-owl:Language          dbpedia-owl:vehicle
dbpedia-owl:Musical     dbpedia-owl:sex               dbpedia-owl:Automobile
dbpedia-owl:lowest      dbpedia-owl:family            dbpedia-owl:participant

Evaluation: Results of Test
Model                    MRR
ESA                      0.6
LOD Keyword Expansion    0.77
Table: Mean reciprocal rank (MRR)

Model                    Recall
String match             0.45
String match + WordNet   0.52
ESA                      0.87
LOD Keyword Expansion    0.90
Table: Percentage of queries answered (Recall)

Conclusion
Method for automatic query expansion for Linked Data
resources based on using properties between resources within
the Linked Open Data cloud
“film”, “flim”, “films”, “movie” → dbpedia-owl:Film
Evaluation showed how useful these different properties are for
finding semantic similarities and thereby finding expanded
keywords
Improvement of 0.17 in MRR (0.6 → 0.77, i.e. 17 percentage points) over the state of the art

Future Work
Related work [6] shows that the best results can be achieved by
following a multi-strategy approach (string similarity + WordNet
+ ESA), into which our approach could also be integrated
Adjust labelling properties (add the ones discovered in training)
Treat labelling properties separately from other properties
Fine-tune tokenisation of literals
Perform in vivo evaluation of the task, e.g. in the context of
question answering
Reevaluate for instances

Bibliography I
[1] Cheng, G., Ge, W., Qu, Y.: Falcons: searching and browsing
entities on the semantic web. In: Proceedings of the 17th
International Conference on World Wide Web, pp. 1101–1102.
ACM (2008)
[2] Freitas, A., Curry, E., Oliveira, J.G., O’Riain, S.: A distributional
structured semantic space for querying RDF graph data.
International Journal of Semantic Computing 5(04), 433–462
(2011)
[3] Lopez, V., Nikolov, A., Fernandez, M., Sabou, M., Uren, V.,
Motta, E.: Merging and ranking answers in the semantic web:
The wisdom of crowds. In: The Semantic Web, pp. 135–152 (2009)

Bibliography II
[4] Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Stenzhorn, H.,
Tummarello, G.: Sindice.com: a document-oriented lookup
index for open linked data. International Journal of Metadata,
Semantics and Ontologies 3(1), 37–52 (2008)
[5] Tran, T., Wang, H., Haase, P.: SearchWebDB: Data web search
on a pay-as-you-go integration infrastructure. Tech. rep.,
University of Karlsruhe (2008)
[6] Walter, S., Unger, C., Cimiano, P., Bär, D.: Evaluation of a
layered approach to question answering over linked data. In:
The Semantic Web–ISWC 2012, pp. 362–374. Springer (2012)
