1. Mapping Keywords to Linked Data Resources for Automatic Query Expansion
Isabelle Augenstein (1)
Anna Lisa Gentile (1)
Barry Norton (2)
Ziqi Zhang (1)
Fabio Ciravegna (1)
(1) Department of Computer Science, University of Sheffield, UK
(2) Ontotext, UK
{i.augenstein,a.l.gentile,z.zhang,f.ciravegna}@dcs.shef.ac.uk, barry.norton@ontotext.com
May 26, 2013
Augenstein, Gentile, Norton, Zhang, Ciravegna Query Expansion May 26, 2013 1 / 21
2. Motivation
To consume Linked Data, end users need to
be familiar with RDF’s data model and its query language, SPARQL
know the datasets and their contents
Keyword search is a means to overcome these barriers
Map keywords to RDF resources
“film” → dbpedia-owl:Film
3. Motivation
Challenges for keyword search:
Spelling mistakes, e.g. “flim”
Lexical derivations, e.g. “films”
Synonyms, e.g. “movie”
How to address these challenges? → query expansion
What is query expansion?
Process of expanding a seed query with additional query terms
to improve recall
“film”, “flim”, “films”, “movie” → dbpedia-owl:Film
4. State of the art: How to find mappings
String similarity (Sindice [4])
“flim”, “films” → dbpedia-owl:Film
Domain-independent approach
Problem: only covers spelling mistakes and lexically similar
words, not synonyms
Dictionary-based methods (WordNet [3], Wikipedia [2])
“movie” → dbpedia-owl:Film
Approach finds synonyms
Problem: Dictionaries contain limited, domain-specific
vocabulary; WordNet has very few named entities
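The string-similarity strategy can be illustrated with a minimal sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; this is not how Sindice works internally. The vocabulary dictionary, the function name `string_match` and the cut-off `min_ratio` are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Hypothetical target vocabulary (label -> resource), for illustration only.
VOCABULARY = {"film": "dbpedia-owl:Film", "show": "dbpedia-owl:show"}


def string_match(keyword, vocabulary=VOCABULARY, min_ratio=0.7):
    """Return (resource, ratio) pairs whose label is lexically close to keyword.

    min_ratio is an arbitrary illustrative cut-off, not a value from the slides.
    """
    matches = []
    for label, resource in vocabulary.items():
        ratio = SequenceMatcher(None, keyword.lower(), label).ratio()
        if ratio >= min_ratio:
            matches.append((resource, ratio))
    # Most similar label first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


print(string_match("flim"))   # misspelling: still close to "film"
print(string_match("films"))  # lexical derivation: still close to "film"
print(string_match("movie"))  # synonym: no lexically similar label
```

With this toy vocabulary the misspelling and the plural both reach dbpedia-owl:Film, while the synonym “movie” finds nothing, which is the gap that dictionary-based and dataset-based expansion try to close.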
5. State of the art: How to rank resources
String similarity
Rank by decreasing string similarity (SearchWebDB [5])
Rank by decreasing tf-idf (Falcons Object Search [1])
Dictionary-based methods (WordNet [3], Wikipedia [2])
Rank by decreasing semantic similarity (ESA [2])
Combine frequency and confidence (PowerAqua [3])
6. Our approach
How to find mappings
In order to have a highly adaptable, domain-independent
approach, use knowledge contained within the dataset
Benefit from properties between resources in Linked Data,
which are potentially useful for finding semantically similar
keywords
How to rank resources
Follow the intuition of state-of-the-art methods
Combine tf-idf and confidence of our measure
7. Method: Overview
Scenario: User wants to find representative Linked Data
resources for keyword w
Example: “movie” → dbpedia-owl:Film
Dataset: a set of triples, each consisting of a subject, a
predicate and an object
Current application: Find classes and properties
8. Method: Overview
Step 1: For each keyword w, learn an expanded set of
keywords Ew and a ranking of Ew to find semantically similar
words in the target vocabulary
Example: “movie” → “movie”, “film”, “films”
Step 2: Identify concepts through labelling properties
“movie” → ∅
“film” → dbpedia-owl:Film
“films” → ∅
Step 3: Rank resulting concepts and return concept with
highest rank
“movie” → dbpedia-owl:Film
9. Method: Overview
Set of labelling properties to identify concepts
rdfs:label
foaf:name
dc:title
skos:prefLabel
skos:altLabel
fb:type.object.name
Table: Labelling properties
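As a rough sketch of how labelling properties identify concepts (Step 2), the lookup below scans an in-memory list of triples for a label that equals the keyword. The toy triples, the function name `resources_for` and the `prefix` filter for the target vocabulary are assumptions; a real system would query a triple store instead.

```python
# Labelling properties as listed in the table above.
LABELLING_PROPERTIES = {
    "rdfs:label", "foaf:name", "dc:title",
    "skos:prefLabel", "skos:altLabel", "fb:type.object.name",
}

# Toy (subject, predicate, object) triples, for illustration only.
TRIPLES = [
    ("dbpedia-owl:Film", "rdfs:label", "film"),
    ("dbpedia-owl:Film", "rdfs:comment", "a movie"),
    ("fb:Movie", "fb:type.object.name", "Movie"),
]


def resources_for(keyword, triples=TRIPLES, prefix=""):
    """Resources whose label (via a labelling property) equals the keyword.

    prefix optionally restricts results to one vocabulary, e.g. "dbpedia-owl:".
    """
    wanted = keyword.strip().lower()
    return {
        s for s, p, o in triples
        if p in LABELLING_PROPERTIES
        and o.strip().lower() == wanted
        and s.startswith(prefix)
    }


print(resources_for("film", prefix="dbpedia-owl:"))   # finds dbpedia-owl:Film
print(resources_for("movie", prefix="dbpedia-owl:"))  # empty in the target vocabulary
print(resources_for("movie"))                         # found elsewhere in the dataset
```

Note that rdfs:comment is not a labelling property, so the comment “a movie” does not make dbpedia-owl:Film a match for “movie” at this step.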
10. Method: Candidate Identification
Step 1: For each keyword w, learn an expanded set of keywords Ew
and a ranking of Ew
Step 1.1: Which properties are useful in expressing semantic
similarity?
Get resource r for keyword using labelling properties
“movie” → fb:Movie
Get all triples where r is a subject
fb:Movie wn:containsWordSense dbpedia-owl:show
fb:Movie dc:description “movie director”
fb:Movie commontag:label “Film”
For each of the objects of the triples, find the labels and
tokenise them
“movie director” → “movie”, “director”
Find resources for them in the target vocabulary
“show” → dbpedia-owl:show
“Film” → dbpedia-owl:Film
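The candidate identification walk-through above can be sketched over the slide's own example triples. The label triples for fb:Movie, dbpedia-owl:show and dbpedia-owl:Film, the helper names, and the regular-expression tokeniser are assumptions added so the example runs on its own; the authors' implementation may differ.

```python
import re

LABELLING_PROPERTIES = {
    "rdfs:label", "foaf:name", "dc:title",
    "skos:prefLabel", "skos:altLabel", "fb:type.object.name",
}
TARGET_PREFIX = "dbpedia-owl:"  # target vocabulary (assumed prefix form)

# Example triples from the slide plus assumed label triples.
TRIPLES = [
    ("fb:Movie", "fb:type.object.name", "movie"),
    ("fb:Movie", "wn:containsWordSense", "dbpedia-owl:show"),
    ("fb:Movie", "dc:description", "movie director"),
    ("fb:Movie", "commontag:label", "Film"),
    ("dbpedia-owl:show", "rdfs:label", "show"),
    ("dbpedia-owl:Film", "rdfs:label", "film"),
]


def resources_with_label(token, triples, prefix=""):
    """Subjects carrying the token as a label, optionally within one vocabulary."""
    token = token.lower()
    return {
        s for s, p, o in triples
        if p in LABELLING_PROPERTIES and o.lower() == token and s.startswith(prefix)
    }


def labels_of(node, triples):
    """Labels of a resource; a literal (no labels found) acts as its own label."""
    found = [o for s, p, o in triples if s == node and p in LABELLING_PROPERTIES]
    return found or [node]


def candidates(keyword, triples=TRIPLES):
    """Map each property to the target-vocabulary resources it leads to."""
    found = {}
    for r in resources_with_label(keyword, triples):        # keyword -> resource r
        for s, p, o in triples:
            if s != r:                                       # triples with r as subject
                continue
            for label in labels_of(o, triples):              # labels of the object
                for token in re.findall(r"\w+", label.lower()):  # tokenise
                    hits = resources_with_label(token, triples, TARGET_PREFIX)
                    if hits:
                        found.setdefault(p, set()).update(hits)
    return found


print(candidates("movie"))
```

Recording which property produced each candidate is the design choice that matters here: the training step needs that pairing to judge how reliable each property is.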
11. Method: Training
Step 1.2: How to learn a ranking for these properties?
Select a list of keywords, run step 1.1 for each of them
Manually choose the best resources among the candidates
“show”, “Film” → “Film”
Produce a precision measure for every property used to find
candidates
$$\mathrm{prec}(p) = \frac{\sum_{w \in \vec{W}_{train}} \mathrm{hits}(p, w)}{\sum_{w \in \vec{W}_{train}} \mathrm{candidates}(p, w)}$$
Define a threshold 0 ≤ θ ≤ 1 and use prec to select an ordered
(ranked) subset of properties that encode semantic relatedness
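A minimal sketch of the precision measure and the threshold cut-off follows. It assumes hits(p, w) is the number of manually approved candidates that property p produced for keyword w and candidates(p, w) is the number of all candidates it produced; the counts in the example are invented, and keeping properties with precision greater than or equal to θ (rather than strictly greater) is also an assumption.

```python
def property_precision(hits, candidates):
    """prec(p) = sum over w of hits(p, w) / sum over w of candidates(p, w).

    Both arguments map property -> {keyword: count}.
    """
    precision = {}
    for prop, per_keyword in candidates.items():
        total = sum(per_keyword.values())
        if total:  # skip properties that never produced a candidate
            correct = sum(hits.get(prop, {}).get(w, 0) for w in per_keyword)
            precision[prop] = correct / total
    return precision


def select_properties(precision, theta):
    """Properties with precision >= theta, ordered from best to worst."""
    kept = [(p, v) for p, v in precision.items() if v >= theta]
    return sorted(kept, key=lambda pv: pv[1], reverse=True)


# Invented training counts for two properties and two keywords.
CANDIDATES = {
    "commontag:label": {"movie": 3, "actor": 1},
    "dc:description": {"movie": 10, "actor": 10},
}
HITS = {
    "commontag:label": {"movie": 1, "actor": 1},
    "dc:description": {"movie": 1},
}

prec = property_precision(HITS, CANDIDATES)
print(select_properties(prec, theta=0.1))
```

Here commontag:label gets 2 hits out of 4 candidates and dc:description 1 out of 20, so only the former survives a threshold of 0.1.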
12. Method: Test
Given a test set of keywords $\vec{W}_{test}$, apply the algorithm to obtain
candidates
Use only the selected subset of properties, rather than all
properties, to find candidates
Score each candidate resource r as the product of the precision of
the property p used to find it and a tf-idf score for r
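The scoring rule can be sketched as a simple product. The two precision values are taken from the training results table in this deck; the tf-idf scores are invented, since the slides do not say how tf-idf is computed for a resource, and taking the maximum when several properties lead to the same resource is an assumption.

```python
def rank_candidates(found, precision, tfidf):
    """Rank resources by prec(p) * tfidf(r).

    found is an iterable of (resource, property) pairs. If a resource is
    reached through several properties, the highest score is kept (assumption).
    """
    best = {}
    for resource, prop in found:
        score = precision[prop] * tfidf[resource]
        best[resource] = max(score, best.get(resource, 0.0))
    return sorted(best.items(), key=lambda rs: rs[1], reverse=True)


# Precision values from the training table; tf-idf values are invented.
PRECISION = {"commontag:label": 0.133, "wn20schema:containsWordSense": 0.138}
TFIDF = {"dbpedia-owl:Film": 0.9, "dbpedia-owl:show": 0.4}
FOUND = [
    ("dbpedia-owl:Film", "commontag:label"),
    ("dbpedia-owl:show", "wn20schema:containsWordSense"),
]

ranking = rank_candidates(FOUND, PRECISION, TFIDF)
print(ranking[0][0])  # the concept returned for the keyword
```

In this toy setting dbpedia-owl:Film ranks first even though its property has slightly lower precision, because the product lets a strong tf-idf score compensate.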
13. Evaluation: Gold standard and Metric
Used DBpedia ontology as the target vocabulary and Sindice
cache as dataset for computing expanded set of keywords Ew
Gold standard from Freitas et al. [2], contains 178 keywords of
which 134 have a representation in the DBpedia ontology
Corrected minor errors in the gold standard and re-evaluated the
approach by Freitas et al. [2]
Actual figures don’t change significantly
Use Mean Reciprocal Rank (MRR) which measures the quality
of the ranking by calculating the inverse rank of the best result
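Mean Reciprocal Rank can be computed as in the sketch below. It uses the standard definition (reciprocal of the rank of the first correct result, averaged over queries) and assumes that a query with no correct result contributes 0; the rankings and gold answers are invented.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1 / rank of the first correct result per keyword.

    rankings maps keyword -> ranked list of resources;
    gold maps keyword -> set of correct resources.
    Keywords without any correct result contribute 0 (assumption).
    """
    if not rankings:
        return 0.0
    total = 0.0
    for keyword, ranked in rankings.items():
        correct = gold.get(keyword, set())
        for position, resource in enumerate(ranked, start=1):
            if resource in correct:
                total += 1.0 / position
                break
    return total / len(rankings)


# Invented example: correct answer at rank 1, at rank 2, and not found.
RANKINGS = {
    "movie": ["dbpedia-owl:Film", "dbpedia-owl:show"],
    "actor": ["dbpedia-owl:show", "dbpedia-owl:Actor"],
    "flim": [],
}
GOLD = {
    "movie": {"dbpedia-owl:Film"},
    "actor": {"dbpedia-owl:Actor"},
    "flim": {"dbpedia-owl:Film"},
}

print(mean_reciprocal_rank(RANKINGS, GOLD))  # (1 + 1/2 + 0) / 3 = 0.5
```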
14. Evaluation: Approach
Manually create training set consisting of 40 keywords for
supervised training phase
Result of training phase: 194 candidate properties
Use a precision threshold of 0.045 to cut off the candidate
property list. This resulted in 23 properties.
15. Evaluation: Result of Training
Property                           Precision
foaf:name                          0.267
fb-common:topic                    0.189
rdfs:label                         0.187
dc:subject                         0.182
dc:title                           0.169
sindice:label                      0.168
rdfs:suBClassOf                    0.168
skos:prefLabel                     0.157
fb-type:object                     0.143
wn20schema:derivationallyRelated   0.143
wn20schema:containsWordSense       0.138
commontag:label                    0.133
owl:sameAs                         0.121
wn20schema:gloss                   0.1
opencyc:seeAlsoURI                 0.093
rdfs:seeAlso                       0.082
dbpedia-owl:abstract               0.0774
rdfs:comment                       0.0676
rdfs:range                         0.0667
rdfs:subClassOf                    0.0656
fb:documented object               0.0619
dbpedia-owl:wikiPageWikiLink       0.0487
dc:description                     0.0471
Table: Top 23 properties used and their precision
17. Evaluation: Results of Test
Model                    MRR
ESA                      0.6
LOD Keyword Expansion    0.77
Table: Mean reciprocal rank (MRR)
Model                    Recall
String match             0.45
String match + WordNet   0.52
ESA                      0.87
LOD Keyword Expansion    0.90
Table: Proportion of queries answered (Recall)
18. Conclusion
Method for automatic query expansion for Linked Data
resources based on using properties between resources within
the Linked Open Data cloud
“film”, “flim”, “films”, “movie” → dbpedia-owl:Film
Evaluation showed how useful these different properties are for
finding semantic similarities and thereby finding expanded
keywords
Improvement of 0.17 in MRR (0.6 → 0.77, i.e. 17 percentage points) over the state of the art
19. Future Work
Related work ([6]) shows that the best results can be achieved by
following a multi-strategy approach (string similarity + WordNet
+ ESA), into which our approach could also be integrated
Adjust labelling properties (add the ones discovered in training)
Treat labelling properties separately from other properties
Fine-tune tokenisation of literals
Perform in vivo evaluation of the task, e.g. in the context of
question answering
Re-evaluate for instances, not only classes and properties
20. Bibliography I
[1] Cheng, G., Ge, W., Qu, Y.: Falcons: searching and browsing
entities on the semantic web. In: Proceedings of the 17th
International Conference on World Wide Web. pp. 1101–1102.
ACM (2008)
[2] Freitas, A., Curry, E., Oliveira, J.G., O’Riain, S.: A distributional
structured semantic space for querying RDF graph data.
International Journal of Semantic Computing 5(04), 433–462
(2011)
[3] Lopez, V., Nikolov, A., Fernandez, M., Sabou, M., Uren, V.,
Motta, E.: Merging and ranking answers in the semantic web:
The wisdom of crowds. The Semantic Web pp. 135–152 (2009)
21. Bibliography II
[4] Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Stenzhorn, H.,
Tummarello, G.: Sindice.com: a document-oriented lookup
index for open linked data. International Journal of Metadata,
Semantics and Ontologies 3(1), 37–52 (2008)
[5] Tran, T., Wang, H., Haase, P.: SearchWebDB: Data web search
on a pay-as-you-go integration infrastructure. Technical report,
University of Karlsruhe (2008)
[6] Walter, S., Unger, C., Cimiano, P., Bär, D.: Evaluation of a
layered approach to question answering over linked data. In:
The Semantic Web–ISWC 2012, pp. 362–374. Springer (2012)