Determining the semantic relatedness (i.e., the strength of a relation) of two resources in DBpedia (or other Linked Data sources) is a problem addressed by quite a few approaches in the recent past. However, there are no large-scale benchmark datasets for comparing such approaches, and it is an open problem to determine which of the approaches work better than others. Furthermore, large-scale datasets for training machine-learning-based approaches are not available. DBpediaNYD is a large-scale synthetic silver standard benchmark dataset which contains symmetric and asymmetric similarity values, obtained using a web search engine.
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia
1. DBpediaNYD – A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia
10/22/13
Heiko Paulheim
2. Motivation
• There are quite a few approaches to entity ranking/statement weighting on Linked Data
– and DBpedia in particular
• Examples:
– Franz et al. (2009) – Tensor Decomposition
– Meij et al. (2009) – Machine Learning
– Mirizzi et al. (2010) – Web Search Engines
– Mulay and Kumar (2011) – Machine Learning
– Hees et al. (2012) – Crowd Sourcing
– Nunes et al. (2012) – Social Network Analysis
3. Motivation
• However,
– none of those have been competitively evaluated
– none of those have been evaluated at large scale
• Evaluation with
– small private data sets
– user studies
• Approaches using Machine Learning
– require training data
– which is expensive to obtain
4. The Dataset
• Large-scale dataset (several thousand instances)
– statements with strengths
• Strength value: Normalized Google Distance (NGD)
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
• NGD has been shown to correlate with human strength associations
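As a minimal sketch, the Normalized Google Distance in its standard form (Cilibrasi and Vitányi) can be computed from the three quantities listed above. The function name `ngd` and the hit counts in the usage lines are hypothetical and chosen only for illustration. The logarithm base cancels out, so natural logarithms are used.

```python
import math


def ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Standard Normalized Google Distance.

    fx  -- number of search results containing x
    fy  -- number of search results containing y
    fxy -- number of search results containing both x and y
    m   -- number of pages in the search engine index

    NGD(x, y) = (max(log fx, log fy) - log fxy)
                / (log m - min(log fx, log fy))
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


# Hypothetical hit counts, for illustration only.
identical = ngd(1000, 1000, 1000, 10**6)   # terms always co-occur -> distance 0.0
forward = ngd(5000, 200, 150, 10**6)
backward = ngd(200, 5000, 150, 10**6)      # same value: NGD is symmetric
```

Swapping the two terms leaves the value unchanged, because only the maximum and minimum of the two individual counts enter the formula.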
5. The Dataset
• NGD is a symmetric value
– NYD dataset also contains asymmetric values
• Asymmetric Normalized Google Distance
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
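The sketch below shows one way a directional variant could be built from the same three quantities. It is an assumption, not necessarily the definition used for the dataset: it simply drops the max/min from the standard NGD, so that the value for x → y equals the symmetric NGD whenever x is the more frequent term. The function name `asymmetric_ngd` and all hit counts are hypothetical.

```python
import math


def asymmetric_ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Directional distance x -> y.

    ASSUMPTION: this form (standard NGD without max/min) is an illustrative
    guess, not a confirmed definition from the DBpediaNYD dataset.

    a(x -> y) = (log fx - log fxy) / (log m - log fy)
    """
    return (math.log(fx) - math.log(fxy)) / (math.log(m) - math.log(fy))


# Hypothetical hit counts: y is rare and almost always appears together with x.
fx, fy, fxy, m = 1000, 100, 90, 10**6

x_to_y = asymmetric_ngd(fx, fy, fxy, m)  # x is often mentioned without y
y_to_x = asymmetric_ngd(fy, fx, fxy, m)  # y is rarely mentioned without x
```

Under this assumed form, the rarer term that mostly appears alongside the other gets the smaller distance in its own direction, so the two directions can differ substantially.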
6. Constructing the Dataset
• We sampled 10,000 statements
– with DBpedia resources as subject and object (e.g., no type statements, no literals)
– with dbpedia or dbpprop predicate
• ...and computed symmetric/asymmetric NGD
– using the labels as search strings
– using Yahoo BOSS
7. The Dataset
• Random sample of 10,000 statements
– i.e., 30,000 search engine calls (at 0.80 USD per 1,000 calls → 24 USD)
• 3,058 pairs of resources had to be discarded
– f(x)<f(x,y) or f(y)<f(x,y)
– search engines sometimes don't count properly :-(
• Result:
– 6,942 weighted statements (symmetric)
– 13,884 weighted statements (asymmetric)
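The numbers on this slide can be retraced with a few lines of arithmetic. This is a sketch under the assumption that each statement needs three queries (for x, for y, and for both together) and that each retained pair yields two directional values. The helper `is_consistent` and the constant names are hypothetical.

```python
def is_consistent(fx: int, fy: int, fxy: int) -> bool:
    """A joint hit count can never legitimately exceed an individual one.

    Pairs with f(x) < f(x,y) or f(y) < f(x,y) are discarded.
    """
    return fxy <= fx and fxy <= fy


SAMPLED_STATEMENTS = 10_000
QUERIES_PER_STATEMENT = 3        # assumed: f(x), f(y), f(x,y)
PRICE_CENTS_PER_1000_CALLS = 80  # 0.80 USD per 1,000 calls
DISCARDED_PAIRS = 3_058

calls = SAMPLED_STATEMENTS * QUERIES_PER_STATEMENT
cost_usd = calls * PRICE_CENTS_PER_1000_CALLS / 1000 / 100

symmetric_statements = SAMPLED_STATEMENTS - DISCARDED_PAIRS
asymmetric_statements = 2 * symmetric_statements  # one value per direction

print(calls, cost_usd, symmetric_statements, asymmetric_statements)
```

The consistency check only catches counts that are impossible; it cannot detect hit counts that are plausible but still inaccurate.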
8. The Dataset
• Example:
– dbpedia:John_Lennon and dbpedia:Yoko_Ono
• Distances:
– symmetric: 0.18
– John Lennon → Yoko Ono: 0.18
– Yoko Ono → John Lennon: 0.03
• Explanation:
– Yoko Ono is famous for being John Lennon's wife
  • and most often mentioned in that context
– John Lennon is more famous for being a member of the Beatles
9. Example: the DBpedia FindRelated Service
• We trained two regression SVMs (LibSVM) based on DBpediaNYD
– one for symmetric, one for asymmetric
– service allows for finding the most related among the linked resources
• Example results: [figure omitted]
• http://wiki.dbpedia.org/FindRelated
10. Conclusion and Outlook
• DBpediaNYD allows for large scale evaluation
– rather a silver standard
– does not replace manually created gold standards
• Future work
– validate DBpediaNYD with users
– compare search engines
11. Something Completely Different
• Challenges enumerated in the workshop intro this morning
– “Logical inference on noisy data”
• Talk on “Type Inference on Noisy RDF Data”
– Was actually applied for DBpedia 3.9
– Friday, 3:15, Bayside 204A
12. DBpediaNYD – A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia