This document discusses defining similarity on the DBpedia knowledge graph. It provides context on similarity as a concept and outlines challenges in defining it for a heterogeneous graph like DBpedia, which contains nodes of different types connected by various relation types. Past approaches are noted to not fully leverage DBpedia's link structure. The document suggests network analysis methods like counting node-disjoint paths could help define similarity and provides an example of films linked in DBpedia. Ongoing work is noted to analyze DBpedia as a network to complement reasoning and inform tasks like recommendation. Challenges discussed include applying social network measures to DBpedia and evaluating proposed similarity techniques.
6. DBpedia Graph
Films are nodes in the DBpedia graph.
Some things about DBpedia:
Big, rich, dense Knowledge Base
→ 3.77m nodes, 400m edges (EN)
Lots of prior work (as we shall see...)
But very heterogeneous - vocabularies, categories
It is a graph
7. Similarity in general
Cognitive science and psychology: Tversky (1977) - feature-based (featural) similarity.
E.g. film: genre, language, director
Modelling of human thought and semantic relations: how do we relate things to each other? (Collins & Quillian, 1969)
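A featural comparison of this kind can be sketched with Tversky's ratio model, which scores two feature sets by their shared features relative to their distinctive ones. The feature strings and film names below are hypothetical placeholders, not DBpedia data, and the weights `alpha` and `beta` are illustrative defaults.

```python
def tversky_index(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio model: shared features vs. weighted distinctive features."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical feature sets (genre, language, director), for illustration only.
film_a = {"genre:animation", "language:ja", "director:X"}
film_b = {"genre:animation", "language:ja", "director:Y"}
film_c = {"genre:action", "language:en", "director:Z"}

print(tversky_index(film_a, film_b))  # shares genre and language
print(tversky_index(film_a, film_c))  # shares nothing
```

With `alpha = beta = 1` the score reduces to the Jaccard coefficient, and with `alpha = beta = 0.5` to the Dice coefficient; unequal weights make the measure asymmetric.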
8. Semantic
The notion of semantic networks is derived from the hierarchical
semantic memory model [Collins & Quillian, 1969]
10. Semantic Similarity
Different techniques:
Word frequency: Latent Semantic Analysis (doesn’t actually use the semantic net structure)
Rada (1989) - average shortest path length
Resnik (1999) - information content of the least common subsumer (LCS)
Unfortunately...
Word frequency: not applicable (N/A)
These often assume a hierarchical/tree structure of the taxonomy/ontology (both Rada and Resnik assume the taxonomy is an is-a hierarchy)
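To make the tree assumption concrete, here is a minimal sketch of both ideas on a toy is-a hierarchy. The taxonomy, the instance counts, and all function names are hypothetical; the path measure is the plain pairwise path length through the LCS, a simplification of Rada's averaged distance.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from DBpedia.
PARENT = {
    "Film": "Work",
    "Book": "Work",
    "AnimatedFilm": "Film",
    "LiveActionFilm": "Film",
}
# Hypothetical number of instances attached directly to each concept.
DIRECT_COUNT = {"Work": 0, "Film": 10, "Book": 50,
                "AnimatedFilm": 20, "LiveActionFilm": 20}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lcs(a, b):
    """Least common subsumer: deepest concept subsuming both a and b."""
    anc_a = ancestors(a)
    for c in ancestors(b):
        if c in anc_a:
            return c
    raise ValueError("no common subsumer")


def path_length(a, b):
    """Number of is-a edges on the path from a to b through their LCS."""
    c = lcs(a, b)
    return ancestors(a).index(c) + ancestors(b).index(c)


def subsumed_count(c):
    """Instances attached to c or to any of its descendants."""
    return sum(n for k, n in DIRECT_COUNT.items() if c in ancestors(k))


def resnik(a, b):
    """Information content (-log p) of the LCS of a and b."""
    p = subsumed_count(lcs(a, b)) / subsumed_count("Work")
    return -math.log(p)


print(path_length("AnimatedFilm", "LiveActionFilm"))
print(resnik("AnimatedFilm", "LiveActionFilm"))
```

Both functions rely on every concept having a single parent chain to the root, which is exactly what a graph with many relation types does not provide.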
16. On DBpedia/Wikipedia
Recent applications:
Gabrilovich & Markovitch (2007) - express text as a weighted
vector of Wikipedia articles, Explicit Semantic Analysis (ESA)
Milne & Witten (2008) - the Wikipedia Link-based Measure - similarity of neighbours
Passant (2010) - Linked Data Semantic Distance ← uses paths!
Mirizzi et al. (2012) use DBpedia for movie recommendation with a Vector Space Model
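As an illustration of a neighbour-based measure, the sketch below follows the commonly stated form of the Wikipedia Link-based Measure, which compares the sets of articles linking to two pages, normalised by the total number of articles. The function name, the clipping to [0, 1], and the edge-case handling are assumptions made for this example.

```python
import math


def wlm_relatedness(links_a, links_b, total_articles):
    """Link-overlap relatedness in [0, 1] from two sets of linking articles."""
    a, b = set(links_a), set(links_b)
    common = a & b
    if not common:
        return 0.0  # no shared neighbours: treat as unrelated
    num = math.log(max(len(a), len(b))) - math.log(len(common))
    den = math.log(total_articles) - math.log(min(len(a), len(b)))
    if den <= 0:
        # Assumed edge case: the smaller set already covers every article.
        return 1.0 if num == 0 else 0.0
    return max(0.0, 1.0 - num / den)


# Hypothetical article ids, for illustration only.
print(wlm_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=100))
```

The measure only looks one hop away from each article, which is why it says nothing about longer paths between the two.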
17. Similarity
Important:
Properties can be related to each other
[Figure: nodes of different types (e.g. film, director) connected by properties of different types (e.g. influenced, collaborated with)]
18. Network Similarity
Social Network Analysis
Established field - notions of influence, centrality, rank etc.
Often applied to small networks
Note: Ranking is often based on similarity
24. Network Similarity
Applicability to DBpedia:
PageRank, SimRank - N/A - they assume homogeneous links!
Spreading Activation - possible with constraints
Apply PathSim - but how to learn such meta-paths?
Another idea:
Count node-disjoint paths.
Why? View each path as one distinct ‘reason’.
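Counting node-disjoint paths can be sketched as a unit-capacity maximum flow after splitting every node into an "in" and an "out" half (Menger's theorem). This is only one possible realisation: it treats the graph as undirected and ignores edge types, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict, deque


def count_node_disjoint_paths(edges, source, target):
    """Max number of source-target paths that share no intermediate node."""
    if source == target:
        raise ValueError("source and target must differ")
    cap = defaultdict(int)   # residual capacity per arc
    adj = defaultdict(set)   # residual adjacency (forward and reverse arcs)

    def add_arc(u, v, c):
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)

    nodes = {n for e in edges for n in e}
    for n in nodes:
        # Capacity 1 on the in->out arc means a node is used at most once.
        c = len(nodes) if n in (source, target) else 1
        add_arc((n, "in"), (n, "out"), c)
    # Deduplicate edges and drop self-loops; treat edges as undirected.
    for e in {frozenset(e) for e in edges if e[0] != e[1]}:
        u, v = tuple(e)
        add_arc((u, "out"), (v, "in"), 1)
        add_arc((v, "out"), (u, "in"), 1)

    s, t = (source, "out"), (target, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:  # push one unit back along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


# Hypothetical toy graph: the route via Actor1 must reuse Genre1.
toy_edges = [
    ("FilmA", "Genre1"), ("Genre1", "FilmB"),
    ("FilmA", "Director1"), ("Director1", "FilmB"),
    ("FilmA", "Actor1"), ("Actor1", "Genre1"),
]
print(count_node_disjoint_paths(toy_edges, "FilmA", "FilmB"))
```

In the toy graph three routes connect the two films, but only two of them are independent, so the count reflects two distinct "reasons" rather than three.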
30. Summary
Similarity, useful concept in many areas, hard to define
how are films similar?
DBpedia, richly linked KB
film information available here
→ Problem: How to define similarity on DBpedia?
Past methods - don’t exploit linkedness
Network analysis methods can aid this
Test trial with node-disjoint paths: GITS (Ghost in the Shell) is more similar to The Matrix than to Totoro
32. Ongoing/Future Work
Mining DBpedia as Network
Analyse structured and related data
Similarity as a complement to reasoning, retrieval, and querying
Also useful in NLP, recommender systems, and knowledge discovery
→ Examples: work we do in UIMR
38. Challenges/Discussion
Challenges:
Topology of DBpedia graph
Standard SNA measures for homogeneous networks, e.g.
density, degree distribution - how to apply to DBpedia?
What does a path actually mean?
Which subgraphs to use?
How do metrics vary with different subgraphs, e.g. different ontologies/categories?
Scalability (not a problem, but a challenge)
Evaluation - how do we confirm something is similar?
Thanks for listening! Questions/Suggestions?