This document discusses research into automatically discovering strong relationships between entities in Linked Data using genetic programming. The researchers aim to learn a cost function that can guide uninformed searches over Linked Data to find the most promising relationship paths. They experiment with different topological and semantic features as inputs to genetic programming to learn cost functions. The best-performing cost functions incorporate features like namespace variety, conditional node degree, and topics. This suggests specific, well-described paths through entities of different types are indicators of strong relationships in Linked Data.
Learning to assess Linked Data relationships using Genetic Programming
1. Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta
Learning to Assess
Linked Data Relationships
Using Genetic Programming
@IlaTiddi
20.10.2016
15th International Semantic Web Conference (ISWC 2016)
2. Research Problem
Automatically discover what makes a strong relationship
between two entities in (the Web of) Linked Data.
• relationship : a semantic path between two entities
[Figure: graph with nodes ASongOfIceAndFire(novel), GoT and ASongOfIceAndFire(topic), connected by two dc:subject edges]
3. Research Problem
Automatically discover what makes a strong relationship
between two entities in (the Web of) Linked Data.
• relationship : a semantic path between two entities
• automatically : through graph search techniques
[Figure: example graph with nodes ASongOfIceAndFire(novel), GoT, GeorgeRRMartin, UnitedStates, Fantasy and ASongOfIceAndFire(topic); edges :author, :born, :airedIn and four dc:subject links]
4. Research Problem
Problem
• Entities/properties in a path might come from a number
of different, unknown data sources
Solution (the easy one)
• indexing & preprocessing of a portion of Linked Data
• a priori knowledge, computational resources
[Figure: example graph with nodes ASongOfIceAndFire(novel), GoT, GeorgeRRMartin, UnitedStates, Fantasy and ASongOfIceAndFire(topic); edges :author, :born, :airedIn and four dc:subject links]
5. Research Problem
Solution
• Find paths between entities through Link Traversal
• Incremental and agnostic graph exploration
• Perform uninformed (or blind) search over Linked Data
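A minimal sketch of link traversal, assuming a toy in-memory stand-in for the Web. The `dereference` helper, the `TOY_WEB` dictionary and the edge directions are hypothetical; on real Linked Data the lookup would be an HTTP dereference of each URI. The point is only that nodes become known when they are looked up, with no index built beforehand.

```python
from collections import deque

# Toy stand-in for the Web (assumption): URI -> list of (property, neighbour URI).
TOY_WEB = {
    "ASongOfIceAndFire(novel)": [(":author", "GeorgeRRMartin"), ("dc:subject", "Fantasy")],
    "GeorgeRRMartin": [(":born", "UnitedStates")],
    "GoT": [("dc:subject", "Fantasy"), (":airedIn", "UnitedStates")],
    "Fantasy": [],
    "UnitedStates": [],
}


def dereference(uri):
    """Hypothetical helper: return the links discovered when looking up `uri`."""
    return TOY_WEB.get(uri, [])


def traverse(start, max_depth):
    """Breadth-first link traversal: a node is only known once a lookup reveals it."""
    seen = {start}
    order = [start]
    frontier = deque([(start, 0)])
    while frontier:
        uri, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not look any further from this node
        for _prop, neighbour in dereference(uri):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                frontier.append((neighbour, depth + 1))
    return order
```

Calling `traverse("ASongOfIceAndFire(novel)", 2)` explores the graph incrementally, one lookup at a time, without knowing in advance which sources the nodes come from.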
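[Figure: link traversal starting from ASongOfIceAndFire(novel) and GoT; later animation frames add GeorgeRRMartin, Fantasy, ASongOfIceAndFire(topic) and UnitedStates with edges :author, :born, :airedIn and dc:subject]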
13. Research Hypothesis
Problem
Uninformed searches require a cost-function to explore the
graph following the most promising paths
Hypothesis
Linked Data information can drive a cost-function that
detects strong relationships between entities
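To illustrate how a cost function can steer a search towards promising paths, here is a small uniform-cost search with a pluggable edge cost. This is a sketch under assumptions: the graph layout, the `HUB_COST` values and the `hub_cost` function are invented for illustration and are not the function learnt in this work.

```python
import heapq

# Toy edges (assumption), loosely following the running example.
EDGES = [
    ("ASongOfIceAndFire(novel)", ":author", "GeorgeRRMartin"),
    ("GeorgeRRMartin", ":born", "UnitedStates"),
    ("GoT", ":airedIn", "UnitedStates"),
    ("ASongOfIceAndFire(novel)", "dc:subject", "Fantasy"),
    ("GoT", "dc:subject", "Fantasy"),
    ("ASongOfIceAndFire(novel)", "dc:subject", "ASongOfIceAndFire(topic)"),
    ("GoT", "dc:subject", "ASongOfIceAndFire(topic)"),
]

GRAPH = {}
for s, p, o in EDGES:
    GRAPH.setdefault(s, []).append((p, o))
    GRAPH.setdefault(o, []).append((p, s))  # treat links as traversable both ways

# Hypothetical costs: general "hub" nodes are expensive to pass through.
HUB_COST = {"UnitedStates": 50, "Fantasy": 10}


def hub_cost(_source, _prop, neighbour):
    return HUB_COST.get(neighbour, 1)


def cheapest_path(graph, source, target, edge_cost):
    """Uniform-cost search: always expand the cheapest partial path first."""
    queue = [(0.0, [source])]
    settled = {}
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already reached this node at least as cheaply
        settled[node] = cost
        for prop, neighbour in graph.get(node, []):
            if neighbour not in path:
                step = edge_cost(node, prop, neighbour)
                heapq.heappush(queue, (cost + step, path + [neighbour]))
    return None
```

With these made-up costs, `cheapest_path(GRAPH, "ASongOfIceAndFire(novel)", "GoT", hub_cost)` prefers the route through the specific topic node over the routes through Fantasy or UnitedStates.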
[Figure: example graph with nodes ASongOfIceAndFire(novel), GoT, GeorgeRRMartin, UnitedStates, Fantasy and ASongOfIceAndFire(topic); edges :author, :born, :airedIn and four dc:subject links]
14. Research Questions
What makes a path strong?
• Which topological or semantic features of nodes/edges?
✗ e.g. length of a path?
entities of different datasets are connected by many paths
of similar length
How can we use Linked Data to assess strong relationships?
• Which information do we need?
• Can we use structural features of the graph?
Challenges
• find topological/semantic features to detect strong relationships
• combine these features in a cost-function
• perform an effective blind search
15. Proposed Approach
• A set of topological/semantic characteristics of
the Linked Data graph
• a benchmark of human-evaluated relationship
paths
Identify the cost-function for a blind search that
best performs in ranking sets of alternative
relationship paths
Automatically learn a cost-function to detect strong
relationships between Linked Data entities using a
supervised method (Genetic Programming)
16. Proposed Approach
Genetic Programming: why?
• Flexible learning process
• Suitable for wide search spaces (such as Linked Data)
• Results assessed with a fitness (scores vs. functions)
• Human-understandable results
• Easy to integrate in a graph search
17. Genetic Programming
Programs (solutions for a problem)
• trees of primitives
• functions : internal nodes (mathematical or logical
operations)
• terminals : leaf nodes (constants or variables)
Fitness function (evaluation)
• how well the program solves the problem
Genetic operations (evolution)
• reproduction
• crossover from two parents
• mutation from one parent
Termination condition
• maximum number of evolutions
• a desired fitness
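The tree-of-primitives idea can be sketched as nested tuples: internal nodes name a function, leaves are variables or constants. The encoding and the "protected" division and logarithm (a common Genetic Programming convention) are assumptions made for this illustration; the example tree encodes the program log(min.cd)/(avg.cd + 87), which appears as run T2 in the results table of this deck.

```python
import math
import operator

# Function set (internal nodes): name -> (callable, arity).
# Protected variants avoid crashes on invalid inputs (assumption, not from the slides).
FUNCTIONS = {
    "add": (operator.add, 2),
    "mul": (operator.mul, 2),
    "div": (lambda a, b: a / b if b != 0 else 1.0, 2),
    "log": (lambda a: math.log(a) if a > 0 else 0.0, 1),
}


def evaluate(program, variables):
    """Evaluate a program: a nested tuple, a variable name, or a numeric constant."""
    if isinstance(program, tuple):
        name, *args = program
        func, arity = FUNCTIONS[name]
        assert len(args) == arity, f"{name} expects {arity} argument(s)"
        return func(*(evaluate(arg, variables) for arg in args))
    if isinstance(program, str):
        return variables[program]  # terminal: variable
    return program                 # terminal: constant


# log(min.cd) / (avg.cd + 87)
T2 = ("div", ("log", "min.cd"), ("add", "avg.cd", 87))
```

For instance, `evaluate(T2, {"min.cd": 5, "avg.cd": 12.5})` computes the score of one path from its two terminal values.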
18. Genetic Programming
Procedure
• Create random population of programs based on the primitives
• Evolve population until an ideal situation is met
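The procedure can be sketched as a small evolutionary loop. Everything specific here is an assumption chosen to keep the example short: the toy target x*x + 1 (standing in for the ranking task), the function and terminal sets, the population size, the operator probabilities and the depth limit.

```python
import random

FUNCS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
TERMINALS = ["x", 1, 2]
MAX_DEPTH = 6  # reject overly deep children to limit bloat


def random_tree(rng, max_depth):
    if max_depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCS))
    return (name, random_tree(rng, max_depth - 1), random_tree(rng, max_depth - 1))


def run(tree, x):
    if isinstance(tree, tuple):
        return FUNCS[tree[0]](run(tree[1], x), run(tree[2], x))
    return x if tree == "x" else tree


def error(tree):
    """Toy fitness (lower is better): distance from the target x*x + 1."""
    return sum(abs(run(tree, x) - (x * x + 1)) for x in range(-3, 4))


def depth(tree):
    return 1 + max(depth(tree[1]), depth(tree[2])) if isinstance(tree, tuple) else 0


def subtree_paths(tree, path=()):
    yield path
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from subtree_paths(tree[i], path + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)


def crossover(rng, mum, dad):
    """Swap a random subtree of `mum` for a random subtree of `dad`."""
    target = rng.choice(list(subtree_paths(mum)))
    donor = get(dad, rng.choice(list(subtree_paths(dad))))
    return replace(mum, target, donor)


def mutate(rng, parent):
    """Replace a random subtree of `parent` with a fresh random subtree."""
    target = rng.choice(list(subtree_paths(parent)))
    return replace(parent, target, random_tree(rng, 2))


def evolve(seed=0, size=30, generations=20):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(size)]
    history = []  # best error per generation
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        if history[-1] == 0:          # termination: desired fitness reached
            break
        parents = population[: size // 2]
        children = [population[0]]    # reproduction: copy the fittest unchanged
        while len(children) < size:
            mum, dad = rng.choice(parents), rng.choice(parents)
            child = crossover(rng, mum, dad) if rng.random() < 0.7 else mutate(rng, mum)
            if depth(child) <= MAX_DEPTH:
                children.append(child)
        population = children
    return min(population, key=error), history
```

Because the fittest program is always copied into the next generation, the best error recorded in `history` never gets worse from one generation to the next; whether it reaches zero depends on the seed and the settings.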
[Figure: pasta analogy for an evolving population (canned spaghetti, meatballs spaghetti, tomato sauced penne, tomato sauced spaghetti), with candidates marked ✗ or ✔]
19. Genetic Programming
Given
• a starting population of randomly generated cost-functions
• sets of alternative paths between two Linked Data entities,
ranked by humans
Determine how good each cost-function is in ranking paths
compared to the human evaluators
20. Genetic Programming
Primitives
Constant terminals
• Z= {0, 1000}
Aggregated terminals
• Topological edge weights
indegree, outdegree, constant weight
• Semantic edge weights
usage of namespaces, taxonomies, vocabularies
• Aggregators along the path
sum, avg, min, max
Functions (combining different information)
• Math operations
addition, multiplication, division, log
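As an illustration of the aggregated terminals, the sketch below turns per-edge weights along one path into named terminals such as min.in or sum.ou. The "aggregator.feature" naming mirrors the program notation used in the results table; the input layout and the example weights are assumptions.

```python
from statistics import mean

# Aggregators applied along the path, as listed on the slide.
AGGREGATORS = {"sum": sum, "avg": mean, "min": min, "max": max}


def aggregate_path(edge_weights):
    """Build terminals from per-edge weights.

    edge_weights maps a feature name (e.g. "in" for indegree) to a list
    holding one weight per edge of the path.
    """
    terminals = {}
    for feature, values in edge_weights.items():
        for agg_name, aggregate in AGGREGATORS.items():
            terminals[f"{agg_name}.{feature}"] = aggregate(values)
    return terminals


# Hypothetical weights for a three-edge path.
example = aggregate_path({"in": [3, 120, 7], "ou": [10, 4, 4]})
```

The resulting dictionary can then be passed as the variable bindings of a program, so that each candidate cost function sees the same path through a fixed set of terminals.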
21. Genetic Programming
Fitness
Normalised Discounted Cumulative Gain (nDCG)
• (IR) quality of rankings provided by search engines based on
the graded relevance of the returned documents
• how good is a program in ranking paths based on human ranks
• avg(nDCG) across the dataset
• length penalty
Genetic operations
• Reproduction
• Crossover
• Mutation
Learning
• Training set + test set
• Keep the fittest program of each run on the training set
• Test them on the test set (discard inconsistent ones)
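A minimal sketch of the nDCG-based fitness, assuming linear gains (another common variant uses 2^rel - 1; the slides do not say which was used). The length penalty is left out because its form is not given here, and the `fitness` signature is hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def fitness(score_path, judged_sets):
    """Average nDCG of a scoring function over sets of human-graded paths.

    judged_sets is a list of sets; each set is a list of (path, grade)
    pairs with grades from 0 (not relevant) to 2 (highly relevant).
    Higher scores are assumed to mean stronger paths.
    """
    total = 0.0
    for paths in judged_sets:
        ordered = sorted(paths, key=lambda item: score_path(item[0]), reverse=True)
        total += ndcg([grade for _path, grade in ordered])
    return total / len(judged_sets)
```

A program that orders every set exactly as the judges did gets a fitness of 1.0; the more often highly relevant paths are pushed down the ranking, the lower the value.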
22. Experiments
Dataset
Entities (random types from different sources)
• 12,630 events from Yago
• 8,185 people from the VIAF dataset
• 999 movies from the LMDB
• 1,174 countries/capitals from Geonames/ the UNESCO dataset
Paths (a set of possible paths between them)
• select a random pair
• bidirectional breadth-first search
Assessment
• 100 pairs (~10 possible paths per pair)
• 8 judges
• from (2) highly relevant to (0) not relevant
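The path collection step mentions bidirectional breadth-first search. The sketch below shows the general technique on an undirected toy graph: two frontiers grow from the two entities until they touch. The adjacency-dictionary format and the example graph are assumptions; this version returns one connecting path rather than the full set of alternative paths used in the experiments.

```python
from collections import deque


def bidirectional_bfs(graph, source, target):
    """Return one path connecting source and target, or None if there is none."""
    if source == target:
        return [source]
    parents_s, parents_t = {source: None}, {target: None}
    frontier_s, frontier_t = deque([source]), deque([target])
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        if len(frontier_s) <= len(frontier_t):
            meet = _expand(graph, frontier_s, parents_s, parents_t)
        else:
            meet = _expand(graph, frontier_t, parents_t, parents_s)
        if meet is not None:
            return _join(meet, parents_s, parents_t)
    return None


def _expand(graph, frontier, parents, other_parents):
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                if neighbour in other_parents:
                    return neighbour  # the two searches have met
                frontier.append(neighbour)
    return None


def _join(meet, parents_s, parents_t):
    left, node = [], meet
    while node is not None:           # walk back to the source
        left.append(node)
        node = parents_s[node]
    right, node = [], parents_t[meet]
    while node is not None:           # walk forward to the target
        right.append(node)
        node = parents_t[node]
    return left[::-1] + right


# Hypothetical undirected toy graph.
TOY = {
    "novel": ["author", "fantasy", "topic"],
    "author": ["novel", "country"],
    "country": ["author", "series"],
    "fantasy": ["novel", "series"],
    "topic": ["novel", "series"],
    "series": ["country", "fantasy", "topic"],
    "island": [],
}
```

Searching from both ends keeps each frontier small, which matters when every expansion is a remote lookup.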
[Figure: example path over the nodes lmdb:TheSkinGame, gn:United-Kingdom, gn:Europe, db:Dina-Korzun and viaf:Dina-Korzun, with edges foaf:based_near, gno:parentFeature, dbo:citizenship and owl:sameAs]
24. Experiments
Results
Lower performance for T-runs and N-runs
Recurrent terminals
• conditional degree (node degree depending on the RDF triple)
• namespace variety
• number of topic properties (dc:subject/skos:broader/foaf:primaryTopic)
Run   Best program                                   Fitness TR   Fitness TS
T1    log(log(min.cd × min.cd))/max.cd               0.79         0.79
T2    log(min.cd)/(avg.cd + 87)                      0.77         0.78
T3    min.cd × (min.cd/max.cd)                       0.78         0.72
N1    (log((max.ns/max.cd))/avg.ns) + min.ns         0.82         0.81
N2    ((min.dg/sum.cd)/sum.ou) + min.ns              0.79         0.77
N3    min.ns/(log(max.cd)/avg.ns)                    0.83         0.75
S1    min.ns + (sum.ns/log(log(sum.si)))             0.88         0.83
S2    min.ns + (min.cd/log(log(sum.si)))             0.88         0.86
S3    min.ns + (log(max.in)/log(log(sum.si)))        0.87         0.86
(TR = training set, TS = test set)
25. Experiments
Comparative evaluation
Best programs
• automatically learnt
vs. literature functions
• RECAP, RelFinder, Everything Is Connected Engine, Moore et al.
• ad-hoc / handcrafted information theoretical measures
26. Experiments
Which cost-function?
Interpretation
• pass through nodes with rich node descriptions
higher min_namespaces = higher path score
• not high level entities / few topic categories
few incoming topic categories = higher path score
• more specific entities (not hubs) for path with few topic categories
ratio conditional_degree / inTopicCategories
specific paths are privileged over general paths
min_namespaces + (min_conditionalDegree / log(log(sum_inTopicCategories)))
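The selected cost function can be written out directly. This is a sketch, not the authors' implementation: the function and argument names are hypothetical, base-10 logarithms are assumed because the speaker notes say the ratio turns negative when sum.si is below 10, and raising an error on undefined inputs is a choice made here, since the slides do not say how such cases are handled.

```python
import math


def path_score(min_namespaces, min_conditional_degree, sum_in_topic_categories):
    """min.ns + min.cd / log(log(sum.si)), with base-10 logs (assumption)."""
    if sum_in_topic_categories <= 1:
        # log10(log10(x)) is undefined for x <= 1.
        raise ValueError("sum_in_topic_categories must be greater than 1")
    denominator = math.log10(math.log10(sum_in_topic_categories))
    if denominator == 0:
        # Happens exactly when sum_in_topic_categories == 10.
        raise ValueError("log(log(sum_in_topic_categories)) is zero")
    return min_namespaces + min_conditional_degree / denominator
```

For example, `path_score(2, 3, 100)` adds a positive ratio to the namespace term, while `path_score(2, 3, 5)` subtracts from it, because log(log(5)) is negative in base 10.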
27. Conclusions
Contributions
A measure to detect strong relationships in Linked Data
can be integrated in uninformed searches over Linked Data
vs. indexing/pre-processing techniques
derived empirically through Genetic Programming
vs. domain-specific / handcrafted measures
what is important in Linked Data
topological features + little knowledge about the edge vocabulary
Future work
• Integrate the measure in the blind-search process
• Explore more characteristics
• Improve the measure
28. THANK YOU VERY MUCH
(AND DO NOT MESS UP WITH ITALIAN FOOD)
Questions?
IlaTiddi ilaria.tiddi@open.ac.uk
Editor's Notes
you need to know these datasets
computational efforts that are not necessarily required
LT which allows
this is equivalent to performing
to avoid inconclusive searches
there is a series of questions to be answered
and if so the challenges are
effective = com
a set of possible topological or semantic features of the nodes and edges in LD
fitting GP to our problem
combination of edge weighting functions
given this dataset
unweighted fitness on trainset/testset
RelFinder, Recap, Everything is connected Engine, Moore et al.
paths representing the strongest relationships
in very simple words
prioritises specific paths (e.g. a movie and a person are based in the same region) over more general paths (e.g. a movie and a person are based in the same country).
only specific entities (not hubs) for paths with a small number of topic categories (the ratio between min.cd and log(log(sum.si)) is negative if sum.si is lower than 10).
Dataset stability
Removal of entities from one data source at a time
S-runs programs remain consistent