This document describes the SORAL system for online relation alignment of linked datasets. SORAL generates candidate relation alignment pairs and then uses a supervised model to classify the pairs as correct or incorrect alignments. It samples entity instances from source dataset relations to issue queries to the target dataset. Features used for classification include inductive logic programming metrics, type similarity, functionality, and likelihood scores. Evaluation shows SORAL achieves better performance and efficiency than baseline approaches for relation alignment across datasets like YAGO, DBpedia and Freebase.
1. AIFB-KIT , FIZ-Karlsruhe Leibniz Institute1 30.05.17
KIT – The Research University in the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB), KIT – FIZ Karlsruhe Leibniz Institute
www.kit.edu
Online Relation Alignment for Linked Datasets
Maria Koutraki1, Nicoleta Preda2, Dan Vodislav3
1AIFB, FIZ-Karlsruhe, 2University of Paris-Saclay, 3University of Cergy-Pontoise
Motivation I
Dr. Maria Koutraki
Q1: What are the albums of Adele?
Q2: Where was Adele born?
Q3: What is the education Adele received?
Incomplete info!! Search elsewhere…
Relations needed to answer each query, per dataset:
Query  DBpedia     YAGO           Freebase
Q1     artist      created⁻       artist.album
Q2     birthPlace  wasBornIn      person.place_of_birth
Q3     education   graduatedFrom  person.education
Motivation II
Thousands of linked open datasets (Linked Open Data cloud)
Linked Data are inherently diverse:
Domain
Languages
Structure etc.
Different means and publishing mechanisms (RDF dumps, SPARQL)
LOD alignment is centralized towards KBs like DBpedia
LOD datasets usually align at instance level
Difficult to query and harness the complementary nature of LOD
Classes/Relations remain mostly unaligned
Motivation III
Diverse schemas for representation in LOD
• ~600 schemas/vocabularies used for representation
• Diverse quality of schemas [1]
• Duplicate representation of similar concepts/classes and relations
• Lack of explicit alignment between classes/relations (with only up to 2%) [2]
→ disintegrated dataset landscape
[1] Aimilia Magkanaraki, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis: Benchmarking RDF Schemas for the Semantic Web. International Semantic Web Conference 2002: 132-146
[2] Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. Semantic Web Conference (1) 2014: 245-260
Mining Alignments: State of the Art
Alignment at different granularities:
Instances (owl:sameAs links) [1,2,3]
Classes
Relations
Different “Alignment” discovery problems:
Similarity between relations (vague)
Equivalence
Subsumption (very few works addressed it) [4, 5]
[1] C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked data on the Web. In WWW, 2008.
[2] C. Bohm, G. de Melo, F. Naumann, and G. Weikum. Linda: distributed web-of-data-scale entity matching. In CIKM, 2012.
[3] S. Lacoste-Julien, K. Palla, A. Davies, G. Kasneci, T. Graepel, and Z. Ghahramani. Sigma: Simple greedy matching for aligning large knowledge bases. In KDD, 2013.
[4] Fabian M. Suchanek, Serge Abiteboul, Pierre Senellart: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. VLDB, 2012.
[5] Luis Galárraga, Nicoleta Preda, Fabian M. Suchanek: Mining Rules to Align Knowledge Bases. CIKM, 2013.
Objective: Relation Alignment for Linked Datasets
Given a source and a target RDF dataset
The entities of the two KBs are aligned via sameAs links
Goal: Compute alignments for relations (one-to-one):
Subsumption
Equivalence
[Figure: two RDF datasets whose entities are linked by owl:sameAs]
Approach: Online Relation Alignment
Instance-based
Query-based: Align KBs published by SPARQL endpoints
Sample a minimal set of entities to perform the alignment process
Supervised Model (features computed on KB instances)
[Figure: two RDF datasets, each published through a SPARQL endpoint, linked by owl:sameAs]
Overview – SORAL Architecture
Input:
1. Source KB
2. Target KB
SORAL:
1. Candidates Generation
2. Supervised Model
Output: Alignments of Relations
SORAL: Candidates Generation
Query KBS for entity pairs (x, y) that are instances of the source relation rS
Map each pair to the target KB through owl:sameAs links: (x, y) ≡ (x’, y’)
Query KBT for the relations rT that hold between (x’, y’)
Candidates for alignment: {rS ⊆ rT1, rS ⊆ rT2, rS ⊆ rT3, … }
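The candidate generation procedure can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: both KBs are small in-memory sets of (subject, relation, object) triples instead of SPARQL endpoints, the owl:sameAs links are a plain dictionary, and all identifiers (`generate_candidates`, `s:Rodin`, `author`, and so on) are hypothetical.

```python
from collections import Counter

def generate_candidates(kb_s, kb_t, same_as, r_s):
    """Count, per target relation rT, how many rS pairs it also connects."""
    # Index the target KB by (subject, object) so the lookup is a dict access.
    relations_by_pair = {}
    for subj, rel, obj in kb_t:
        relations_by_pair.setdefault((subj, obj), set()).add(rel)

    candidates = Counter()
    for subj, rel, obj in kb_s:
        if rel != r_s:
            continue  # keep only the instances (x, y) of rS
        if subj not in same_as or obj not in same_as:
            continue  # no owl:sameAs image, so the pair cannot be checked
        pair_t = (same_as[subj], same_as[obj])  # (x', y') in the target KB
        for r_t in relations_by_pair.get(pair_t, ()):
            candidates[r_t] += 1  # rT holds between (x', y')
    return candidates

# Hypothetical toy data; all identifiers are made up for illustration.
kb_s = {("s:Rodin", "sculptured", "s:Thinker"),
        ("s:Rodin", "sculptured", "s:Kiss")}
kb_t = {("t:Rodin", "created", "t:Thinker"),
        ("t:Rodin", "author", "t:Thinker"),
        ("t:Rodin", "created", "t:Kiss")}
same_as = {"s:Rodin": "t:Rodin", "s:Thinker": "t:Thinker", "s:Kiss": "t:Kiss"}

candidates = generate_candidates(kb_s, kb_t, same_as, "sculptured")
print(candidates)
```

Every key of the returned counter is one candidate subsumption rS ⊆ rT; the counts are the raw overlap that the features of the supervised model build on.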
SORAL: Efficiency Issues
Challenges
Bandwidth
Time-out at SPARQL endpoints
Approach
Reduce data transfers
Retrieve a subset of instances for rS
Solution: sample a minimal subset of instances
First-N
Random
Stratified
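The three sampling strategies can be sketched as follows. The slides do not spell out how the strata are built, so this sketch assumes each instance pair is assigned to the type of its subject at some chosen level of the type hierarchy and that every stratum gets an equal share of the sample; the function names and the `type_of` mapping are hypothetical.

```python
import random
from collections import defaultdict

def first_n(pairs, n):
    """First-N: take the first n instance pairs as returned by the endpoint."""
    return pairs[:n]

def random_sample(pairs, n, seed=0):
    """Random: draw up to n pairs uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))

def stratified_sample(pairs, type_of, n, seed=0):
    """Stratified: group pairs by subject type, then draw from each group.

    Assumption: equal allocation per stratum (at least one pair each).
    """
    if not pairs:
        return []
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        strata[type_of[pair[0]]].append(pair)
    per_stratum = max(1, n // len(strata))
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample[:n]

# Hypothetical instances of one relation and the types of their subjects.
pairs = [("a1", "w1"), ("a2", "w2"), ("a3", "w3"), ("p1", "w4")]
type_of = {"a1": "Artist", "a2": "Artist", "a3": "Artist", "p1": "Politician"}
sample = stratified_sample(pairs, type_of, 2)
print(sample)
```

With First-N or random sampling the single `Politician` instance is easily missed; stratification keeps rare types represented in a small sample.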
[Figure: type hierarchy owl:Thing → Person → Artist, with instances x1, x2, … attached to the types]
SORAL: Supervised Model
Classify the candidate alignments as correct or incorrect
Use features as “matchers”
Feature groups:
Inductive Logic Programming (ILP)
General Statistics (GS)
Lexical
Features – ILP: CWA
Closed world assumption (cwa): “for a relation r the KB contains all the facts.”
$$\mathrm{cwaconf}(r_S \subseteq r_T) = \frac{\mathrm{overlap}(r_S, r_T)}{|r_S|}$$
Example (rS: sculptured in KBS, rT: created in KBT): overlap = 2 and |rS| = 5, so cwaconf = 2/5.
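A minimal Python sketch of the cwa confidence follows. It assumes both relations are given as sets of (subject, object) pairs that were already mapped to shared identifiers through owl:sameAs; the toy facts are hypothetical and chosen only to reproduce the 2/5 of the example.

```python
def cwa_conf(r_s, r_t):
    """overlap(rS, rT) / |rS| under the closed world assumption."""
    if not r_s:
        return 0.0
    overlap = len(r_s & r_t)  # facts of rS that also hold for rT
    return overlap / len(r_s)

# Hypothetical facts: 5 pairs for sculptured, 2 of them also in created.
sculptured = {("a1", "w1"), ("a2", "w2"), ("a3", "w3"),
              ("a4", "w4"), ("a5", "w5")}
created = {("a1", "w1"), ("a2", "w2"), ("a3", "w9")}

print(cwa_conf(sculptured, created))  # 0.4
```

Every rS fact missing from rT lowers the score, even when the target KB is merely incomplete for that subject.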
Features – ILP: PCA [Galárraga et al., WWW 2013]
Partial completeness assumption (pca): “for a subject x and relation r, the KB contains either all or none of the facts.”
$$\mathrm{pcaconf}(r_S \subseteq r_T) = \frac{\mathrm{overlap}(r_S, r_T)}{\mathrm{overlap}(r_S, r_T) + \mathrm{counter}(r_S, r_T)}$$
Example (rS: sculptured in KBS, rT: created in KBT): overlap = 2 and counter = 1, so pcaconf = 2/(2+1).
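The pca confidence can be sketched in the same style. The sketch assumes the usual reading of the partial completeness assumption: an rS fact (x, y) counts as a counterexample only if the target KB knows some rT fact for the subject x but not (x, y) itself. The toy facts are hypothetical and chosen to reproduce the 2/(2+1) of the example.

```python
def pca_conf(r_s, r_t):
    """overlap / (overlap + counter) under the partial completeness assumption."""
    subjects_in_t = {x for x, _ in r_t}  # subjects with at least one rT fact
    overlap = len(r_s & r_t)
    # Counterexamples: x is known to rT, yet (x, y) is not among its facts.
    counter = sum(1 for (x, y) in r_s
                  if (x, y) not in r_t and x in subjects_in_t)
    total = overlap + counter
    return overlap / total if total else 0.0

# Hypothetical facts: 2 shared pairs, 1 counterexample (a3), 2 unknown subjects.
sculptured = {("a1", "w1"), ("a2", "w2"), ("a3", "w3"),
              ("a4", "w4"), ("a5", "w5")}
created = {("a1", "w1"), ("a2", "w2"), ("a3", "w9")}

print(pca_conf(sculptured, created))
```

The subjects a4 and a5 have no `created` facts at all, so they are ignored instead of being counted against the alignment, which is why the score is higher than under the closed world assumption.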
Features – Relation Functionality
Functionality: “A relation r(x,y) is called functional if for each x there is at most one y.”
If rS is subsumed by rT, the functionality should be higher.
Target relations should have better coverage of facts.
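The slide gives only the strict definition of a functional relation. A graded score that is common in the alignment literature (for instance in PARIS) divides the number of distinct subjects by the number of facts; using that ratio as the feature value here is an assumption, and the relation names and facts below are hypothetical.

```python
def functionality(pairs):
    """Distinct subjects divided by facts; 1.0 means strictly functional."""
    if not pairs:
        return 0.0
    subjects = {x for x, _ in pairs}
    return len(subjects) / len(pairs)

# Hypothetical facts as sets of (subject, object) pairs.
born_in = {("adele", "london"), ("rodin", "paris")}
created = {("rodin", "thinker"), ("rodin", "kiss"), ("adele", "25")}

print(functionality(born_in))  # 1.0
print(functionality(created))
```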
Features – ILP: PIA
Partial completeness assumption (pca):
Performs well for functional relations
Penalizes non-functional relations
Proposed: partial incompleteness assumption (pia)
The more important a counterexample is, the more it should count!
Features – GS: Type similarity
Type distribution similarity between relations rS and rT.
Weighted Jaccard similarity metric to assess whether the two relations have a similar structure in terms of types.
High similarity is a good indicator of equivalence/subsumption between relations.
Example (high similarity):
rS: hasCreator   Book 20%, Movie 30%, …
rT: hasWriter    Book 30%, Movie 20%, …
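A short sketch of the weighted Jaccard similarity between two type distributions: the sum of the per-type minima divided by the sum of the per-type maxima. The example distributions are hypothetical; the `Other` bucket stands in for the types elided with "…" on the slide so that each distribution sums to 1.

```python
def weighted_jaccard(p, q):
    """Sum of per-type minima over sum of per-type maxima."""
    types = set(p) | set(q)
    num = sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in types)
    den = sum(max(p.get(t, 0.0), q.get(t, 0.0)) for t in types)
    return num / den if den else 0.0

# Hypothetical type distributions of the instances of each relation.
has_creator = {"Book": 0.2, "Movie": 0.3, "Other": 0.5}
has_writer = {"Book": 0.3, "Movie": 0.2, "Other": 0.5}

similarity = weighted_jaccard(has_creator, has_writer)
print(round(similarity, 3))
```

The score is 1 for identical distributions and 0 when the two relations share no type at all.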
Features – GS: Type dissimilarity
Dissimilarity as the ratio of types in rS that do not exist in rT.
Example (high dissimilarity):
rS: hasCreator   Book 20%, Movie 30%, Paintings 50%, …
rT: hasWriter    Book 30%, Movie 20%, Song 5%, …
For missing types we can accurately assess that rT does not subsume rS.
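The dissimilarity ratio can be sketched as a set difference. The slide does not say whether types are weighted by their share of instances, so this sketch assumes the simplest reading, an unweighted count of the rS types that never occur for rT; the type sets below borrow the type names of the example and are otherwise hypothetical.

```python
def type_dissimilarity(types_s, types_t):
    """Fraction of the types seen for rS that never occur for rT."""
    types_s, types_t = set(types_s), set(types_t)
    if not types_s:
        return 0.0
    return len(types_s - types_t) / len(types_s)

# Hypothetical type sets for a source and a target relation.
source_types = {"Book", "Movie", "Paintings"}
target_types = {"Book", "Movie", "Song"}

dissimilarity = type_dissimilarity(source_types, target_types)
print(round(dissimilarity, 3))
```

The measure is asymmetric on purpose: types that occur only for rT do not speak against rS ⊆ rT, whereas types that occur only for rS do.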
Experimental Setup
Knowledge Bases:
YAGO, DBpedia, Freebase (e.g. YAGO → DBpedia)
Relations:
KB          YAGO  DBpedia  Freebase
#relations  36    563      1666
Baselines:
cwa (used in PARIS) [Suchanek et al., VLDB 2012]
pca (used in ROSA) [Galárraga et al., CIKM 2013]
SORAL: Logistic Regression (any other supervised model can be applied)
Ground Truth: Manually constructed by expert annotators based on:
Relation label
Relation instances
Evaluation Results: Performance
Full Data: Comparison of the different models and competitors
KBS       KBT       | ROSA (pca_th = 0.3) | PARIS (cwa_th = 0.1) | SORAL
                    | P    R    F1        | P    R    F1         | P    R    F1
YAGO      DBpedia   | .06  .68  .11       | .42  .54  .48        | .92  .73  .81
YAGO      Freebase  | .03  1    .05       | .40  1    .57        | .82  .82  .82
DBpedia   Freebase  | .05  .85  .09       | .31  .65  .42        | .69  .38  .49
DBpedia   YAGO      | .18  .55  .27       | .40  .45  .43        | .57  .49  .53
Freebase  DBpedia   | .34  .93  .50       | .72  .57  .64        | .87  .66  .75
Freebase  YAGO      | .61  .86  .71       | .73  .60  .66        | .69  .74  .71
Average             | .21  .81  .29       | .49  .63  .53        | .76  .64  .69
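For reading the table, it helps to recall how F1 relates to P and R. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; recomputing it from the rounded values shown on the slide can differ from the reported figure in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# SORAL on YAGO → DBpedia: P = .92, R = .73
print(round(f1_score(0.92, 0.73), 2))  # 0.81
```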
Evaluation Results: Performance
Sampled Data: Individual results on sampling – Stratified Level 3 – 500 entity samples
KBS       KBT       | ROSA (pca_th = 0.3) | PARIS (cwa_th = 0.1) | SORAL
                    | P    R    F1        | P    R    F1         | P    R    F1
YAGO      DBpedia   | .17  .75  .28       | .71  .66  .68        | 1    .68  .81
YAGO      Freebase  | .11  .78  .20       | .55  .59  .57        | .87  .67  .76
DBpedia   Freebase  | .10  .67  .18       | .31  .50  .40        | .72  .36  .48
DBpedia   YAGO      | .30  .72  .43       | .70  .66  .68        | .86  .60  .71
Freebase  DBpedia   | .27  .79  .41       | .65  .65  .65        | .88  .51  .64
Freebase  YAGO      | .22  .39  .28       | .42  .37  .39        | .72  .34  .46
Average             | .19  .68  .29       | .55  .57  .56        | .84  .52  .64
Evaluation Results: Efficiency
[Figure: SPARQL sampling time in milliseconds by sample size (100, 500, 1000) for firstN, random, and stratified sampling at levels 2 to 6]
[Figure: Bandwidth usage in kilobytes by sample size (100, 500, 1000) for firstN, random, and stratified sampling at levels 2 to 6]
Conclusions
Instance-based relation alignment approach to discover subsumptions of relations
Set of light-weight features to decide on the correctness of the subsumption relationship
Overcome main drawbacks of existing schema matching approaches → efficient alignment algorithms
Harness the complementarity of LOD sources → relation alignment at query time
Thank you all!
Questions?
Features – GS: Relevance likelihood
Likelihood of ILP scores: how reliable the matchers are varies with the datasets!
Compute the likelihood of specific ILP scores being indicators of subsumption for a relation pair:
pca likelihood
cwa likelihood
Joint pca & cwa likelihood
That is, compute the likelihood of a relation alignment being correct given a specific ILP score.
Probabilities are measured on the training set and then assigned as scores on the test set.
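One way to turn this into code is to estimate, on the training set, the share of correct alignments per score interval and to look that share up for test pairs. The slides do not say how scores are grouped, so the equal-width binning, the number of bins, and the function names below are all assumptions; the training pairs are hypothetical.

```python
def fit_likelihood(train, n_bins=5):
    """Estimate P(correct | score bin) from (score, is_correct) training pairs.

    Assumption: scores lie in [0, 1] and are grouped into equal-width bins.
    Bins without training data yield None.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, is_correct in train:
        b = min(int(score * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(is_correct)
    return [h / t if t else None for h, t in zip(hits, totals)]

def likelihood(table, score):
    """Look up the estimated likelihood for a score from the test set."""
    b = min(int(score * len(table)), len(table) - 1)
    return table[b]

# Hypothetical training data: (pca confidence, alignment judged correct?)
train = [(0.05, False), (0.10, False), (0.15, True),
         (0.50, True), (0.55, False), (0.90, True), (0.95, True)]
table = fit_likelihood(train)

print(likelihood(table, 0.92))  # 1.0
print(likelihood(table, 0.52))  # 0.5
```

The same table can be built for cwa scores, and a joint variant would bin pairs of (pca, cwa) scores instead of single values.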