Online Relation Alignment for Linked Datasets

AIFB-KIT, FIZ Karlsruhe Leibniz Institute, 30.05.17
KIT – The Research University in the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB), KIT – FIZ Karlsruhe Leibniz Institute
www.kit.edu
Maria Koutraki1, Nicoleta Preda2, Dan Vodislav3
1AIFB, FIZ-Karlsruhe, 2University of Paris-Saclay, 3University of Cergy-Pontoise

Motivation I
Dr. Maria Koutraki
Q1: What are the albums of Adele?
Q2: Where was Adele born?
Q3: What is the education Adele received?
A single KB gives incomplete info for Q1, Q2, Q3: search elsewhere…
Relations needed to answer each question, per KB:
      DBpedia      Yago            Freebase
Q1    artist       created-        artist.album
Q2    birthPlace   wasBornIn       person.place_of_birth
Q3    education    graduatedFrom   person.education

Motivation II
Thousands of linked open datasets (Linked Open Data cloud)
Linked Data are inherently diverse:
Domain
Languages
Structure etc.
Different means and publishing mechanisms (RDF dumps, SPARQL)
LOD alignment is centralized towards KBs like DBpedia
LOD datasets usually align at instance level
Difficult to query and harness the complementary nature of LOD
Classes/Relations remain mostly unaligned

Motivation III
Diverse schemas for representation in LOD
• ~600 schemas/vocabularies used for representation
• Diverse quality of schemas [1]
• Duplicate representation of similar concepts/classes and relations
• Lack of explicit alignment between classes/relations (with only up to 2%) [2]
→ disintegrated dataset landscape
[1] Aimilia Magkanaraki, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis: Benchmarking RDF Schemas for the Semantic Web. International Semantic Web Conference 2002: 132-146
[2] Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. Semantic Web Conference (1) 2014: 245-260

Mining Alignments: State of the Art
Alignment at different granularities:
Instances (owl:sameAs links) [1,2,3]
Classes
Relations
Different “Alignment” discovery problems:
Similarity between relations (vague)
Equivalence
Subsumption (very few works addressed it) [4, 5]
[1] C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee. Linked data on the Web. In WWW, 2008.
[2] C. Bohm, G. de Melo, F. Naumann, and G. Weikum. Linda: distributed web-of-data-scale entity matching. In CIKM, 2012.
[3] S. Lacoste-Julien, K. Palla, A. Davies, G. Kasneci, T. Graepel, and Z. Ghahramani. Sigma: Simple greedy matching for aligning large knowledge bases. In KDD, 2013.
[4] Fabian M. Suchanek, Serge Abiteboul, Pierre Senellart: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. VLDB, 2012.
[5] Luis Galárraga, Nicoleta Preda, Fabian M. Suchanek: Mining Rules to Align Knowledge Bases. CIKM, 2013.

Objective: Relation Alignment for Linked Datasets
Given a source and a target RDF dataset
The entities of the two KBs are aligned via sameAs links
Goal: Compute alignments for relations (one-to-one):
Subsumption
Equivalence
[Figure: source and target RDF datasets, with entities linked via owl:sameAs]

Approach: Online Relation Alignment
Instance-based
Query-based: Align KBs published by SPARQL endpoints
Sample a minimal set of entities to perform the alignment process
Supervised model (features computed on KB instances)
[Figure: two RDF datasets, each behind a SPARQL endpoint, linked via owl:sameAs]

Overview – SORAL Architecture
Input:
1. Source KB
2. Target KB
SORAL:
1. Candidates Generation
2. Supervised Model
Output:
Alignments of Relations

SORAL: Candidates Generation
1. Query KBS for entity pairs (x, y) that are instantiations of rS
2. Follow the owl:sameAs links from x and y to their counterparts x’ and y’ in KBT: (x, y) ≣ (x’, y’)
3. Query KBT for relations rT that hold between (x’, y’)
Candidates for alignment: {rS ⊆ rT1, rS ⊆ rT2, rS ⊆ rT3, … }
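The candidate generation steps above can be sketched in memory as follows. This is a minimal sketch under stated assumptions, not the SORAL implementation: each KB is modeled as a Python set of (subject, relation, object) triples instead of a SPARQL endpoint, and the function name `generate_candidates`, the `same_as` dictionary, and the sample triples are all hypothetical.

```python
from collections import Counter

def generate_candidates(kb_s, kb_t, same_as, r_s):
    """Count target relations that hold between sameAs-linked pairs of r_s.

    kb_s, kb_t: sets of (subject, relation, object) triples (stand-ins for
    the two SPARQL endpoints). same_as: dict from source to target entities.
    """
    candidates = Counter()
    # Step 1: entity pairs (x, y) that instantiate r_s in the source KB.
    pairs = [(s, o) for (s, r, o) in kb_s if r == r_s]
    for x, y in pairs:
        # Step 2: follow owl:sameAs into the target KB; skip unlinked pairs.
        if x not in same_as or y not in same_as:
            continue
        x_t, y_t = same_as[x], same_as[y]
        # Step 3: every target relation between (x', y') is a candidate.
        for (s, r, o) in kb_t:
            if s == x_t and o == y_t:
                candidates[r] += 1
    return candidates

# Hypothetical toy data, for illustration only.
kb_s = {("s:Rodin", "sculptured", "s:TheThinker"),
        ("s:Rodin", "sculptured", "s:TheKiss")}
kb_t = {("t:Rodin", "created", "t:TheThinker"),
        ("t:Rodin", "created", "t:TheKiss"),
        ("t:Rodin", "knownFor", "t:TheThinker")}
same_as = {"s:Rodin": "t:Rodin",
           "s:TheThinker": "t:TheThinker",
           "s:TheKiss": "t:TheKiss"}

result = generate_candidates(kb_s, kb_t, same_as, "sculptured")
```

Each key of `result` corresponds to one candidate rS ⊆ rT; the counts are the raw material for the overlap-based features used later.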

SORAL: Efficiency Issues
Challenges
Bandwidth
Time-out at SPARQL endpoints
Approach
Reduce data transfers
Retrieve only a subset of the instances of rS
Solution: sample a minimal subset of instances
First-N
Random
Stratified
[Figure: type hierarchy owl:Thing > Person > Artist, with the entities x1, x2… attached to their types]
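The three sampling strategies can be illustrated with a short sketch. The slides do not spell out how the stratified variant allocates samples, so the proportional allocation per type group below is an assumption, as are the function names, the `type_of` mapping, and the toy types.

```python
import random
from collections import defaultdict

def first_n(pairs, n):
    """First-N: take the first n instances in the order they are returned."""
    return pairs[:n]

def random_sample(pairs, n, seed=0):
    """Random: draw up to n instances uniformly, without replacement."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))

def stratified_sample(pairs, type_of, n, seed=0):
    """Stratified: draw from each type group in proportion to its size.

    type_of maps a subject entity to its type at the chosen hierarchy level.
    The result can deviate slightly from n because of rounding and because
    every group contributes at least one instance.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for x, y in pairs:
        strata[type_of[x]].append((x, y))
    sample = []
    for group in strata.values():
        k = max(1, round(n * len(group) / len(pairs)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

# Hypothetical data: 8 subjects typed Artist, 2 typed Scientist.
pairs = [(f"x{i}", f"y{i}") for i in range(10)]
type_of = {f"x{i}": ("Artist" if i < 8 else "Scientist") for i in range(10)}
sample = stratified_sample(pairs, type_of, n=5)
```

Compared with First-N, the stratified variant keeps rare types represented in the sample, which is presumably why deeper hierarchy levels are evaluated separately.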

SORAL: Supervised Model
Classify the candidate alignments as correct or incorrect
Use features as “matchers”
Feature groups:
Inductive Logic Programming (ILP)
General Statistics (GS)
Lexical

Features – ILP: CWA
Closed world assumption (cwa): “for a relation r the KB contains all the facts.”
$\mathrm{cwaconf}(r_s \subseteq r_t) = \dfrac{\mathrm{overlap}(r_s, r_t)}{|r_s|}$
Example (rS: sculptured in KBS, rT: created in KBT): cwaconf = 2/5
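The cwa confidence is a plain ratio and can be sketched in a few lines. The sketch assumes that the facts of rS have already been mapped to target entities via owl:sameAs; the function name and the toy facts are hypothetical and were chosen only to reproduce the 2/5 of the example.

```python
def cwa_conf(facts_s, facts_t):
    """overlap(r_s, r_t) / |r_s| under the closed world assumption.

    facts_s: set of (x, y) pairs of r_s, expressed with target entities.
    facts_t: set of (x, y) pairs of r_t in the target KB.
    """
    if not facts_s:
        return 0.0
    overlap = len(facts_s & facts_t)  # r_s facts also stated by r_t
    return overlap / len(facts_s)

# Hypothetical facts: 5 facts for r_s, 2 of them confirmed by r_t.
sculptured = {("a", 1), ("a", 2), ("b", 3), ("c", 4), ("d", 5)}
created = {("a", 1), ("a", 2), ("b", 9)}
score = cwa_conf(sculptured, created)
```

Every rS fact missing from rT counts against the alignment here, even when the target KB simply knows nothing about the subject.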

Features – ILP: PCA [Galárraga et al., WWW 2013]
Partial completeness assumption (pca): “for a subject x and relation r, the KB contains either all or none of the facts.”
$\mathrm{pcaconf}(r_s \subseteq r_t) = \dfrac{\mathrm{overlap}(r_s, r_t)}{\mathrm{overlap}(r_s, r_t) + \mathrm{counter}(r_s, r_t)}$
Example: pcaconf = 2 / (2 + 1)
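A minimal sketch of the pca confidence follows. It assumes that a counter-example is an rS fact (x, y) that rT does not state although rT has at least one fact for the subject x, which is how the partial completeness assumption is usually read; the function name and toy facts are hypothetical and reproduce the 2 / (2 + 1) of the example.

```python
def pca_conf(facts_s, facts_t):
    """overlap / (overlap + counter) under the partial completeness assumption.

    facts_s: set of (x, y) pairs of r_s, expressed with target entities.
    facts_t: set of (x, y) pairs of r_t in the target KB.
    """
    subjects_t = {x for x, _ in facts_t}
    overlap = len(facts_s & facts_t)
    # Counter-examples: r_t knows the subject x but does not state (x, y).
    counter = sum(1 for (x, y) in facts_s
                  if (x, y) not in facts_t and x in subjects_t)
    if overlap + counter == 0:
        return 0.0
    return overlap / (overlap + counter)

# Hypothetical facts: subjects "c" and "d" are unknown to r_t, so their
# facts are ignored; ("b", 3) is the single counter-example.
sculptured = {("a", 1), ("a", 2), ("b", 3), ("c", 4), ("d", 5)}
created = {("a", 1), ("a", 2), ("b", 9)}
score = pca_conf(sculptured, created)
```

On the same toy data the cwa confidence would be 2/5, so the pca score is less affected by subjects the target KB does not cover at all.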

Features – Relation Functionality
Functionality: “A relation r(x,y) is called functional if for each x there is at most one y.”
If rS is subsumed by rT, the functionality should be higher.
Target relations should have better coverage of facts.
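The slides do not say how functionality is measured. One common choice, used for instance in PARIS, is the ratio of distinct subjects to facts; the sketch below assumes that definition, and the function name and sample facts are illustrative only.

```python
def functionality(facts):
    """Ratio of distinct subjects to facts; 1.0 means strictly functional."""
    if not facts:
        return 0.0
    subjects = {x for x, _ in facts}
    return len(subjects) / len(facts)

# Illustrative facts: one birthplace per person, several works per creator.
born_in = {("Adele", "London"), ("Rodin", "Paris")}
created = {("Adele", "19"), ("Adele", "21"), ("Rodin", "The Thinker")}

fun_born_in = functionality(born_in)   # every subject has exactly one object
fun_created = functionality(created)   # 2 subjects spread over 3 facts
```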

Features - ILP: PIA
Partial completeness assumption (pca):
Performs well for functional relations
Penalizes non-functional relations
Proposal: partial incompleteness assumption (pia)
The more important a counter-example is, the more it should count!

Features – GS: Type similarity
Type distribution similarity between relations rS and rT.
Weighted Jaccard similarity metric to assess if the two relations have similar structure in terms of types.
High similarity: good indicator for equivalence/subsumption between relations.
Example:
Type    rT: hasWriter   rS: hasCreator
Book    30%             20%
Movie   20%             30%
…
High similarity!!
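The weighted Jaccard similarity of two type distributions is the sum of the per-type minima divided by the sum of the per-type maxima. The sketch below uses only the two types listed in the example and ignores the elided ones; the function name and the assignment of each distribution to hasWriter or hasCreator are assumptions.

```python
def weighted_jaccard(dist_a, dist_b):
    """Sum of per-type minima over sum of per-type maxima."""
    types = set(dist_a) | set(dist_b)
    num = sum(min(dist_a.get(t, 0.0), dist_b.get(t, 0.0)) for t in types)
    den = sum(max(dist_a.get(t, 0.0), dist_b.get(t, 0.0)) for t in types)
    return num / den if den else 0.0

# Only the types visible in the example; the remaining ones are elided.
has_writer = {"Book": 0.30, "Movie": 0.20}
has_creator = {"Book": 0.20, "Movie": 0.30}
sim = weighted_jaccard(has_writer, has_creator)
```

A type that appears in only one distribution contributes zero to the numerator and its full weight to the denominator, so it lowers the score.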

Features – GS: Type dissimilarity
Dissimilarity as the ratio of types in rS that do not exist in rT.
For missing types we can accurately assess that rT does not subsume rS.
Example:
Type        rT: hasWriter   rS: hasCreator
Book        30%             20%
Movie       20%             30%
Song        5%
Paintings                   50%
…
High dissimilarity!!
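A minimal sketch of the dissimilarity feature, assuming that the ratio is taken over the number of distinct types (not over their percentages) and that, in the example, Paintings belongs to rS and Song to rT. Both readings, and the function name, are assumptions.

```python
def type_dissimilarity(types_s, types_t):
    """Share of the types of r_s that never occur with r_t."""
    if not types_s:
        return 0.0
    missing = set(types_s) - set(types_t)
    return len(missing) / len(set(types_s))

# Types visible in the example; the column assignment is assumed.
types_has_creator = {"Book", "Movie", "Paintings"}   # r_S
types_has_writer = {"Book", "Movie", "Song"}         # r_T
dissim = type_dissimilarity(types_has_creator, types_has_writer)
```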

Experimental Setup
Knowledge Bases:
YAGO, DBpedia, Freebase (e.g. YAGO → DBpedia)
Relations:
KB           YAGO   DBpedia   Freebase
#relations   36     563       1666
Baselines:
cwa (used in PARIS) [Suchanek et al., VLDB 2012]
pca (used in ROSA) [Galárraga et al., CIKM 2013]
SORAL: Logistic Regression (any other supervised model can be applied)
Ground Truth: Manually constructed by expert annotators based on:
Relation label
Relation instances

Evaluation Results: Performance
Full Data: Comparison of the different models and competitors
                       ROSA (pca th=0.3)   PARIS (cwa th=0.1)   SORAL
KBS        KBT         P     R     F1      P     R     F1       P     R     F1
YAGO       DBpedia     .06   .68   .11     .42   .54   .48      .92   .73   .81
YAGO       Freebase    .03   1     .05     .40   1     .57      .82   .82   .82
DBpedia    Freebase    .05   .85   .09     .31   .65   .42      .69   .38   .49
DBpedia    YAGO        .18   .55   .27     .40   .45   .43      .57   .49   .53
Freebase   DBpedia     .34   .93   .50     .72   .57   .64      .87   .66   .75
Freebase   YAGO        .61   .86   .71     .73   .60   .66      .69   .74   .71
Average                .21   .81   .29     .49   .63   .53      .76   .64   .69

Evaluation Results: Performance
Sampled Data: Individual results on sampling – Stratified Level 3 – 500 entity samples
                       ROSA (pca th=0.3)   PARIS (cwa th=0.1)   SORAL
KBS        KBT         P     R     F1      P     R     F1       P     R     F1
YAGO       DBpedia     .17   .75   .28     .71   .66   .68      1     .68   .81
YAGO       Freebase    .11   .78   .20     .55   .59   .57      .87   .67   .76
DBpedia    Freebase    .10   .67   .18     .31   .50   .40      .72   .36   .48
DBpedia    YAGO        .30   .72   .43     .70   .66   .68      .86   .60   .71
Freebase   DBpedia     .27   .79   .41     .65   .65   .65      .88   .51   .64
Freebase   YAGO        .22   .39   .28     .42   .37   .39      .72   .34   .46
Average                .19   .68   .29     .55   .57   .56      .84   .52   .64

Evaluation Results: Efficiency
[Figure: SPARQL sampling time in milliseconds (y-axis, 0 to 3500) by sample size (100, 500, 1000) for firstN, random, and stratified sampling at levels 2 to 6]
[Figure: Bandwidth usage in kilobytes (y-axis, 0 to 160) by sample size (100, 500, 1000) for the same sampling strategies]

Conclusions
Instance-based relation alignment approach to discover subsumptions of relations
Set of light-weight features to decide on the correctness of the subsumption relationship
Overcome main drawbacks of existing schema matching approaches → efficient alignment algorithms
Harness the complementarity of LOD sources → relation alignment at query time

Thank you all!
Questions?

Features – GS: Relevance likelihood
The reliability of ILP scores as matchers varies with the datasets!
Compute the likelihood of specific ILP scores being indicators of subsumption for a relation pair:
pca likelihood
cwa likelihood
Joint pca & cwa likelihood
That is, the likelihood of a relation alignment being correct given a specific ILP score.
Probabilities are measured on the training set and then assigned as scores on the test set.
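One simple way to estimate such a likelihood is to bin the ILP scores of the training set and record the share of correct alignments per bin. The slides do not describe the estimator, so the equal-width binning, the function names, and the toy training pairs below are all assumptions.

```python
from collections import defaultdict

def score_bin(score, n_bins=10):
    """Map a score in [0, 1] to an equal-width bin index."""
    return min(int(score * n_bins), n_bins - 1)

def fit_likelihood(train, n_bins=10):
    """Estimate P(correct | score bin) from (score, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for score, is_correct in train:
        b = score_bin(score, n_bins)
        totals[b] += 1
        correct[b] += int(is_correct)
    return {b: correct[b] / totals[b] for b in totals}

def likelihood(model, score, n_bins=10, default=0.0):
    """Look up the training-set likelihood for a test-set score."""
    return model.get(score_bin(score, n_bins), default)

# Hypothetical training pairs: (pca or cwa score, alignment is correct).
train = [(0.95, True), (0.92, True), (0.91, False),
         (0.15, False), (0.12, False)]
model = fit_likelihood(train)
p_high = likelihood(model, 0.93)   # bin with 2 correct out of 3
p_low = likelihood(model, 0.14)    # bin with 0 correct out of 2
```

A joint pca and cwa likelihood could be obtained the same way by binning on the pair of scores instead of a single score.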

owl:sameAs links
KB pairs      YAGO-DBpedia   YAGO-Freebase   DBpedia-Freebase
#owl:sameAs   2,886,308      2,730,652       3,873,432

Evaluation Results: Feature Ablation
[Figure: Feature ablation, P/R/F1 scores (y-axis, 0.00 to 0.80) for the feature sets ALL, GRS, ILP, LEX]
