On the Impact of sameAs
on Schema Matching
JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH
Knowledge Representation & Reasoning Group
KCAP 2019 - November 20, 2019 - Marina del Rey
More and more Linked Open Data...
(crawled from ~650K datasets in 2015)
...More and more (overlapping) Schemas
Schema Matching is Inevitable
• It is neither possible nor desirable to have a unique schema covering all domains
• In order to exploit this wealth of available knowledge and enhance knowledge-based
systems (e.g., search engines, virtual assistants, etc.), we need to match these
overlapping schemas
• Schema Matching: finding relationships between entities of different schemas
• equivalence relations
• subsumption
• disjointness
• ....
Schema Matching over the years
• Active area of research in several communities, including the Semantic Web (SW)
• Ontology Alignment Evaluation Initiative (OAEI) ongoing for 15 years
• [Euzenat and Shvaiko, 2013] reviews ~100 schema-matching systems
• 50% rely mostly on schema-level information (i.e. schema-based approaches)
• 25% rely mostly on instance-level information (i.e. instance-based approaches)
• 25% rely on schema + instance-level information (i.e. mixed approaches)
Instance-based Schema Matching
• All instance-based schema-matching approaches share two essential ideas:
1. The semantics of a concept is better determined by its members than by its annotations
Concepts refer to sets that possibly have named instances as members
• ext(C) refers to the set of instances which are explicitly stated as members of C
ext(foaf:Person) = {ex:i1, ex:i2}
• ext⊑(C) refers to the set of instances which are explicitly or implicitly stated as members of C
(i.e. either explicit members or members derived through concept subsumption)
ext⊑(foaf:Person) = {ex:i1, ex:i2, ex:i3}
[Figure: ex:i1 and ex:i2 are linked by rdf:type to foaf:Person; ex:i3 is linked by rdf:type to dbo:Scientist, which is rdfs:subClassOf foaf:Person]
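The two extensions defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the triple list mirrors the slide's example (with ex:i3 assumed to be typed as dbo:Scientist), and the function names `explicit_extensions` and `inferred_extensions` are hypothetical.

```python
from collections import defaultdict

# Toy triples mirroring the slide's example (prefixed names kept as plain strings).
# Assumption: ex:i3 is typed as dbo:Scientist, a subclass of foaf:Person.
TRIPLES = [
    ("ex:i1", "rdf:type", "foaf:Person"),
    ("ex:i2", "rdf:type", "foaf:Person"),
    ("ex:i3", "rdf:type", "dbo:Scientist"),
    ("dbo:Scientist", "rdfs:subClassOf", "foaf:Person"),
]


def explicit_extensions(triples):
    """ext(C): instances explicitly typed with C."""
    ext = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            ext[o].add(s)
    return ext


def inferred_extensions(triples):
    """ext⊑(C): explicit members of C plus members of its (transitive) subclasses."""
    ext = explicit_extensions(triples)
    subclasses = defaultdict(set)  # direct subclasses of each concept
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            subclasses[o].add(s)

    def members(concept, seen):
        if concept in seen:  # guard against rdfs:subClassOf cycles
            return set()
        seen.add(concept)
        result = set(ext.get(concept, set()))
        for sub in subclasses.get(concept, set()):
            result |= members(sub, seen)
        return result

    concepts = set(ext) | set(subclasses)
    return {c: members(c, set()) for c in concepts}


ext = explicit_extensions(TRIPLES)
ext_sub = inferred_extensions(TRIPLES)
```

Here `ext["foaf:Person"]` holds the two explicit members, while `ext_sub["foaf:Person"]` additionally contains ex:i3 through subsumption.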
Instance-based Schema Matching
• All instance-based schema-matching approaches share two essential ideas:
2. The more significant the overlap between two concepts’ members is, the more related
these concepts are
• Multiple techniques to measure the overlap between concepts’ members
• Formal concept analysis techniques
• Machine learning
• Jaccard index
• ....
Instance-based Schema Matching using Jaccard Index
• The Jaccard index is a commonly used measure to score the similarity between two sets
• The higher the similarity of two sets is, the greater the Jaccard index
J(A, B) = |A ∩ B| / |A ∪ B|
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are linked by rdf:type to foaf:Person; ex:i3, ex:i4 and ex:i5 are linked by rdf:type to schema:Person]
ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
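As a quick check of the example above, the Jaccard index can be computed directly on Python sets. This is a minimal sketch; returning 0.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given similarity 0.0
    return len(a & b) / len(union)


foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema_person = {"ex:i3", "ex:i4", "ex:i5"}

# Two shared members (ex:i3, ex:i4) out of five distinct members.
score = jaccard(foaf_person, schema_person)
```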
Instance-based Schema Matching using Jaccard Index
With more than 558 million explicitly asserted owl:sameAs [Beek et al., ESWC 2018]
(or 35 billion after transitive closure), the reality in the Web of Data looks more like this:
J(A, B) = |A ∩ B| / |A ∪ B|
Scenario 1, where J increases: owl:sameAs* links ex:i2 and ex:i5, which form the equivalence class eq{2,5}
Without owl:sameAs:
ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
With owl:sameAs*:
ext~(foaf:Person) = {ex:i1, eq{2,5}, ex:i3, ex:i4}
ext~(schema:Person) = {ex:i3, ex:i4, eq{2,5}}
J(ext~(foaf:Person), ext~(schema:Person)) = 3/4 = 0.75
Instance-based Schema Matching using Jaccard Index
Or possibly like this:
J(A, B) = |A ∩ B| / |A ∪ B|
Scenario 2, where J decreases: owl:sameAs* links ex:i3 and ex:i4, which form the equivalence class eq{3,4}
Without owl:sameAs:
ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
With owl:sameAs*:
ext~(foaf:Person) = {ex:i1, ex:i2, eq{3,4}}
ext~(schema:Person) = {eq{3,4}, ex:i5}
J(ext~(foaf:Person), ext~(schema:Person)) = 1/4 = 0.25
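Both scenarios can be reproduced by collapsing each instance to a representative of its owl:sameAs equivalence class before computing the Jaccard index. The sketch below uses a small union-find structure for the closure; this is one common way to compute it and is an assumption here, not a description of how the authors computed the closure. The helper names are hypothetical.

```python
def canonical_map(same_as_links):
    """Map each linked IRI to a representative of its owl:sameAs closure (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # deterministic representative
    return {x: find(x) for x in list(parent)}


def collapse(extension, canonical):
    """ext~(C): replace each member by the representative of its equivalence class."""
    return {canonical.get(i, i) for i in extension}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema_person = {"ex:i3", "ex:i4", "ex:i5"}


def jaccard_with_same_as(links):
    canonical = canonical_map(links)
    return jaccard(collapse(foaf_person, canonical), collapse(schema_person, canonical))


baseline = jaccard(foaf_person, schema_person)
scenario_1 = jaccard_with_same_as([("ex:i2", "ex:i5")])  # ex:i2 and ex:i5 merge
scenario_2 = jaccard_with_same_as([("ex:i3", "ex:i4")])  # ex:i3 and ex:i4 merge
```

Merging one member from each side's non-shared part grows the intersection (scenario 1), while merging two already-shared members shrinks it (scenario 2).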
Research Questions
Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?
1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
Why should we care?
• Providing empirical evidence for schema-matching designers on whether exploiting a large
external collection of instance-level interlinks (e.g. from the LOD Cloud) improves the
accuracy of schema-matching techniques
• In particular, showing the risks and benefits of using owl:sameAs, after a number of studies suggested
that a large* number of the existing owl:sameAs links in the Web are actually erroneous
* 3% of evaluated owl:sameAs are erroneous [Hogan et al., JWS 2012]
* 4% of evaluated owl:sameAs are erroneous [Raad et al., ISWC 2018]
* 20% of evaluated owl:sameAs are erroneous [Halpin et al., ISWC 2010]
Research Question 2
How does the quality of the instance-level interlinks impact the quality of the resulting
schema alignments?
2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of equivalent concepts?
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
Dataset Description
Dataset
(crawled from ~650K datasets in 2015)
# triples 28,362,198,927
# rdf:type statements 3,321,354,308
# rdfs:subClassOf statements 4,461,717
# owl:equivalentClass statements 1,051,979
# explicit owl:sameAs statements 558,943,116
# implicit owl:sameAs statements 35,201,120,188
# equivalence classes (after closure of owl:sameAs) 48,999,148
# concepts with explicit members |C| 833,232
# concepts with explicit or implicit members |C⊑| 976,674
Size distribution of the Concepts’ members
• 23% of the concepts have one explicit member
• 92% of the concepts have ≤ 100 explicit members
• 618 concepts have more than 100M explicit or implicit members
• 5 concepts have more than 100M explicit members
Concepts with more than 100M explicit members
Concept Cardinality %
http://purl.org/linked-data/cube#Observation 1,306,389,396 39.3
http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry 304,878,654 9.2
http://geovocab.org/geometry#Geometry 167,808,111 5
http://knoesis.wright.edu/ssw/ont/ sensorobservation.owl#MeasureData 144,044,989 4.3
http://xmlns.com/foaf/0.1/Person 132,919,327 4
Total 2,056,040,477 61.9
Experiments
Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?
1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
1. a. Does the inclusion of owl:sameAs increase the Jaccard
Index of equivalent concepts?
• 1,051,979 owl:equivalentClass statements in the LOD-a-lot
  - Hypothesis: all these statements are correct alignments
  - Only 972 owl:equivalentClass statements in which both concepts have explicit members
  - 208 reflexive alignments (C1, owl:equivalentClass, C1)
  - 22 duplicate symmetric alignments (C1, owl:equivalentClass, C2) and (C2, owl:equivalentClass, C1)
  - After removing these, 742 alignments between 1,357 distinct concepts remain (i.e. the gold standard)
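The filtering steps above (972 − 208 − 22 = 742) can be sketched as follows. This is an illustrative reconstruction on toy data, not the authors' code; `build_gold_standard` and the single-letter concept names are hypothetical.

```python
def build_gold_standard(equivalent_class_pairs, concepts_with_members):
    """Keep one alignment per unordered pair of distinct concepts with explicit members."""
    gold = set()
    for c1, c2 in equivalent_class_pairs:
        if c1 not in concepts_with_members or c2 not in concepts_with_members:
            continue  # at least one concept has no explicit members
        if c1 == c2:
            continue  # reflexive alignment (C1, owl:equivalentClass, C1)
        # (C1, C2) and (C2, C1) collapse to the same frozenset, removing symmetric duplicates
        gold.add(frozenset((c1, c2)))
    return gold


# Toy input: one reflexive pair, one symmetric duplicate, one pair lacking members.
pairs = [("A", "A"), ("A", "B"), ("B", "A"), ("C", "D"), ("E", "F")]
with_members = {"A", "B", "C", "D", "E"}

gold = build_gold_standard(pairs, with_members)
distinct_concepts = set().union(*gold)
```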
Size distribution of the Concepts’ members of our Gold Standard
Jaccard Index distribution for the 742 alignments
Compare J(ext⊑(C1), ext⊑(C2)) with J(ext⊑~(C1), ext⊑~(C2)) for each pair such that (C1, owl:equivalentClass, C2)
Runtime: <4 hours on 64GB SSD disk
Jaccard Index variation for the 742 alignments
• When owl:sameAs is considered, the Jaccard index increases for 381/742 of the correct alignments (52%)
• Out of these 381 cases, the Jaccard index increases from 0 to 1 in 44 cases (7%)
• When owl:sameAs is considered, the Jaccard index decreases for 25/742 of the correct alignments (3%)
• Slight drop in impact when only explicit members are considered
owl:sameAs increases the overlap of two equivalent concepts in about half of the cases
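The counts reported above come from comparing, per alignment, the Jaccard index without and with owl:sameAs. A minimal sketch of that tally is shown below; the score pairs are made-up toy values, and `summarize` and its tolerance parameter are hypothetical.

```python
from collections import Counter


def classify_variation(before, after, tol=1e-12):
    """Label one alignment by how its Jaccard index changes once owl:sameAs is used."""
    if after > before + tol:
        return "increase"
    if after < before - tol:
        return "decrease"
    return "unchanged"


def summarize(scores):
    """scores: (J without owl:sameAs, J with owl:sameAs) pairs, one per alignment."""
    scores = list(scores)
    counts = Counter(classify_variation(b, a) for b, a in scores)
    zero_to_one = sum(1 for b, a in scores if b == 0.0 and a == 1.0)
    return {
        "increase": counts["increase"],
        "decrease": counts["decrease"],
        "unchanged": counts["unchanged"],
        "zero_to_one": zero_to_one,
        "total": len(scores),
    }


# Toy values only: two increases (one from 0 to 1), one decrease, one unchanged.
summary = summarize([(0.4, 0.75), (0.4, 0.25), (0.0, 1.0), (0.5, 0.5)])
```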
1. b. Does the inclusion of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
• 833,232 concepts with explicit members in the LOD-a-lot
  - Created a random alignment for each concept such that each concept is paired only once
  - Hypothesis: all these random alignments are erroneous
  - 416,616 random alignments
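Pairing every concept exactly once yields half as many alignments as concepts (833,232 / 2 = 416,616). One simple way to build such a pairing is to shuffle the concepts and pair consecutive elements. The sketch below assumes this approach, since the slides do not say how the random pairing was drawn; the seed and function name are hypothetical.

```python
import random


def random_alignments(concepts, seed=0):
    """Pair each concept with exactly one other concept, chosen at random."""
    shuffled = sorted(concepts)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    # Pair consecutive elements; with an odd count the last concept is left out.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


toy_concepts = [f"ex:C{i}" for i in range(10)]
pairs = random_alignments(toy_concepts)
paired = [c for pair in pairs for c in pair]
```

Note that a random pair could, by chance, consist of two genuinely equivalent concepts; the slide's hypothesis treats all such pairs as erroneous.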
Jaccard Index variation for the 416,616 random alignments
• When owl:sameAs is considered, the Jaccard index increases
for only 94/416,616 of the random alignments (0.02%)
• When owl:sameAs is considered, the Jaccard index
decreases for 3/416,616 of the random alignments (~0%)
owl:sameAs rarely increases the overlap of two non-equivalent concepts
Research Question 2
How does the quality of the instance-level interlinks affect the quality of the resulting
schema alignments?
2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of equivalent concepts?
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
2. a. Does the inclusion of a higher quality subset of owl:sameAs
increase the Jaccard Index of equivalent concepts?
owl:sameAs quality indicator:
• [Raad et al., ISWC 2018] computed an error degree for each owl:sameAs link
• The error degree ranges over [0.0, 1.0] and is based on the community structure of the owl:sameAs network
and the symmetry of the links
• Manual evaluation shows that owl:sameAs links with error degree >0.99 have a high
probability of being erroneous
• Manual evaluation shows that owl:sameAs links with error degree ≤0.4 have a high
probability of being correct
• Transitive closure (a): after discarding the ~1M owl:sameAs links with error degree >0.99
• Transitive closure (b): after discarding the ~150M owl:sameAs links with error degree >0.4
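The two closures above differ only in the error-degree threshold applied before the transitive closure is computed. A minimal sketch, assuming each link is available as an (IRI, IRI, error degree) triple; the toy links, their error degrees, and the function names are hypothetical.

```python
def filter_links(scored_links, max_error_degree):
    """Keep owl:sameAs links whose error degree does not exceed the threshold."""
    return [(a, b) for a, b, err in scored_links if err <= max_error_degree]


def closure(links):
    """Equivalence classes induced by the links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return {frozenset(g) for g in groups.values()}


# Toy links with made-up error degrees.
scored = [("ex:a", "ex:b", 0.1), ("ex:b", "ex:c", 0.6), ("ex:c", "ex:d", 1.0)]

closure_a = closure(filter_links(scored, 0.99))  # discards links with error degree > 0.99
closure_b = closure(filter_links(scored, 0.4))   # discards links with error degree > 0.4
```

The stricter threshold splits the larger equivalence class, which is the mechanism by which overlap between concepts can drop when more links are discarded.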
Jaccard Index variation for the 742 alignments
• When owl:sameAs links with error degree >0.99 are discarded, there is
a slight decrease in performance:
  • Jaccard increases in 51% of the cases (instead of 52%)
  • Jaccard increases from 0 to 1 in 37 cases (instead of 44)
• When owl:sameAs links with error degree >0.4 are discarded, there is
a large decrease in performance:
  • Jaccard increases in only 13% of the cases (instead of 52%)
  • Jaccard increases from 0 to 1 in two cases (instead of 44)
  • Jaccard decreases in 39 cases (instead of 25)
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
Jaccard Index variation for the 416,616 random alignments
• When owl:sameAs links with error degree >0.99 are discarded,
there is an increase in performance:
  • Jaccard increases in 27 cases (instead of 94)
• When owl:sameAs links with error degree >0.4 are discarded,
there is a substantial increase in performance:
  • Jaccard increases in 2 cases (instead of 94)
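The relative reduction in false increases follows directly from the counts above. A short worked calculation, using only the figures on this slide; the function name is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction by which a count drops from `before` to `after`."""
    return (before - after) / before


# Random alignments whose Jaccard index increases: 94 with all owl:sameAs links.
reduction_099 = relative_reduction(94, 27)  # after discarding error degree > 0.99
reduction_04 = relative_reduction(94, 2)    # after discarding error degree > 0.4
```

Rounded, these correspond to reductions of about 71% and 98% respectively.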
Take away message
This work provides an empirical study on the impact of including instance-level interlinks
on the overlap between concepts' members
• Including instance-level interlinks can enhance the performance of instance-based schema alignments
  - Increases the overlap for 52% of the existing alignments in the LOD-a-lot
  - Increases the overlap for less than 0.3% of randomly created alignments
• Inference does positively impact instance-based schema alignments
  - Considering also the implicit members enhances the results on the Gold Standard by 3 percentage points
• Discarding only isolated owl:sameAs links in the network can increase the quality of instance-based schema
alignments (owl:sameAs links are probably not as bad as we first thought)
  - Reduces the cases where the Jaccard index increases for non-equivalent concepts by 71%
Future Work
• Use a larger gold standard to validate the results presented here
• Investigate alternative, better-tailored instance-based measures to detect new schema
alignments at the scale of the Web of Data, exploiting:
  - the higher quality subset of the owl:sameAs links;
  - both the explicit and implicit members of the concepts
Thank you for your attention...
@joeraad_
Code + Results
https://github.com/raadjoe/impact-sameAs-schema-matching

More Related Content

What's hot

Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Lucidworks
 
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on WikipediaQuery Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
YI-JHEN LIN
 

What's hot (14)

Sina presentation in IBM
Sina presentation in IBMSina presentation in IBM
Sina presentation in IBM
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
 
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on WikipediaQuery Dependent Pseudo-Relevance Feedback based on Wikipedia
Query Dependent Pseudo-Relevance Feedback based on Wikipedia
 
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine LearningLucene/Solr Revolution 2015: Where Search Meets Machine Learning
Lucene/Solr Revolution 2015: Where Search Meets Machine Learning
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
 
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
 
Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)
 
Ontology-Based Data Access Mapping Generation using Data, Schema, Query, and ...
Ontology-Based Data Access Mapping Generation using Data, Schema, Query, and ...Ontology-Based Data Access Mapping Generation using Data, Schema, Query, and ...
Ontology-Based Data Access Mapping Generation using Data, Schema, Query, and ...
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Amrapali Zaveri Defense
Amrapali Zaveri DefenseAmrapali Zaveri Defense
Amrapali Zaveri Defense
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
 
PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"
PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"
PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"
 

Similar to On the Impact of sameAs on Schema Matching

Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking  Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking
Mohamed BEN ELLEFI
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
Lucy McKenna
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
Sri Ambati
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Sri Ambati
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Future
feiwin
 

Similar to On the Impact of sameAs on Schema Matching (20)

Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Social Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIASocial Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIA
 
Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking  Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking
 
Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory
 
Discovery Hub: on-the-fly linked data exploratory search
Discovery Hub: on-the-fly linked data exploratory searchDiscovery Hub: on-the-fly linked data exploratory search
Discovery Hub: on-the-fly linked data exploratory search
 
Knowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sureKnowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sure
 
ICELW Conference Slides
ICELW Conference SlidesICELW Conference Slides
ICELW Conference Slides
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
Utilizing Neo4j with AI Applications
Utilizing Neo4j with AI ApplicationsUtilizing Neo4j with AI Applications
Utilizing Neo4j with AI Applications
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPedia
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
De carlo rizk 2010 icelw
De carlo rizk 2010 icelwDe carlo rizk 2010 icelw
De carlo rizk 2010 icelw
 
Show me the data! Actionable insight from open courses
Show me the data! Actionable insight from open coursesShow me the data! Actionable insight from open courses
Show me the data! Actionable insight from open courses
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep Learning
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Future
 

Recently uploaded

Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Sérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 

Recently uploaded (20)

Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 

On the Impact of sameAs on Schema Matching

  • 1. On the Impact of sameAs on Schema Matching JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH Knowledge Representation & Reasoning Group KCAP 2019 - November 20, 2019 - Marina del Rey
  • 2. More and more Linked Open Data... 2 (crawled from ~650K datasets in 2015)
  • 3. ...More and more (overlapping) Schemas 3
  • 4. Schema Matching is Inevitable 4 • It is not possible (neither desired) to have a unique schema covering all domains • In order to exploit this wealth of available knowledge and enhance knowledge-based systems (e.g., search engines, virtual assistants, etc.), we need to match these overlapping schemas • Schema Matching: finding relationships between entities of different schemas • equivalence relations • subsumption • disjointness • ....
  • 5. Schema Matching over the years 5 • Active area of research from several communities, including the SW • Ontology Alignment Evaluation Initiative (OAEI) ongoing for 15 years • [Euzenat and Shvaiko, 2013] reviews ~100 schema-matching systems 50% rely mostly on schema-level information (i.e. schema-based approaches) 25% rely mostly on instance-level information (i.e. instance-based approaches) 25% rely on schema + instance-level information (i.e. mixed approaches)
  • 6. 6 Instance-based Schema Matching • All instance-based schema-matching approaches share two essential ideas: 1. The semantics of a concept is better determined by its members rather by its annotations Concepts refer to sets that possibly have named instances as members • ext(C) refer to the set of instances which are explicitly stated as members of C ext(foaf:Person) = {ex:i1, ex:i2} • ext⊑(C) refer to the set of instances which are explicitly or implicitly stated as members of C (i.e. either explicit members or derived through concept subsumption) ext⊑(foaf:Person) = {ex:i1, ex:i2, ex:i3} foaf:Person ex:i1 ex:i2 ex:i3 rdf:type dbo:Scientist rdfs:subClassOf rdf:type
  • 7. 7 Instance-based Schema Matching • All instance-based schema-matching approaches share two essential ideas: 2. The more significant the overlap between two concepts’ members is, the more related these concepts are • Multiple techniques to measure the overlap between concepts’ members • Formal concept analysis techniques • Machine learning • Jaccard index • ....
  • 8. Instance-based Schema Matching using Jaccard Index • The Jaccard index is a commonly used measure to score the similarity between two sets • The higher the similarity of two sets, the greater the Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ • Example: ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4} and ext(schema:Person) = {ex:i3, ex:i4, ex:i5}, so J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4 • [Figure: ex:i1 to ex:i4 have rdf:type foaf:Person; ex:i3 to ex:i5 have rdf:type schema:Person]
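As a quick check of the arithmetic on this slide, the Jaccard index can be computed directly on Python sets. The helper name `jaccard` and the choice to return 0.0 for two empty sets are assumptions made for this sketch; the slides do not say how the empty case is handled.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty extensions count as no overlap
    return len(a & b) / len(union)

foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema_person = {"ex:i3", "ex:i4", "ex:i5"}

score = jaccard(foaf_person, schema_person)
print(score)  # 2 shared instances out of 5 distinct ones
```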
  • 9. Instance-based Schema Matching using Jaccard Index • With more than 558 million explicitly asserted owl:sameAs links [Beek et al., ESWC 2018] (or 35 billion after transitive closure), the reality in the Web of Data looks more like this • $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ • Scenario 1, where J increases: ex:i2 and ex:i5 are connected by owl:sameAs* and form the equivalence class eq{2,5} • Without owl:sameAs: ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}, ext(schema:Person) = {ex:i3, ex:i4, ex:i5}, J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4 • With owl:sameAs: ext~(foaf:Person) = {ex:i1, eq{2,5}, ex:i3, ex:i4}, ext~(schema:Person) = {ex:i3, ex:i4, eq{2,5}}, J(ext~(foaf:Person), ext~(schema:Person)) = 3/4 = 0.75
  • 10. Instance-based Schema Matching using Jaccard Index • Or possibly like this • $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ • Scenario 2, where J decreases: ex:i3 and ex:i4 are connected by owl:sameAs* and form the equivalence class eq{3,4} • Without owl:sameAs: ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}, ext(schema:Person) = {ex:i3, ex:i4, ex:i5}, J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4 • With owl:sameAs: ext~(foaf:Person) = {ex:i1, ex:i2, eq{3,4}}, ext~(schema:Person) = {eq{3,4}, ex:i5}, J(ext~(foaf:Person), ext~(schema:Person)) = 1/4 = 0.25
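Both scenarios can be reproduced with a small union-find structure that collapses owl:sameAs-connected instances into one representative before the Jaccard index is computed. This is a minimal sketch, not the authors' implementation: the function names `same_as_closure` and `ext_tilde` are hypothetical, and union-find is simply one convenient way to obtain the equivalence classes of the transitive closure.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def same_as_closure(links):
    """Build equivalence classes from owl:sameAs pairs; return a find() function."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)  # unseen instances are their own class
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return find

def ext_tilde(members, find):
    """Replace every member by the representative of its equivalence class."""
    return {find(m) for m in members}

foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema_person = {"ex:i3", "ex:i4", "ex:i5"}

# Scenario 1: ex:i2 owl:sameAs ex:i5, so the shared part grows.
find1 = same_as_closure([("ex:i2", "ex:i5")])
j1 = jaccard(ext_tilde(foaf_person, find1), ext_tilde(schema_person, find1))

# Scenario 2: ex:i3 owl:sameAs ex:i4, so two shared members merge into one.
find2 = same_as_closure([("ex:i3", "ex:i4")])
j2 = jaccard(ext_tilde(foaf_person, find2), ext_tilde(schema_person, find2))

print(j1, j2)
```

The second scenario shows why owl:sameAs is not guaranteed to help: merging two instances that both concepts already share shrinks the intersection faster than the union.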
  • 12. Research Question 1 Does the inclusion of instance-level interlinks positively impact instance-based schema alignments ? 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 12
  • 13. Why should we care? • Providing empirical evidence for schema-matching designers on whether exploiting a large external collection of instance-level interlinks (e.g. from the LOD Cloud) improves the accuracy of schema-matching techniques • In particular, showing the risk or benefit of using owl:sameAs, after a number of studies suggested that a large number of the existing owl:sameAs links in the Web are actually erroneous: 3% of evaluated owl:sameAs links [Hogan et al., JWS 2012], 4% [Raad et al., ISWC 2018], and 20% [Halpin et al., ISWC 2010]
  • 14. Research Question 2 How does the quality of the instance-level interlinks impact the quality of the resulting schema-alignments ? 2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts? 2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 14
  • 16. Dataset 16 (crawled from ~650K datasets in 2015)
  • 17. Dataset
      • # triples: 28,362,198,927
      • # rdf:type statements: 3,321,354,308
      • # rdfs:subClassOf statements: 4,461,717
      • # owl:equivalentClass statements: 1,051,979
      • # explicit owl:sameAs statements: 558,943,116
      • # implicit owl:sameAs statements: 35,201,120,188
      • # equivalence classes (after closure of owl:sameAs): 48,999,148
      • # concepts with explicit members |C|: 833,232
      • # concepts with explicit or implicit members |C⊑|: 976,674
  • 18. Size distribution of the Concepts’ members 18 • 23% of the concepts have one explicit member • 92% of the concepts have ≤ 100 explicit members • 618 concepts have more than 100M explicit or implicit members • 5 concepts have more than 100M explicit members
  • 19. Concepts with more than 100M explicit members (concept: cardinality, share of all rdf:type statements)
      • http://purl.org/linked-data/cube#Observation: 1,306,389,396 (39.3%)
      • http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry: 304,878,654 (9.2%)
      • http://geovocab.org/geometry#Geometry: 167,808,111 (5%)
      • http://knoesis.wright.edu/ssw/ont/sensorobservation.owl#MeasureData: 144,044,989 (4.3%)
      • http://xmlns.com/foaf/0.1/Person: 132,919,327 (4%)
      • Total: 2,056,040,477 (61.9%)
  • 21. Research Question 1 Does the inclusion of instance-level interlinks positively impact instance-based schema alignments? 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
  • 23. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? • 1,051,979 owl:equivalentClass statements in the LOD-a-lot - Hypothesis: all these statements are correct alignments - Only 972 owl:equivalentClass statements have explicit members for both concepts - 208 of these are reflexive alignments (C1, owl:equivalentClass, C1) - 22 are duplicate symmetric alignments, i.e. (C1, owl:equivalentClass, C2) and (C2, owl:equivalentClass, C1) - This leaves 742 alignments between 1,357 distinct concepts (i.e. the gold standard)
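The filtering steps listed on this slide (keep pairs whose concepts both have explicit members, drop reflexive pairs, keep one direction of symmetric pairs) could look roughly as follows. The function name `gold_standard`, the toy concepts and the toy extensions are all hypothetical; only the counts in the final comment come from the slide.

```python
def gold_standard(equivalences, ext):
    """Filter owl:equivalentClass pairs down to usable, distinct alignments."""
    seen = set()
    kept = []
    for c1, c2 in equivalences:
        if not (ext.get(c1) and ext.get(c2)):
            continue  # both concepts need explicit members
        if c1 == c2:
            continue  # reflexive alignment (C1, owl:equivalentClass, C1)
        key = frozenset((c1, c2))
        if key in seen:
            continue  # symmetric (or exact) duplicate of a kept pair
        seen.add(key)
        kept.append((c1, c2))
    return kept

# Toy input (hypothetical concepts and members).
ext = {"a:Person": {"x1"}, "b:Person": {"x2"}, "c:Agent": {"x3"}}
equivalences = [
    ("a:Person", "b:Person"),
    ("b:Person", "a:Person"),  # symmetric duplicate
    ("c:Agent", "c:Agent"),    # reflexive
    ("a:Person", "d:Empty"),   # d:Empty has no explicit members
]
alignments = gold_standard(equivalences, ext)
print(alignments)

# Counts reported on the slide: 972 candidates, 208 reflexive, 22 symmetric duplicates.
remaining = 972 - 208 - 22
print(remaining)
```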
  • 24. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 24 Size distribution of the Concepts’ members of our Gold Standard
  • 25. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? • Jaccard Index distribution for the 742 alignments • Compare J(ext⊑(C1), ext⊑(C2)) with J(ext⊑~(C1), ext⊑~(C2)) for each alignment (C1, owl:equivalentClass, C2) • Runtime: <4 hours on 64GB SSD disk
  • 26. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? • Jaccard Index variation for the 742 alignments • When owl:sameAs is considered, the Jaccard index increases for 381 / 742 of the correct alignments (52%) • Out of these 381 cases, Jaccard increases from 0 to 1 in 44 cases (7%) • When owl:sameAs is considered, the Jaccard index decreases for 25 / 742 of the correct alignments (3%) • Slight drop in impact when only explicit members are considered • Conclusion: owl:sameAs increases the overlap of two equivalent concepts in half of the cases
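The variation counts on this slide come from comparing, for each alignment, the Jaccard index without and with owl:sameAs. A minimal way to tally such a comparison is sketched below; the function name `tally_variation`, the tolerance `eps` and the three toy score pairs are assumptions for illustration, not data from the study.

```python
from collections import Counter

def tally_variation(before, after, eps=1e-12):
    """Count alignments whose Jaccard index rose, fell, or stayed the same."""
    counts = Counter()
    for b, a in zip(before, after):
        if a > b + eps:
            counts["increase"] += 1
        elif a < b - eps:
            counts["decrease"] += 1
        else:
            counts["unchanged"] += 1
    return counts

# Toy scores: one alignment per outcome (values are illustrative only).
without_same_as = [0.4, 0.4, 0.0]
with_same_as = [0.75, 0.25, 0.0]

counts = tally_variation(without_same_as, with_same_as)
print(counts["increase"], counts["decrease"], counts["unchanged"])
```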
  • 27. Research Question 1 Does the inclusion of instance-level interlinks positively impact instance-based schema alignments? 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
  • 29. 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? • 833,232 concepts with explicit members in the LOD-a-lot - Created a random alignment for each concept such that each concept is paired only once - Hypothesis: all these random alignments are erroneous - 416,616 random alignments
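One simple way to pair every concept exactly once, as described on this slide, is to shuffle the concept list and pair neighbours. This is only a sketch of that idea: the function name `random_alignments` and the fixed seed are assumptions, and the slides do not state how the authors drew their pairs.

```python
import random

def random_alignments(concepts, seed=0):
    """Shuffle the concepts and pair neighbours so each concept is used at most once."""
    shuffled = list(concepts)
    random.Random(seed).shuffle(shuffled)
    # With an odd number of concepts, the last one is left unpaired.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]

# Toy run with ten hypothetical concept identifiers.
concepts = [f"ex:C{i}" for i in range(10)]
pairs = random_alignments(concepts)
print(len(pairs))

# Pairing 833,232 concepts this way yields half as many alignments.
expected_pairs = 833_232 // 2
print(expected_pairs)
```

A random pair could, by chance, consist of two genuinely equivalent concepts, which is why the slide states the erroneousness of these alignments as a hypothesis rather than a fact.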
  • 30. 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? • Jaccard Index variation for the 416,616 random alignments • When owl:sameAs is considered, the Jaccard index increases for only 94 / 416,616 of the random alignments (0.02%) • When owl:sameAs is considered, the Jaccard index decreases for 3 / 416,616 of the random alignments (~0%) • Conclusion: owl:sameAs rarely increases the overlap of two non-equivalent concepts
  • 31. Research Question 2 How does the quality of the instance-level interlinks affect the quality of the resulting schema-alignments ? 2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts? 2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 31
  • 33. 2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts? • owl:sameAs quality indicator: • [Raad et al., ISWC 2018] computed an error degree for each owl:sameAs link • The error degree lies in [0.0, 1.0] and is based on the community structure of the owl:sameAs network and the symmetry of the links • Manual evaluation shows that owl:sameAs links with error degree >0.99 have a high probability of being erroneous • Manual evaluation shows that owl:sameAs links with error degree ≤0.4 have a high probability of being correct • Transitive closure (a): computed after discarding the ~1M owl:sameAs links with error degree >0.99 • Transitive closure (b): computed after discarding the ~150M owl:sameAs links with error degree >0.4
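The two filtered closures differ only in the error-degree threshold applied before the transitive closure is computed. A minimal sketch of that filtering step, assuming each link is available as a hypothetical `(subject, object, error_degree)` tuple, is shown below; the function name `keep_links` and the toy links are not from the source.

```python
def keep_links(scored_links, max_error_degree):
    """Keep owl:sameAs links whose error degree does not exceed the threshold."""
    return [(a, b) for a, b, err in scored_links if err <= max_error_degree]

# Toy links with made-up error degrees.
scored_links = [
    ("ex:a", "ex:b", 0.05),
    ("ex:c", "ex:d", 0.60),
    ("ex:e", "ex:f", 1.00),
]

links_a = keep_links(scored_links, 0.99)  # closure (a): discard error degree > 0.99
links_b = keep_links(scored_links, 0.4)   # closure (b): discard error degree > 0.4
print(len(links_a), len(links_b))
```

The stricter threshold removes far more links (about 150M versus about 1M according to the slide), so it trades coverage for precision.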
  • 34. 2. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 34 Jaccard Index variation for the 742 alignments • When owl:sameAs with error degree >0.99 are discarded, there is a slight decrease in performance: • Jaccard increases in 51% of the cases (instead of 52%) • Jaccard increases from 0 to 1 in 37 cases (instead of 44) • When owl:sameAs with error degree >0.4 are discarded, there is a big decrease in performance: • Jaccard increases in only 13% of the cases (instead of 52%) • Jaccard increases from 0 to 1 in two cases (instead of 44) • Jaccard decreases in 39 cases (instead of 25)
  • 35. Research Question 2 How does the quality of the instance-level interlinks affect the quality of the resulting schema-alignments ? 2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts? 2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 35
  • 36. 2. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? • Jaccard Index variation for the 416,616 random alignments • When owl:sameAs links with error degree >0.99 are discarded, performance improves (fewer random alignments gain overlap): • Jaccard increases in 27 cases (instead of 94) • When owl:sameAs links with error degree >0.4 are discarded, performance improves substantially: • Jaccard increases in 2 cases (instead of 94)
  • 38. Take away message This work provides an empirical study of the impact of including instance-level interlinks on the overlap between concepts' members • Including instance-level interlinks can enhance the performance of instance-based schema alignments - Increases the overlap for 52% of the existing alignments in the LOD-a-lot - Increases the overlap for less than 0.3% of randomly created alignments • Inference positively impacts instance-based schema alignments - Also considering the implicit members improves the results on the Gold Standard by 3 pp • Discarding only isolated owl:sameAs links in the network can increase the quality of instance-based schema alignments (owl:sameAs links are probably not as bad as we first thought) - Reduces the cases where the Jaccard index increases for non-equivalent concepts by 71%
  • 39. Future Work • Use a larger gold standard to validate the results presented here • Investigate alternative, better-tailored instance-based measures to detect new schema alignments at the scale of the Web of Data, exploiting: - the higher quality subset of the owl:sameAs links; - both the explicit and implicit members of the concepts
  • 40. On the Impact of sameAs on Schema Matching JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH Knowledge Representation & Reasoning Group Thank you for your attention... @joeraad_ Code + Results https://github.com/raadjoe/impact-sameAs-schema-matching