05/26/14 Heiko Paulheim 1
Identifying Wrong Links between Datasets
by Multi-dimensional Outlier Detection
Heiko Paulheim
Motivation
• Dataset interlinks can be wrong for many reasons
– Oversimplified heuristic generation (e.g., label equality)
– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)
– Concept drift of link targets
• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1
• now it's a disambiguation page
[Figure labels: 04/08/08 and 12/04/07]
<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .
Overall Idea
• Links between datasets follow certain patterns
– e.g., linking a mo:MusicArtist to a dbo:Artist,
and a mo:MusicalWork to a dbo:Album or a dbo:Song
• Wrong links violate those patterns
• Hence, outlier detection should find wrong links
– Definition: “finding patterns in data that do not conform to the expected
normal behavior” (Chandola et al., 2009)
• Differences from related approaches
– does not require the same schema to be used in both datasets
– nor schema mappings
– no external/human knowledge required
Projection of Links into Vector Space
• Represent each link as a point in an n-dimensional vector space
– e.g., using their direct types
• Outliers are found in sparse areas
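One way to make "sparse areas" concrete is a distance-based score. The sketch below is only an illustration, not one of the algorithms named in this deck: it scores each link by its mean Euclidean distance to its k nearest neighbours. The function name `knn_outlier_scores` and the toy vectors are hypothetical.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical toy link set: four links share one type pattern, one differs.
links = [
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 0, 1),
]
scores = knn_outlier_scores(links, k=2)
```

The four identical vectors sit in a dense area and receive a score of 0, while the deviating link receives the highest score.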
Projection of Links into Vector Space
• Types
– each type of LHS and RHS resource becomes a binary (0/1) feature
– types on both sides are treated separately
• i.e., LHS_foaf:Person and RHS_foaf:Person
are distinct features
• Properties
– each incoming/outgoing property of LHS and RHS resource
becomes a binary (0/1) feature
– properties on both sides are treated separately
– incoming and outgoing properties are treated separately
• i.e., LHS_foaf:based_near, RHS_foaf:based_near,
foaf:based_near_LHS and foaf:based_near_RHS
are all distinct features
• Joint feature set of types and properties
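The projection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input dictionaries, the function names and the example resources are hypothetical, and the convention that a side prefix marks outgoing properties while a side suffix marks incoming ones is an assumption, since the slide does not say which name belongs to which direction.

```python
def link_features(lhs, rhs):
    """Return the names of all binary features that are 1 for one link.

    lhs and rhs are dicts with keys 'types', 'out' and 'in', each holding
    a set of type or property names for the resource on that side.
    """
    features = set()
    for side, res in (("LHS", lhs), ("RHS", rhs)):
        features |= {f"{side}_{t}" for t in res["types"]}  # types
        features |= {f"{side}_{p}" for p in res["out"]}    # outgoing (assumed prefix)
        features |= {f"{p}_{side}" for p in res["in"]}     # incoming (assumed suffix)
    return features

def to_matrix(links):
    """Turn a list of (lhs, rhs) pairs into column names and 0/1 rows."""
    per_link = [link_features(lhs, rhs) for lhs, rhs in links]
    columns = sorted(set().union(*per_link))
    rows = [[1 if c in feats else 0 for c in columns] for feats in per_link]
    return columns, rows

# Hypothetical resources for illustration only.
artist = {"types": {"mo:MusicArtist"}, "out": {"foaf:name"}, "in": {"mo:performer"}}
band = {"types": {"dbo:Artist", "dbo:Band"}, "out": {"dbo:genre"}, "in": {"dbo:artist"}}
page = {"types": set(), "out": {"dbo:wikiPageDisambiguates"}, "in": set()}

columns, rows = to_matrix([(artist, band), (artist, page)])
```

Because features are prefixed or suffixed per side, the same type or property on the LHS and on the RHS ends up in two different columns, as the slide requires.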
Experiments
• Datasets: link sets between
– BBC Peel Sessions and DBpedia (2,087 links)
– DBTropes and DBpedia (4,229 links)
• Gold standard
– 100 randomly sampled links from each set, manually evaluated
– Peel: 90 out of 100 are correct
– Tropes: 76 out of 100 are correct
• We run outlier detection on the whole link set
– and validate the output only on the gold standard
Experiments
• Outlier Detection Approaches
– assign a score (or label) to each data point
– the higher the score, the likelier it is an outlier
• Evaluation
– Order links by descending outlier score
– Ideally, all outliers rank above all non-outliers
– Plot a ROC curve to measure the quality
• i.e., AUC
– F-Measure
• with best possible threshold
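The two evaluation measures can be sketched in plain Python. This is an illustrative sketch, not the evaluation code behind the reported numbers: AUC is computed as the probability that a randomly chosen outlier scores higher than a randomly chosen non-outlier, and the best F-measure is found by trying every observed score as a threshold. The function names and the toy scores are hypothetical.

```python
def roc_auc(scores, is_outlier):
    """AUC via pairwise comparison of outliers and non-outliers (ties count 0.5)."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1(scores, is_outlier):
    """Best F1 over all thresholds; a link is flagged when score >= threshold."""
    best = 0.0
    total_pos = sum(is_outlier)
    for t in sorted(set(scores)):
        flagged = [y for s, y in zip(scores, is_outlier) if s >= t]
        tp = sum(flagged)
        if tp == 0:
            continue  # F1 is 0 when no true outlier is flagged
        precision = tp / len(flagged)
        recall = tp / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Hypothetical toy gold standard: True marks a wrong link.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [True, False, True, False, False, False]

auc = roc_auc(scores, labels)
f1 = best_f1(scores, labels)
```

On this toy ranking one correct link is ranked between the two wrong ones, which gives an AUC of 0.875 and a best F1 of 0.8 at threshold 0.7.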
Results
• Type features work better than property features
• LoOP delivers reliably good results
– though not the best
• Best performance on Peel dataset
– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
• Best performance on DBTropes dataset
– LOF (F1 = 0.5, AUC = 0.619)
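For context when reading these F1 values, a trivial baseline can be worked out from the gold standard figures given in the Experiments slide (Peel: 90 of 100 correct, DBTropes: 76 of 100 correct). The sketch below is not part of the original evaluation. It assumes every incorrect link in the gold standard counts as an outlier and computes the F1 of flagging all links as wrong. The function name `flag_all_f1` is hypothetical.

```python
def flag_all_f1(wrong, total):
    """F1 of a detector that flags every link as wrong."""
    precision = wrong / total  # share of flagged links that are truly wrong
    recall = 1.0               # every wrong link is flagged
    return 2 * precision * recall / (precision + recall)

peel_baseline = flag_all_f1(wrong=10, total=100)    # 90 of 100 correct
tropes_baseline = flag_all_f1(wrong=24, total=100)  # 76 of 100 correct
```

Under these assumptions the baseline F1 is about 0.18 for Peel and about 0.39 for DBTropes.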
Results
• ROC curves for Peel dataset
[Figure: ROC curves, both axes from 0 to 1. Legend: GAS k=10, GAS k=25, GAS k=50, LOF, LoOP k=10, LoOP k=25, LoOP k=50, CBLOF, LDCOF, 1-class SVM]
Note: GAS k=10,25,50 identical, LoOP k=25,50 identical
Results
• ROC curves for DBTropes dataset
[Figure: ROC curves, both axes from 0 to 1. Legend: GAS k=10, GAS k=25, GAS k=50, LOF, LoOP k=10, LoOP k=25, LoOP k=50, CBLOF, LDCOF, 1-class SVM]
Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical,
CBLOF and LDCOF mostly identical
Runtimes
• Most outlier detection algorithms are reasonably fast
– both linksets processed in less than 10 seconds on a normal laptop
• Exceptions:
– clustering (for CBLOF/LDCOF) takes up to 30 seconds
– 1-class SVM takes up to 15 minutes
• ...but creating the feature vector representation
takes much more time
– some hours against public SPARQL endpoint(s)
– reasonably fast with downloaded dumps
Discussion of Results
• Results on Peel dataset better than on DBTropes dataset
• Projection based on types better than on properties
– most likely due to lower dimensionality of vector space
– Peel: #types = 34, #properties = 60
– DBTropes: #types = 81, #properties = 142
• Variation of outlier detection algorithms across datasets
– also observed in other experiments
– general rules of thumb are hard to come up with
Possible Improvements & Future Work
• Other projection methods
– e.g., using numeric counts of relations
• Other outlier detection algorithms
– e.g., Replicating Neural Networks and their generalizations
• Preprocessing
– e.g., Feature Subset Selection
– caveat: the valuable features are often sparse
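Two of the ideas above can be illustrated together: a projection that uses numeric counts instead of 0/1 flags, and a naive feature subset selection that drops rarely used features. This is a hypothetical sketch, not a method from the deck. The function names, the support threshold and the example triples are assumptions made for illustration.

```python
from collections import Counter

def count_features(side, outgoing_triples):
    """Numeric projection: how often each outgoing property occurs for a resource."""
    counts = Counter(prop for prop, _obj in outgoing_triples)
    return {f"{side}_{prop}": n for prop, n in counts.items()}

def drop_rare(rows, min_support):
    """Naive feature selection: keep features present in at least min_support rows."""
    support = Counter(f for row in rows for f in row)
    keep = {f for f, n in support.items() if n >= min_support}
    return [{f: v for f, v in row.items() if f in keep} for row in rows]

# Hypothetical RHS resources: two band-like pages and one disambiguation page.
rows = [
    count_features("RHS", [("dbo:genre", "x"), ("dbo:genre", "y"), ("dbo:bandMember", "z")]),
    count_features("RHS", [("dbo:genre", "x")]),
    count_features("RHS", [("dbo:wikiPageDisambiguates", "a"), ("dbo:wikiPageDisambiguates", "b")]),
]
filtered = drop_rare(rows, min_support=2)
```

In this toy example the filter removes the only feature that distinguishes the third resource, leaving it with an empty feature set. That is the caveat in miniature: the features that reveal a wrong link tend to be the sparse ones.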
Possible Improvements & Future Work
• So far, we have looked at owl:sameAs links
• The approach is not limited to that
– should work for other link predicates as well
– e.g., a dataset of persons and a dataset of places linked by foaf:based_near
• It is not even limited to linksets
– also for debugging statements inside a knowledge base
– e.g., dbpedia-owl:deathPlace
