DBpediaNYD –
A Silver Standard Benchmark Dataset
for Semantic Relatedness in DBpedia

10/22/13 Heiko Paulheim

Motivation
• There are quite a few approaches to entity ranking/statement weighting on Linked Data
– and DBpedia in particular
• Examples:
– Franz et al. (2009) – Tensor Decomposition
– Meij et al. (2009) – Machine Learning
– Mirizzi et al. (2010) – Web Search Engines
– Mulay and Kumar (2011) – Machine Learning
– Hees et al. (2012) – Crowd Sourcing
– Nunes et al. (2012) – Social Network Analysis

Motivation
• However,
– none of those approaches has been competitively evaluated
– none of those approaches has been evaluated at large scale
• Evaluation with
– small private data sets
– user studies
• Approaches using Machine Learning
– require training data
– which is expensive to obtain

The Dataset
• Large-scale dataset (several thousand instances)
– statements with strengths
• Strength value: Normalized Google Distance (NGD)
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
• NGD has been shown to correlate with human strength associations
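
The formula shown on this slide was an image and did not survive extraction. Assuming the slide used the standard definition of the Normalized Google Distance (Cilibrasi and Vitányi), it can be written with the symbols defined above as follows; smaller values mean the two terms are more closely related:

```latex
\[
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log M - \min\{\log f(x), \log f(y)\}}
\]
```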

The Dataset
• NGD is a symmetric value
– the NYD dataset also contains asymmetric values
• Asymmetric Normalized Google Distance
• f(x): number of search results containing x
• f(x,y): number of search results containing both x and y
• M: number of pages in search engine index
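
The asymmetric formula on this slide was also an image and is not recoverable from the text. The sketch below implements the standard symmetric NGD and, purely as an assumption, one possible directional variant in which the numerator uses only f(x). That variant is not confirmed by the slides; it was chosen because it makes the symmetric value equal to the larger of the two directions, which is the pattern seen in the John Lennon / Yoko Ono example. All counts are made up.

```python
import math


def ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Symmetric Normalized Google Distance (standard definition)."""
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


def directional_ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """ASSUMED directional variant x -> y: the numerator uses f(x) only.

    Illustrative only; the definition used for DBpediaNYD may differ.
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (log_fx - log_fxy) / (math.log(m) - min(log_fx, log_fy))


# Made-up counts for illustration (not real search engine numbers).
FX, FY, FXY, M = 2_000_000, 500_000, 400_000, 10_000_000_000

symmetric = ngd(FX, FY, FXY, M)
x_to_y = directional_ngd(FX, FY, FXY, M)  # from the more frequent term
y_to_x = directional_ngd(FY, FX, FXY, M)  # from the less frequent term
```

Under this assumed variant, the distance from the less frequent term to the more frequent one is the smaller of the two, because most pages mentioning the rarer term also mention the other.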

Constructing the Dataset
• We sampled 10,000 statements
– with DBpedia resources as subject and object (e.g., no type statements, no literals)
– with dbpedia or dbpprop predicate
• ...and computed symmetric/asymmetric NGD
– using the labels as search strings
– using Yahoo BOSS
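
A minimal sketch of the counting step, assuming three search calls per statement: one for each label and one for both together. The `hit_count` callable is a hypothetical stand-in for a search API client (no real Yahoo BOSS call is shown), and the quoted-phrase query syntax is an assumption.

```python
from typing import Callable, Dict, List, Tuple


def count_queries(
    subject_label: str,
    object_label: str,
    hit_count: Callable[[str], int],
) -> Tuple[int, int, int]:
    """Return (f(x), f(y), f(x,y)) for one statement using three search calls.

    `hit_count` is a hypothetical search client; the joint query syntax
    (two quoted phrases in one query) is an assumption.
    """
    fx = hit_count(f'"{subject_label}"')
    fy = hit_count(f'"{object_label}"')
    fxy = hit_count(f'"{subject_label}" "{object_label}"')
    return fx, fy, fxy


# Stub search engine with made-up counts, for illustration only.
FAKE_INDEX: Dict[str, int] = {
    '"John Lennon"': 1000,
    '"Yoko Ono"': 300,
    '"John Lennon" "Yoko Ono"': 250,
}
calls: List[str] = []


def fake_hit_count(query: str) -> int:
    """Record the query and return its made-up result count."""
    calls.append(query)
    return FAKE_INDEX[query]


counts = count_queries("John Lennon", "Yoko Ono", fake_hit_count)
```

Passing the search client in as a parameter keeps the sketch independent of any particular search API.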

The Dataset
• Random sample of 10,000 statements
– i.e., 30,000 search engine calls (at 80 cents per 1,000 calls → 24 USD)
• 3,058 pairs of resources had to be discarded
– f(x)<f(x,y) or f(y)<f(x,y)
– search engines sometimes don't count properly :-(
• Result:
– 6,942 weighted statements (symmetric)
– 13,884 weighted statements (asymmetric)
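
The discard rule and the numbers on this slide can be checked with a short sketch. The example counts are made up; the totals are the ones stated above, and the factor of three calls per statement is inferred from 30,000 calls for 10,000 statements.

```python
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (f(x), f(y), f(x,y))


def is_consistent(fx: int, fy: int, fxy: int) -> bool:
    """A joint count can never legitimately exceed either individual count."""
    return fx >= fxy and fy >= fxy


def keep_consistent(pairs: List[Counts]) -> List[Counts]:
    """Drop pairs whose counts violate f(x) >= f(x,y) or f(y) >= f(x,y)."""
    return [p for p in pairs if is_consistent(*p)]


# Made-up counts: the second and third tuples violate the rule.
sample = [(1000, 300, 250), (100, 300, 250), (1000, 200, 250)]
kept = keep_consistent(sample)

# Arithmetic from the slide.
sampled, discarded = 10_000, 3_058
symmetric_statements = sampled - discarded        # one value per pair
asymmetric_statements = 2 * symmetric_statements  # one value per direction
search_calls = 3 * sampled                        # f(x), f(y), f(x,y)
cost_usd = search_calls / 1_000 * 0.80            # 80 cents per 1,000 calls
```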

The Dataset
• Example:
– dbpedia:John_Lennon and dbpedia:Yoko_Ono
• Distances:
– symmetric: 0.18
– John Lennon → Yoko Ono: 0.18
– Yoko Ono → John Lennon: 0.03
• Explanation:
– Yoko Ono is famous for being John Lennon's wife, and is most often mentioned in that context
– John Lennon is more famous for being a member of the Beatles

Example: the DBpedia FindRelated Service
• We trained two regression SVMs (LibSVM) based on DBpediaNYD
– one for symmetric, one for asymmetric
– service allows for finding the most related among the linked resources
• Example results:
• http://wiki.dbpedia.org/FindRelated

Conclusion and Outlook
• DBpediaNYD allows for large-scale evaluation
– rather a silver standard
– does not replace manually created gold standards
• Future work
– validate DBpediaNYD with users
– compare search engines

Something Completely Different
• Challenges enumerated in the workshop intro this morning
– “Logical inference on noisy data”
• Talk on “Type Inference on Noisy RDF Data”
– Was actually applied for DBpedia 3.9
– Friday, 3:15, Bayside 204A
