Type Inference on Noisy RDF Data

Heiko Paulheim, Christian Bizer
10/31/13

The Problem
• One promise of the Semantic Web:
– You can issue structured queries
– e.g., "List all presidents that graduated from Harvard Law School"
– SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }

The Problem
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• ...if we run this against DBpedia, we get one result
– i.e., Elwell Stephen Otis
• But...

The Problem
• So what is going wrong?
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• In DBpedia, Barack Obama is not of type President!
• How can we add missing types?

Is It a Big Problem?
• DBpedia has at least 2.7 million missing type statements
– w.r.t. the DBpedia ontology
– found using co-occurrence analysis of matching classes in YAGO and DBpedia
– a very optimistic lower bound
• Highly incomplete classes:
– Species: >870,000 missing statements
– Person: >510,000 missing statements
– Event: >150,000 missing statements

A Naive Approach
• Idea: exploit properties with domain and range
• Pseudo RDFS Reasoning:
– CONSTRUCT { ?x a ?t }
  WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
          UNION
          { ?y ?r ?x . ?r rdfs:range ?t } }
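
The domain/range rule above can be sketched outside a triple store as follows. This is a minimal illustration, not the authors' implementation: the `DOMAIN` and `RANGE` tables are a hypothetical schema excerpt (the class assignments for `almaMater` are assumed, not taken from the slides), and `infer_types` is a name made up for this sketch.

```python
from collections import defaultdict

# Hypothetical schema excerpt (assumption): property -> declared domain / range class.
DOMAIN = {"almaMater": "Person"}
RANGE = {"almaMater": "EducationalInstitution"}


def infer_types(triples, domain=DOMAIN, range_=RANGE):
    """Mimic the CONSTRUCT query: subjects get the domain, objects get the range."""
    types = defaultdict(set)
    for subject, prop, obj in triples:
        if prop in domain:
            types[subject].add(domain[prop])
        if prop in range_:
            types[obj].add(range_[prop])
    return dict(types)


triples = [("Barack_Obama", "almaMater", "Harvard_Law_School")]
inferred = infer_types(triples)
print(inferred)
```

Every single statement contributes a type unconditionally, so the result is only as reliable as the least reliable statement about a resource.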

A Naive Approach
• Experiment with Barack Obama:
– Person, PersonFunction, Actor, Organization
• Experiment with Germany:
– Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation

A Naive Approach
• What is going on here?
– DBpedia data is noisy
– One wrong statement is enough for a wrong conclusion
– e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
• Germany example: 69,000 statements
– 20 wrong types can come from 20 wrong statements
– i.e., an error rate of 0.03% is enough for a totally screwed result
– ...but that would be an excellent data quality for a LOD source!
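
The arithmetic behind the Germany example can be checked with a few lines. The sketch below assumes that `dbpedia-owl:award` declares `rdfs:range` Award (the slides imply this but do not state it); the `RANGE` table and variable names are illustrative only.

```python
# Assumption: dbpedia-owl:award has rdfs:range Award (implied, not stated, in the slides).
RANGE = {"award": "Award"}

# The noisy statement quoted in the slides.
statements = [("Kurt_H._Debus", "award", "Germany")]

# Range reasoning assigns the range class to every object, including Germany.
wrong_types = {RANGE[p] for s, p, o in statements if o == "Germany" and p in RANGE}

# Figures from the slides: 20 wrong statements among 69,000.
total_statements = 69_000
wrong_statements = 20
error_rate = wrong_statements / total_statements

print(wrong_types, f"{error_rate:.2%}")
```

A single bad triple is enough to make Germany an Award, and 20 such triples amount to roughly three statements in ten thousand.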

SDType Approach
• Idea: outgoing/incoming properties are indicators for a resource's type
– e.g.: starring → Movie
– e.g.: author⁻¹ (i.e., an incoming author link) → Writer
• Basic compiled statistics
– P(C|p) := probability of class C in presence of property p
– e.g.: P(dbpedia:Film|starring) = 0.79
– e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
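
One way to estimate P(C|p) is to count, among all resources that carry property p, the share that has type C. The sketch below does this on made-up toy data; the function name, the resource names, and the resulting value of 0.75 are illustrative assumptions and do not reproduce the DBpedia figures above.

```python
from collections import Counter, defaultdict


def conditional_type_probabilities(property_uses, types):
    """Estimate P(C|p) as the share of resources with property p that have type C."""
    resources_with = defaultdict(set)
    for resource, prop in property_uses:
        resources_with[prop].add(resource)

    probs = {}
    for prop, resources in resources_with.items():
        counts = Counter(c for r in resources for c in types.get(r, ()))
        probs[prop] = {c: n / len(resources) for c, n in counts.items()}
    return probs


# Toy data (hypothetical): four resources use "starring", three of them are films.
property_uses = [
    ("Film_A", "starring"),
    ("Film_B", "starring"),
    ("Film_C", "starring"),
    ("Show_D", "starring"),
]
types = {
    "Film_A": {"Film"},
    "Film_B": {"Film"},
    "Film_C": {"Film"},
    "Show_D": {"TelevisionShow"},
}

probs = conditional_type_probabilities(property_uses, types)
print(probs["starring"])
```

An incoming property can be handled the same way by recording it under a separate key for the object of the statement, for example a hypothetical `"author_inv"`.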

SDType Approach
• Based on precompiled statistics
– Find types of instances
– Using voting
• score(C) = avg(all properties p) P(C|p)
• Refinement:
– Weight for properties: discriminative power
– weight(p) = sum(all classes C) (P(C) - P(C|p))²
– i.e., how strongly this property's class distribution deviates from the overall class distribution
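
The voting step and its weighted refinement can be sketched as below. This is a minimal reading of the two formulas, not the published implementation: normalising by the sum of weights is an assumption, all probabilities are invented toy values, and `property_weight` and `score` are hypothetical names.

```python
def property_weight(prior, cond):
    """weight(p): sum over all classes of (P(C) - P(C|p)) squared."""
    classes = set(prior) | set(cond)
    return sum((prior.get(c, 0.0) - cond.get(c, 0.0)) ** 2 for c in classes)


def score(props, prior, cond_by_prop):
    """Weighted average of P(C|p) over the properties observed on one resource.

    Assumption: weights are normalised by their sum; returns {} if all are zero.
    """
    weights = {p: property_weight(prior, cond_by_prop[p]) for p in props}
    total = sum(weights.values())
    if total == 0:
        return {}
    classes = {c for p in props for c in cond_by_prop[p]}
    return {
        c: sum(weights[p] * cond_by_prop[p].get(c, 0.0) for p in props) / total
        for c in classes
    }


# Toy values (hypothetical): "name" mirrors the overall distribution, "starring" does not.
prior = {"Film": 0.2, "Person": 0.8}
cond_by_prop = {
    "starring": {"Film": 0.8, "Person": 0.2},
    "name": {"Film": 0.2, "Person": 0.8},
}

scores = score(["starring", "name"], prior, cond_by_prop)
print(scores)
```

In this toy case the uninformative property receives weight 0, so the score for Film stays near 0.8 instead of dropping to the plain average of 0.5.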

Evaluation
• Two-fold evaluation
– On DBpedia and OpenCyc as "Silver Standard" (automatic, 10,000 random instances)
– On untyped DBpedia resources (manual, 100 instances)
• Using only incoming properties
– Using outgoing properties is trivial!

Evaluation Results
• On DBpedia
[Plot: precision over recall, with curves for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• On OpenCyc
[Plot: precision over recall, with curves for min. 1 link, min. 10 links, and min. 25 links]

Evaluation Results
• Evaluation on untyped resources
– Random sample of 100 untyped resources
– Manual checking of precision
[Plot: precision and # found types over the lower bound for threshold]

Evaluation Results
• DBpedia:
– works reasonably well (F-measure 0.89)
• OpenCyc:
– harder because of deeper class hierarchy (F-measure 0.60)
• General:
– having more links increases precision (in contrast to RDFS reasoning)
– more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)

Deployment
• Heuristic types have been included in DBpedia 3.9
– for previously untyped instances
– 3.4 million type statements at precision ~0.95
• Also includes many resources without a Wikipedia page
– i.e., generated from a red link
• Runtime
– Complexity O(PT), where P is the number of property assertions and T is the number of type assertions
– ~24h for processing DBpedia

Conclusion and Outlook
• SDType approach works at high quality
– outperforms most state-of-the-art approaches on DBpedia
– deployed for DBpedia 3.9
• Same approach can be used for
– validating links
  – within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
  – across datasets: to be done


Mais conteúdo relacionado

Mais procurados

Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsHeiko Paulheim
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsHeiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge GraphsHeiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Heiko Paulheim
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Heiko Paulheim
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopHeiko Paulheim
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 
Challenges of Building Web Observatories
Challenges of Building Web ObservatoriesChallenges of Building Web Observatories
Challenges of Building Web ObservatoriesSteffen Staab
 
20150415 keynote open DIET 2015
20150415 keynote open DIET 201520150415 keynote open DIET 2015
20150415 keynote open DIET 2015fpilotti
 

Mais procurados (11)

Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI SystemsKnowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
Knowledge Matters! The Role of Knowledge Graphs in Modern AI Systems
 
From Wikis to Knowledge Graphs
From Wikis to Knowledge GraphsFrom Wikis to Knowledge Graphs
From Wikis to Knowledge Graphs
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Machine Learning & Embeddings for Large Knowledge Graphs
Machine Learning & Embeddings  for Large Knowledge GraphsMachine Learning & Embeddings  for Large Knowledge Graphs
Machine Learning & Embeddings for Large Knowledge Graphs
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist’s Perspec...
 
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
Big Data, Smart Algorithms, and Market Power - A Computer Scientist's Perspec...
 
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on TopServing DBpedia with DOLCE - More Than Just Adding a Cherry on Top
Serving DBpedia with DOLCE - More Than Just Adding a Cherry on Top
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
 
Challenges of Building Web Observatories
Challenges of Building Web ObservatoriesChallenges of Building Web Observatories
Challenges of Building Web Observatories
 
20150415 keynote open DIET 2015
20150415 keynote open DIET 201520150415 keynote open DIET 2015
20150415 keynote open DIET 2015
 

Semelhante a Type Inference on Noisy RDF Data

Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceLeon Derczynski
 
Radically Open at the National Archives
Radically Open at the National ArchivesRadically Open at the National Archives
Radically Open at the National ArchivesJon Voss
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsJon Voss
 
DS2014: Feature selection in hierarchical feature spaces
DS2014: Feature selection in hierarchical feature spacesDS2014: Feature selection in hierarchical feature spaces
DS2014: Feature selection in hierarchical feature spacesPetar Ristoski
 
BHL Tech Status Update Tech Director W.Ulate 2015.12.11
BHL Tech Status Update Tech Director W.Ulate 2015.12.11BHL Tech Status Update Tech Director W.Ulate 2015.12.11
BHL Tech Status Update Tech Director W.Ulate 2015.12.11William Ulate
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysisLuke Czarnecki
 
Exploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningExploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningHeiko Paulheim
 
Digital Project Clinic
Digital Project ClinicDigital Project Clinic
Digital Project ClinicWiLS
 
Downsizing Your Depository: Dealing with Mandates from Your Administration
Downsizing Your Depository: Dealing with Mandates from Your AdministrationDownsizing Your Depository: Dealing with Mandates from Your Administration
Downsizing Your Depository: Dealing with Mandates from Your AdministrationChristopher Brown
 
Magnetic - Query Categorization at Scale
Magnetic - Query Categorization at ScaleMagnetic - Query Categorization at Scale
Magnetic - Query Categorization at ScaleAlex Dorman
 
Civil War Data 150 at DLF Fall Forum 2011
Civil War Data 150 at DLF Fall Forum 2011Civil War Data 150 at DLF Fall Forum 2011
Civil War Data 150 at DLF Fall Forum 2011Jon Voss
 
Linked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsLinked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsJon Voss
 
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...Heiko Paulheim
 
Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion
Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data FusionLearning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion
Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data FusionVolha Bryl
 
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...Open Knowledge Maps
 
Eng102 stevenson fall15_m_bdraft2
Eng102 stevenson fall15_m_bdraft2Eng102 stevenson fall15_m_bdraft2
Eng102 stevenson fall15_m_bdraft2SCC Library
 
Intro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsIntro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsJon Voss
 

Semelhante a Type Inference on Noisy RDF Data (20)

Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
 
Radically Open at the National Archives
Radically Open at the National ArchivesRadically Open at the National Archives
Radically Open at the National Archives
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & Museums
 
DS2014: Feature selection in hierarchical feature spaces
DS2014: Feature selection in hierarchical feature spacesDS2014: Feature selection in hierarchical feature spaces
DS2014: Feature selection in hierarchical feature spaces
 
BHL Tech Status Update Tech Director W.Ulate 2015.12.11
BHL Tech Status Update Tech Director W.Ulate 2015.12.11BHL Tech Status Update Tech Director W.Ulate 2015.12.11
BHL Tech Status Update Tech Director W.Ulate 2015.12.11
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysis
 
Exploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data MiningExploiting Linked Open Data as Background Knowledge in Data Mining
Exploiting Linked Open Data as Background Knowledge in Data Mining
 
Digital Project Clinic
Digital Project ClinicDigital Project Clinic
Digital Project Clinic
 
Downsizing Your Depository: Dealing with Mandates from Your Administration
Downsizing Your Depository: Dealing with Mandates from Your AdministrationDownsizing Your Depository: Dealing with Mandates from Your Administration
Downsizing Your Depository: Dealing with Mandates from Your Administration
 
Magnetic - Query Categorization at Scale
Magnetic - Query Categorization at ScaleMagnetic - Query Categorization at Scale
Magnetic - Query Categorization at Scale
 
Civil War Data 150 at DLF Fall Forum 2011
Civil War Data 150 at DLF Fall Forum 2011Civil War Data 150 at DLF Fall Forum 2011
Civil War Data 150 at DLF Fall Forum 2011
 
Linked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & MuseumsLinked Open Data in Libraries Archives & Museums
Linked Open Data in Libraries Archives & Museums
 
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in ...
 
Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion
Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data FusionLearning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion
Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion
 
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
The Student's and Researcher's Guide to Discovery: Exploring Scientific Field...
 
Eng102 stevenson fall15_m_bdraft2
Eng102 stevenson fall15_m_bdraft2Eng102 stevenson fall15_m_bdraft2
Eng102 stevenson fall15_m_bdraft2
 
Intro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & MuseumsIntro to Linked Open Data in Libraries, Archives & Museums
Intro to Linked Open Data in Libraries, Archives & Museums
 

Mais de Heiko Paulheim

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfHeiko Paulheim
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vecHeiko Paulheim
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine LearningHeiko Paulheim
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionHeiko Paulheim
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataHeiko Paulheim
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerHeiko Paulheim
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Heiko Paulheim
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaHeiko Paulheim
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionHeiko Paulheim
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesHeiko Paulheim
 

Mais de Heiko Paulheim (12)

Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
New Adventures in RDF2vec
New Adventures in RDF2vecNew Adventures in RDF2vec
New Adventures in RDF2vec
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
 
Fast Approximate A-box Consistency Checking using Machine Learning
Fast Approximate  A-box Consistency Checking using Machine LearningFast Approximate  A-box Consistency Checking using Machine Learning
Fast Approximate A-box Consistency Checking using Machine Learning
 
Combining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly DetectionCombining Ontology Matchers via Anomaly Detection
Combining Ontology Matchers via Anomaly Detection
 
What the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open DataWhat the Adoption of schema.org Tells about Linked Open Data
What the Adoption of schema.org Tells about Linked Open Data
 
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner
 
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
Data Mining with Background Knowledge from the Web - Introducing the RapidMin...
 
Detecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpediaDetecting Incorrect Numerical Data in DBpedia
Detecting Incorrect Numerical Data in DBpedia
 
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier DetectionIdentifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
 
Extending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List PagesExtending DBpedia with Wikipedia List Pages
Extending DBpedia with Wikipedia List Pages
 

Último

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Último (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Type Inference on Noisy RDF Data

  • 1. Type Inference on Noisy RDF Data. Heiko Paulheim, Christian Bizer. 10/31/13
  • 2. The Problem
      • One promise of the Semantic Web:
          – You can issue structured queries
          – e.g., "List all presidents that graduated from Harvard Law School"
          – SELECT ?x WHERE {
              ?x a dbpedia-owl:President .
              ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
  • 3. The Problem
      • SELECT ?x WHERE {
          ?x a dbpedia-owl:President .
          ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
      • ...if we run this against DBpedia, we get one result
          – i.e., Elwell Stephen Otis
      • But...
  • 5. The Problem
      • So what is going wrong?
      • SELECT ?x WHERE {
          ?x a dbpedia-owl:President .
          ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
      • In DBpedia, Barack Obama is not of type President!
      • How can we add missing types?
  • 6. Is It a Big Problem?
      • DBpedia has at least 2.7 million missing type statements
          – w.r.t. the DBpedia ontology
          – found using co-occurrence analysis of matching classes in YAGO and DBpedia
          – a very optimistic lower bound
      • Highly incomplete classes:
          – Species: >870,000 missing statements
          – Person: >510,000 missing statements
          – Event: >150,000 missing statements
  • 7. A Naive Approach
      • Idea: exploit properties with domain and range
      • Pseudo RDFS Reasoning:
          – CONSTRUCT {?x a ?t}
            WHERE { {?x ?r ?y . ?r rdfs:domain ?t}
                    UNION
                    {?y ?r ?x . ?r rdfs:range ?t} }
  • 8. A Naive Approach
      • Experiment with Barack Obama:
          – Person, PersonFunction, Actor, Organization
      • Experiment with Germany:
          – Place, Award, Populated Place, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation
  • 9. A Naive Approach
      • What is going on here?
          – DBpedia data is noisy
          – One wrong statement is enough for a wrong conclusion
          – e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
      • Germany example: 69,000 statements
          – 20 wrong types can come from 20 wrong statements
          – i.e., an error rate of 0.03% is enough for a totally screwed result
          – ...but that would be excellent data quality for a LOD source!
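The naive approach and its sensitivity to noise can be illustrated with a minimal Python sketch. The schema and triples below are toy data invented for illustration (the domain/range pairs are assumptions, not the real DBpedia ontology); only the Kurt_H._Debus statement and the 20 / 69,000 figures come from the slides.

```python
from collections import defaultdict

# Hypothetical toy schema: property -> (rdfs:domain, rdfs:range).
SCHEMA = {
    "award": ("Person", "Award"),
    "birthPlace": ("Person", "Place"),
}

# Toy statements (subject, property, object). The last one mirrors the
# noisy statement quoted on the slide; "Person_A" is made up.
TRIPLES = [
    ("Person_A", "birthPlace", "Germany"),
    ("Kurt_H._Debus", "award", "Germany"),
]


def naive_types(triples, schema):
    """Assign the domain to every subject and the range to every object."""
    inferred = defaultdict(set)
    for subject, prop, obj in triples:
        if prop not in schema:
            continue
        domain, range_ = schema[prop]
        inferred[subject].add(domain)
        inferred[obj].add(range_)
    return inferred


types = naive_types(TRIPLES, SCHEMA)
print(sorted(types["Germany"]))  # ['Award', 'Place']

# 20 wrong statements among 69,000, as on the slide.
error_rate = 20 / 69_000
print(f"{error_rate:.2%}")  # 0.03%
```

A single wrong `award` statement is enough to make Germany an Award, because every statement contributes a type with full certainty.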
  • 10. SDType Approach
      • Idea: outgoing/incoming properties are indicators for a resource's type
          – e.g.: starring → Movie
          – e.g.: author⁻¹ → Writer
      • Basic compiled statistics
          – P(C|p) := probability of class C in presence of property p
          – e.g.: P(dbpedia:Film|starring) = 0.79
          – e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
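One way such statistics could be compiled is sketched below. The slides do not say exactly how P(C|p) is counted, so this sketch assumes it is the share of typed resources using property p that carry class C. All resource names, the `author_inv` label for the inverse property, and the resulting numbers are made up for illustration.

```python
from collections import Counter, defaultdict


def conditional_type_probs(types, property_usage):
    """Estimate P(C|p) from typed resources.

    types: resource -> set of classes (the already typed resources)
    property_usage: iterable of (resource, property) pairs
    """
    resources_by_prop = defaultdict(set)
    for resource, prop in property_usage:
        if resource in types:  # only typed resources can serve as evidence
            resources_by_prop[prop].add(resource)

    probs = {}
    for prop, resources in resources_by_prop.items():
        counts = Counter(c for r in resources for c in types[r])
        probs[prop] = {c: n / len(resources) for c, n in counts.items()}
    return probs


# Hypothetical toy data.
types = {
    "Film_A": {"Film"},
    "Film_B": {"Film"},
    "Film_C": {"Film"},
    "Show_A": {"TelevisionShow"},
    "Writer_A": {"Writer"},
    "Org_A": {"Organisation"},
}
usage = [
    ("Film_A", "starring"), ("Film_B", "starring"),
    ("Film_C", "starring"), ("Show_A", "starring"),
    ("Writer_A", "author_inv"), ("Org_A", "author_inv"),
]

probs = conditional_type_probs(types, usage)
print(probs["starring"]["Film"])      # 0.75
print(probs["author_inv"]["Writer"])  # 0.5
```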
  • 11. SDType Approach
      • Based on precompiled statistics
          – Find types of instances
          – Using voting
      • score(C) = avg(all properties p) P(C|p)
      • Refinement:
          – Weight for properties: discriminative power
          – weight(p) = sum(all classes c) (P(c) - P(c|p))²
          – i.e., how strongly this property's class distribution deviates from the overall class distribution
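The voting step with weights might look like the following sketch. The weight formula follows the slide; combining weights and probabilities as a weighted average normalised by the sum of weights is an assumption here, since the slide only gives the unweighted average. The prior and conditional distributions are toy values (only 0.79 is taken from the slides), and the `label` property is hypothetical.

```python
def property_weight(prior, cond):
    """weight(p) = sum over all classes c of (P(c) - P(c|p))^2."""
    classes = set(prior) | set(cond)
    return sum((prior.get(c, 0.0) - cond.get(c, 0.0)) ** 2 for c in classes)


def score_types(properties, prior, cond_probs):
    """Weighted average of P(C|p) over the properties of one resource."""
    weights = {
        p: property_weight(prior, cond_probs[p])
        for p in properties
        if p in cond_probs
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    scores = {}
    for p, w in weights.items():
        for cls, prob in cond_probs[p].items():
            scores[cls] = scores.get(cls, 0.0) + w * prob / total
    return scores


# Toy statistics: overall class distribution and per-property distributions.
prior = {"Film": 0.10, "Person": 0.30, "TelevisionShow": 0.05}
cond_probs = {
    "starring": {"Film": 0.79, "TelevisionShow": 0.20},  # discriminative
    "label": {"Film": 0.10, "Person": 0.30},             # close to the prior
}

scores = score_types(["starring", "label"], prior, cond_probs)
best = max(scores, key=scores.get)
print(best)
```

Because `label` barely deviates from the overall distribution, its weight is close to zero and the vote is dominated by `starring`.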
  • 12. Evaluation
      • Two-fold evaluation
          – On DBpedia and OpenCyc as "Silver Standard" (automatic, 10,000 random instances)
          – On untyped DBpedia resources (manual, 100 instances)
      • Using only incoming properties
          – Using outgoing properties is trivial!
  • 13. Evaluation Results
      • On DBpedia
      [Figure: precision (y-axis) vs. recall (x-axis) curves for resources with min. 1, min. 10, and min. 25 links]
  • 14. Evaluation Results
      • On OpenCyc
      [Figure: precision (y-axis) vs. recall (x-axis) curves for resources with min. 1, min. 10, and min. 25 links]
  • 15. Evaluation Results
      • Evaluation on untyped resources
          – Random sample of 100 untyped resources
          – Manual checking of precision
      [Figure: precision and # found types (y-axes) plotted against the lower bound for the threshold (x-axis)]
  • 16. Evaluation Results
      • DBpedia:
          – works reasonably well (F-measure 0.89)
      • OpenCyc:
          – harder because of deeper class hierarchy (F-measure 0.60)
      • General:
          – having more links increases precision (in contrast to RDFS reasoning)
          – more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)
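For readers who want to reproduce this kind of evaluation, the sketch below computes precision, recall, and F-measure (their harmonic mean) for a given score threshold against a gold or silver standard. Counting over individual type statements is an assumption; the slides do not state how the figures were aggregated. Resource names and scores are invented.

```python
def evaluate(scores, gold, threshold):
    """Precision, recall, and F-measure over type statements.

    scores: resource -> {class: score}
    gold: resource -> set of correct classes
    """
    tp = fp = fn = 0
    for resource, gold_types in gold.items():
        predicted = {
            cls
            for cls, score in scores.get(resource, {}).items()
            if score >= threshold
        }
        tp += len(predicted & gold_types)
        fp += len(predicted - gold_types)
        fn += len(gold_types - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical scores and gold types.
scores = {
    "r1": {"Band": 0.9, "PunkRockBand": 0.3},
    "r2": {"Person": 0.8, "Band": 0.5},
}
gold = {"r1": {"Band", "PunkRockBand"}, "r2": {"Person"}}

low = evaluate(scores, gold, threshold=0.4)
high = evaluate(scores, gold, threshold=0.6)
print(low)
print(high)
```

Raising the threshold drops the wrong `Band` vote for `r2`, which trades nothing in recall here but lifts precision; sweeping the threshold yields a precision/recall curve.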
  • 17. Deployment
      • Heuristic types have been included in DBpedia 3.9
          – for previously untyped instances
          – 3.4 million type statements at precision ~0.95
      • Also includes many resources without a Wikipedia page
          – i.e., generated from a red link
      • Runtime
          – Complexity O(PT), where P is the number of property assertions and T is the number of type assertions
          – ~24h for processing DBpedia
  • 18. Conclusion and Outlook
      • SDType approach works at high quality
          – outperforms most state of the art on DBpedia
          – deployed for DBpedia 3.9
      • Same approach can be used for
          – validating links
              – within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
              – across datasets: to be done
  • 19. Type Inference on Noisy RDF Data. Heiko Paulheim, Christian Bizer