DBpedia is a large-scale, cross-domain knowledge graph extracted from Wikipedia. For the extraction, crowd-sourced mappings from Wikipedia infoboxes to the DBpedia ontology are utilized. In this process, different problems may arise: users may create wrong and/or inconsistent mappings, use the ontology in an unforeseen way, or change the ontology without considering all possible consequences. In this paper, we present an approach that discovers problems in the mappings as well as in the ontology and its usage in a joint, data-driven process. We show both quantitative and qualitative results about the problems identified, and derive proposals for altering mappings and refactoring the DBpedia ontology.
Data-driven Joint Debugging of the DBpedia Mappings and Ontology
1. 06/01/17 Heiko Paulheim 1
Data-driven Joint Debugging
of the DBpedia Mappings and Ontology
Towards Addressing the Causes
instead of the Symptoms of Data Quality in DBpedia
Heiko Paulheim
Motivation
• Various works on finding errors in Knowledge Graphs
– 2017 survey: 17 approaches
– 15/17 are evaluated on DBpedia
• Question:
– How does DBpedia benefit
from those works?
H. Paulheim: Knowledge Graph Refinement – A Survey of Approaches and Evaluation Methods. SWJ 8(3), 2017
Motivation
• What comes out of those research works
– A list of (possibly) wrong statements
– Source code for finding erroneous statements
– ...
Motivation
• Possible option 1: Remove erroneous triples from DBpedia
• Challenges
– May remove correct axioms, may need thresholding
– Needs to be repeated for each release
– Needs to be materialized on all of DBpedia
[Diagram: Wikipedia → DBpedia Extraction Framework → Post Filter → DBpedia, with mappings from the DBpedia Mappings Wiki]
Motivation
• Possible option 2: Integrate into DBpedia Extraction Framework
• Challenges
– Development workload
– Some approaches are not fully automated (technically or conceptually)
– Scalability
[Diagram: Wikipedia → DBpedia Extraction Framework plus filter module → DBpedia, with mappings from the DBpedia Mappings Wiki]
Motivation
• Scalability analyzed: 6/15 approaches
– Disclaimer: this does not imply that they are actually scalable!
Motivation
• Do we have a third option?
– Paulheim & Gangemi (2015): >95% of all inconsistencies in DBpedia
boil down to 40 common root causes
[Diagram: Wikipedia + DBpedia Mappings Wiki → DBpedia Extraction Framework → Inconsistency Detection → Identification of suspicious mappings and ontology constructs]
H. Paulheim, A. Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top (ISWC 2015)
Disclaimer: not equivalent to “wrong statements”
Approach
• Find inconsistencies in extracted statements
– Using DBpedia and DOLCE as top level ontology
• Trace them back to mappings
– In the example, there are three candidates
• Property mapping to the predicate dbo:operator
• Class mapping (subject) to dbo:Airport
• Class mapping (object) to dbo:Settlement
• Unfortunately, provenance information for DBpedia
is not that fine-grained
– i.e., we do not know which mapping was responsible for which
statement in the end
– first step: heuristic reconstruction
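The tracing step above can be sketched as a small helper that, given an inconsistent statement and the (heuristically reconstructed) mapping context, enumerates the candidate mapping elements. The triple and type names below follow the airport/operator example from the slides, but the function itself is a hypothetical illustration, not part of the described system:

```python
def candidate_mapping_elements(triple, subject_type, object_type):
    """Enumerate the mapping elements that may have caused an
    inconsistent triple: the property mapping and the two class
    mappings (subject and object), as named on this slide."""
    s, p, o = triple
    return [
        ("property mapping", p),
        ("class mapping (subject)", subject_type),
        ("class mapping (object)", object_type),
    ]
```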
Approach: Identifying Mapping Elements
• We use the RML representation of the Mapping Wiki contents [1]
https://www.w3.org/TR/r2rml/
• Mapping elements identified: the Wikipedia page / DBpedia resource link, the mapped DBpedia ontology class, and the mapped DBpedia ontology property
[1] Dimou et al.: DBpedia Mappings Quality Assessment (ISWC Poster 2016)
Approach (ctd.)
• After we have heuristically reconstructed the mappings, we can determine for each mapping element m
– How often is it involved in an inconsistency? (counter im)
– How often is it used, but not involved in an inconsistency? (counter cm)
Approach (ctd.)
• Using the two counters cm (uses of m not involved in an inconsistency) and im (uses of m involved in an inconsistency), we can compute two scores for the hypothesis that m is problematic
• Borrowed from Association Rule Mining (support and confidence):
support(m) = im / N
confidence(m) = im / (cm + im)
• N is the total number of statements in DBpedia
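Under these definitions, the two scores can be sketched as follows (a minimal illustration; the counter values in the usage line are hypothetical, not real DBpedia counts):

```python
def scores(i_m, c_m, n_total):
    """Support and confidence for the hypothesis that mapping element m
    is problematic, in the association-rule style defined on this slide:
    i_m = uses of m involved in an inconsistency,
    c_m = uses of m not involved in an inconsistency,
    n_total = total number of statements in DBpedia."""
    support = i_m / n_total           # fraction of all statements affected
    confidence = i_m / (i_m + c_m)    # fraction of m's uses that are inconsistent
    return support, confidence

# Hypothetical counts for one mapping element:
s, c = scores(i_m=12_000, c_m=27_000, n_total=400_000_000)
```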
Identifying Interesting Problems
• Hypothesis: high support and high confidence mapping elements
hint at problems worth investigating
– High support: fixing the issue would fix a lot of individual statements
– High confidence: this mapping element actually hints at the root cause
• i.e., fixing this does not break many other things
• Unfortunately, the two scores come at very different scales
– Difficult to combine via average, harmonic mean, or the like
– Support: μ = 0.0002, σ = 0.003
– Confidence: μ = 0.114, σ = 0.260
• Fix: use logarithmic support instead
– LogSupport: μ = 0.179, σ = 0.139
Identifying Interesting Problems (ctd.)
• Inspect mappings that have a high harmonic mean of
confidence and log support
[Plot: mapping elements by confidence and log support, with harmonic-mean contours at 0.25, 0.5, 0.75; higher means more interesting]
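The ranking can be sketched as below. Note that the slides only state that logarithmic support is used to bring support onto a scale comparable to confidence; the exact normalization of the log here is an assumption for illustration:

```python
import math

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def interestingness(i_m, c_m, n_total):
    """Rank a mapping element by the harmonic mean of confidence and
    log support. The log normalization into [0, 1] (relative to the
    total statement count) is assumed, not taken from the slides."""
    confidence = i_m / (i_m + c_m) if i_m + c_m > 0 else 0.0
    log_sup = math.log(i_m + 1) / math.log(n_total + 1)
    return harmonic_mean(confidence, log_sup)
```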
Example Findings
• Case 1: Mapping to wrong property
• Example:
– branch in infobox military unit
is mapped to dbo:militaryBranch
• but dbo:militaryBranch
has dbo:Person as its domain
– correction: dbo:commandStructure
– Overall score: 0.721
– Affects 12,172 statements
(31% of all dbo:militaryBranch)
Example Findings
• Case 2: Mappings that should be removed
• Example:
– dbo:picture
– Most of these are inconsistent (64.5% places, 23.0% persons)
– Reason: statements are extracted from picture caption
dbo:Brixton_Academy dbo:picture dbo:Brixton .
dbo:Justify_My_Love dbo:picture dbo:Madonna_(entertainer) .
Example Findings
• Case 3: Ontology problems (domain/range)
• Example 1:
– Populated places (e.g., cities) are used both as place and organization
– For some properties, the range is either one of the two
• e.g., dbo:operator (see introductory example)
– Polysemy should be reflected in the ontology
• Example 2:
– dbo:architect, dbo:designer, dbo:engineer etc.
have dbo:Person as their range
– Significant fractions (8.6%, 7.6%, 58.4%, resp.)
have a dbo:Organization as object
– Range should be broadened
Example Findings
• Case 4: Missing properties
• Example 1:
– dbo:president links an organization to its president
– Majority use (8,354, or 76.2%):
link a person to the president s/he served for
• Example 2:
– dbo:instrument links an artist
to the instrument s/he plays
– Prominent alternative use (3,828, or 7.2%):
links a genre to its characteristic instrument
(Obama example alert!)
Future Work
• Classify ontology, mapping, and other errors automatically
– Currently ongoing: using different language editions of DBpedia
• Heuristic:
– Problem present in many languages → ontology problem
– Problem present in only one language → mapping problem
• From post-processing to live processing
– e.g., on-the-fly validation in DBpedia Mappings Wiki
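The cross-lingual heuristic above can be sketched as a tiny classifier. The cutoff for "many languages" is an assumption for illustration; the slides give no concrete threshold:

```python
def classify_problem(languages_with_problem, many_threshold=3):
    """Classify a detected problem by how many DBpedia language
    editions exhibit it: one edition points at that edition's
    mappings, many editions point at the shared ontology.
    The threshold of 3 is an assumed value, not from the slides."""
    n = len(set(languages_with_problem))
    if n == 1:
        return "mapping problem"
    if n >= many_threshold:
        return "ontology problem"
    return "unclear"
```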
Take Aways
• Fixing bugs in knowledge graphs is nice
– But often a one-time solution
– Preserving the efforts is hard
• Proposed solution
– Identify and address the root problem
– Scoring mechanism helps
identifying interesting problems
– Preserving the efforts by eliminating
the root causes
• Provenance matters!
– The more we know about how a statement
gets into a knowledge graph
– The better we can automate the error analysis