Using SPARQL and SPIN for Data Quality Management on the Semantic Web
1. Using SPARQL and SPIN for Data Quality Management on the Semantic Web
Christian Fürber / Martin Hepp
christian@fuerber.com, mhepp@computer.org
Presentation @ BIS
May 4th 2010
3. Growth of Data: Well on Track…
Retrieving information
Building smart SemWeb apps
Reference: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html
C. Fürber, M. Hepp: Using SPARQL and SPIN for Data
Quality Management on the Semantic Web 3
4. …but what if the published data was of poor quality?
Get a giant camcorder from Amazon!
5. Using Poor Data is Costly
Without quality checks, your SemWeb apps will take this data seriously and…
…get an oversized shipping package with expensive postage,
…and waste transportation capacity.
6. Is there any way to avoid data quality disasters?
Yes, if we know about data quality problems before anything bad happens!
A giant camcorder on the road!
7. The Impact of Poor Data Quality
• Higher Costs
• Missed Revenues
• Poor Decisions
• Lower Product / Service Quality
• Failed Business Processes
• Failed Projects
• Lower Stakeholder Satisfaction
• Fatal Disasters
8. Data Quality is a Key Bottleneck of the Semantic Web
<vocab:location rdf:about="http://www.stockdbdemo2.com/stockdblocation/1">
    <vocab:location_ZIP></vocab:location_ZIP>
    <vocab:location_STREETNO></vocab:location_STREETNO>
    <vocab:location_COUNTRY>France</vocab:location_COUNTRY>
    <vocab:location_ID rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</vocab:location_ID>
    <vocab:location_STREET>8489 Strong St.</vocab:location_STREET>
    <vocab:location_STATE>NV</vocab:location_STATE>
    <rdfs:label>location #1</rdfs:label>
    <vocab:location_CITY>Las Vegas</vocab:location_CITY>
</vocab:location>
Problem types annotated on the slide: unique value violation, missing literal values, functional dependency violation, syntax violation.
9. Our Approach
<vocab:location rdf:about="http://www.stockdbdemo2.com/stockdblocation/1">
    <vocab:location_ZIP></vocab:location_ZIP>
    <vocab:location_STREETNO></vocab:location_STREETNO>
    <vocab:location_COUNTRY>France</vocab:location_COUNTRY>
    <vocab:location_ID rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</vocab:location_ID>
    <vocab:location_STREET>8489 Strong St.</vocab:location_STREET>
    <vocab:location_STATE>NV</vocab:location_STATE>
    <rdfs:label>location #1</rdfs:label>
    <vocab:location_CITY>Las Vegas</vocab:location_CITY>
</vocab:location>
Identification of data quality problems on the instance level of Semantic Web sources, solely with Semantic Web technologies.
Integration advantages
Access to SemWeb data may be useful for data quality management (DQM).
10. Proposed Architecture
[Architecture diagram, labeled OBDQM, with three layers:]
Query Layer: SPARQL + SPIN
Ontology Layer: Domain Ontology, SPIN
Data Sources Layer: RDB, Knowledge Base, Linked Data Cloud
11. Defining Data Quality Rules with SPARQL (1)
Define what is allowed and negate it.
Define what is not allowed.
Negations and regular expressions save manual effort.
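The two rule styles above can be sketched outside of SPARQL as well. The following is a minimal plain-Python illustration, not the authors' implementation: the triples, the allowed pattern, and the forbidden-value set are all hypothetical sample data chosen for the example.

```python
import re

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("loc:1", "vocab:location_CITY", "Las Vegas"),
    ("loc:2", "vocab:location_CITY", "Par1s"),
    ("loc:3", "vocab:location_CITY", "N/A"),
]

# Style 1: define what is allowed, then negate it.
# Assumed pattern: letters, dots and spaces only.
ALLOWED = re.compile(r"[A-Za-z. ]+")

def violates_allowed_pattern(value):
    """True if the value does NOT match the allowed pattern."""
    return ALLOWED.fullmatch(value) is None

# Style 2: define what is not allowed.
# Assumed blacklist of placeholder values.
FORBIDDEN = {"", "N/A", "unknown"}

def is_forbidden_value(value):
    """True if the value is one of the explicitly forbidden values."""
    return value in FORBIDDEN

def find_violations(triples, predicate, is_violation):
    """Return the subjects whose value for `predicate` breaks the rule."""
    return [s for s, p, o in triples if p == predicate and is_violation(o)]

by_negation = find_violations(TRIPLES, "vocab:location_CITY", violates_allowed_pattern)
by_blacklist = find_violations(TRIPLES, "vocab:location_CITY", is_forbidden_value)
```

The negated whitelist catches anything unexpected with a single pattern, while the blacklist only catches the values someone thought to enumerate, which is one way to read the remark that negations and regular expressions save manual effort.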
12. Defining Data Quality Rules with SPARQL (2)
The city "Las Vegas" must be in the country "USA".
# Checking functional dependency of {?arg4} with {?arg2}
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?this .
    _:b0 spin:violationPath vocab:location_COUNTRY .
}
WHERE {
    ?this vocab:location_CITY "Las Vegas" .
    FILTER (!spl:hasValue(?this, vocab:location_COUNTRY, "USA")) .
}
13. Defining Data Quality Rules with SPARQL (3)
High reusability of data quality rules through SPIN's SPARQL query templates.
# Checking functional dependency of {?arg4} with {?arg2}
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?this .
    _:b0 spin:violationPath ?arg3 .
}
WHERE {
    ?this ?arg1 ?arg2 .
    FILTER (!spl:hasValue(?this, ?arg3, ?arg4)) .
}
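To make the idea of a reusable template concrete, here is a rough plain-Python analogue of the functional dependency template. It is only a sketch under stated assumptions: triples are modeled as tuples, the function name `functional_dependency_template` and the sample data are hypothetical, and the result dictionaries merely mirror the `spin:violationRoot` and `spin:violationPath` properties rather than producing real SPIN constraint violations.

```python
def functional_dependency_template(arg1, arg2, arg3, arg4):
    """Build a check: every ?this with arg1 = arg2 must also have arg3 = arg4."""
    def check(triples):
        facts = set(triples)
        violations = []
        for s, p, o in triples:
            # Mirrors: ?this ?arg1 ?arg2 . FILTER (!hasValue(?this, ?arg3, ?arg4))
            if p == arg1 and o == arg2 and (s, arg3, arg4) not in facts:
                violations.append({"violationRoot": s, "violationPath": arg3})
        return violations
    return check

# Instantiate the template once for the rule from the previous slide.
las_vegas_in_usa = functional_dependency_template(
    "vocab:location_CITY", "Las Vegas",
    "vocab:location_COUNTRY", "USA",
)

# Hypothetical sample data: loc:1 repeats the France / Las Vegas conflict.
TRIPLES = [
    ("loc:1", "vocab:location_CITY", "Las Vegas"),
    ("loc:1", "vocab:location_COUNTRY", "France"),
    ("loc:2", "vocab:location_CITY", "Las Vegas"),
    ("loc:2", "vocab:location_COUNTRY", "USA"),
]

violations = las_vegas_in_usa(TRIPLES)
```

The same template can be instantiated again with other argument values, which is the reuse the slide refers to.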
14. Enforced DQ-Rules with SPIN
Application: http://www.topquadrant.com/products/TB_Composer.html#free
15. More Data Quality Rule Templates (1)
Data quality problem: Missing literal values
SPARQL query template:
ASK WHERE {
    ?this ?arg1 "" .
}
Data quality problem: Out of range value (lower limit)
SPARQL query template:
ASK WHERE {
    ?this ?arg1 ?value .
    FILTER (?value < ?arg2) .
}
Data quality problem: Out of range value (upper limit)
SPARQL query template:
ASK WHERE {
    ?this ?arg1 ?value .
    FILTER (?value > ?arg2) .
}
[Diagram: Global Ontology, RDB, RDB, Knowledge Base]
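The three templates on this slide can be approximated in plain Python as follows. This is a sketch only: the `ex:price` predicate, the limits, and the sample triples are hypothetical, and where each ASK query answers true or false for one `?this`, the functions here return the list of all violating subjects.

```python
def missing_literal(triples, arg1):
    """Subjects whose value for arg1 is the empty string."""
    return [s for s, p, o in triples if p == arg1 and o == ""]

def below_lower_limit(triples, arg1, arg2):
    """Subjects whose value for arg1 is smaller than the lower limit arg2."""
    return [s for s, p, o in triples if p == arg1 and o < arg2]

def above_upper_limit(triples, arg1, arg2):
    """Subjects whose value for arg1 is larger than the upper limit arg2."""
    return [s for s, p, o in triples if p == arg1 and o > arg2]

# Hypothetical sample data; ex:price is an invented numeric property.
TRIPLES = [
    ("loc:1", "vocab:location_ZIP", ""),
    ("loc:2", "vocab:location_ZIP", "89101"),
    ("item:1", "ex:price", -5),
    ("item:2", "ex:price", 20),
    ("item:3", "ex:price", 5000),
]

no_zip = missing_literal(TRIPLES, "vocab:location_ZIP")
too_low = below_lower_limit(TRIPLES, "ex:price", 0)
too_high = above_upper_limit(TRIPLES, "ex:price", 1000)
```

The predicate test runs before the comparison, so the range checks only ever compare values of the chosen property.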
16. More Data Quality Rule Templates (2)
Data quality problem: Syntax violation (only letters and dots allowed)
SPARQL query template:
ASK WHERE {
    ?this ?arg1 ?value .
    FILTER (!regex(str(?value), "^([A-Za-z,. ])*$"))
}
Data quality problem: Unique value violation
SPARQL query template:
CONSTRUCT {
    _:b0 a spin:ConstraintViolation .
    _:b0 spin:violationRoot ?a .
    _:b0 spin:violationPath ?arg1 .
}
WHERE {
    ?a ?arg1 ?uniqueValue .
    ?b ?arg1 ?uniqueValue .
    FILTER (?a != ?b)
}
[Diagram: Global Ontology, RDB, RDB, Knowledge Base]
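A plain-Python approximation of the syntax and unique value checks might look like the sketch below. The sample triples are hypothetical, the regular expression is copied from the slide, and Python's `re` module is assumed to be close enough to SPARQL's `regex` for this simple pattern.

```python
import re
from collections import defaultdict

# Pattern taken from the slide's syntax violation template.
ALLOWED_CHARS = re.compile(r"^([A-Za-z,. ])*$")

def syntax_violations(triples, arg1, pattern=ALLOWED_CHARS):
    """Subjects whose value for arg1 does not match the allowed pattern."""
    return [s for s, p, o in triples if p == arg1 and not pattern.search(str(o))]

def unique_value_violations(triples, arg1):
    """Subjects that share a value for arg1 with at least one other subject."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p == arg1:
            subjects_by_value[o].add(s)
    return sorted(
        s
        for subjects in subjects_by_value.values()
        if len(subjects) > 1
        for s in subjects
    )

# Hypothetical sample data: two locations share the same ID.
TRIPLES = [
    ("loc:1", "vocab:location_CITY", "Las Vegas"),
    ("loc:2", "vocab:location_CITY", "Las Vegas 2"),
    ("loc:1", "vocab:location_ID", 1),
    ("loc:2", "vocab:location_ID", 1),
    ("loc:3", "vocab:location_ID", 2),
]

bad_syntax = syntax_violations(TRIPLES, "vocab:location_CITY")
duplicate_ids = unique_value_violations(TRIPLES, "vocab:location_ID")
```

Grouping subjects by value avoids comparing every pair of instances, whereas the SPARQL template expresses the same condition as a self-join with `FILTER (?a != ?b)`.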
17. Contributions
• Domain-independent SPARQL query templates for data quality problem identification
• Queries are highly reusable
• Architecture enables the use of Linked Data
• Methodology for data quality management of Semantic Web data
• First approach on how to apply SPIN for DQM
18. Limitations & Open Issues
• Knowing the problem does not mean we can solve it
• Homonym / Synonym handling
• Incomplete knowledge may cause constraint violations of clean instances
• Current approach focuses on literal values
• Scalability on large data sets
19. Ongoing Extensions
• Extension to a broader set of data quality problems
• Enabling synonym handling and homonym tolerance
• Enhancement of performance
• Calculation of information quality scores
• Integration of Linked Data as trusted reference for data quality management
• Evaluation of the quality of popular Semantic Web data sets on instance level (e.g., GeoNames and DBpedia)
• Extension for (semi-)automated data cleansing
20. Christian Fuerber
Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
email christian@fuerber.com
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
Paper is available at http://bit.ly/bYes0V
21. References & Links
LOD-Cloud:
http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html
D2RQ:
http://www4.wiwiss.fu-berlin.de/bizer/d2rq/spec/
SPIN:
http://spinrdf.org/
TopBraid Composer Free Edition:
http://www.topquadrant.com/products/TB_Composer.html#free