Presentation of the paper "LOP – Capturing and Linking Open Provenance on LOD Cycle" at the 5th International Workshop on Semantic Web Information Management (SWIM 2013). New York, USA – June 23, 2013
LOP – Capturing and Linking Open Provenance on LOD Cycle
1. >LOP – Capturing and Linking Open Provenance on LOD Cycle
Rogers R. de Mendonça, Jonas F. S. M. De La Cerda, Kelli F. de Cordeiro
Sérgio M. S. da Cruz, Maria Cláudia Cavalcanti, Maria Luiza M. Campos
5th International Workshop on
Semantic Web Information Management
SWIM 2013
New York, USA – June 23, 2013
2. >Outline
Introduction
– Provenance
– Linked Open Data Lifecycle
An Approach for Linked Open Provenance Capture
– Data Preparation and Transformation Process
– Data Interlinking Process
– Linked Open Provenance Architecture
– Usage Scenario
Conclusion
– Contributions
– Future Works
3. >Increase of the Web of Data
What about data reliability and quality?
4. >Provenance
Information about the history of the data:
– Where did the data come from?
– Who designed the publishing process?
– Who executed the publishing process?
– Which operations were applied to the data?
Importance to the Web of Data:
– Support quality and reliability assessment of the published data
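The four questions above can be read as the fields of a provenance record. The following is a minimal sketch under that reading; the class name, field names and the placeholder identities are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical record; one field per provenance question."""
    source: str        # where the data came from
    designed_by: str   # who designed the publishing process
    executed_by: str   # who executed the publishing process
    operations: List[str] = field(default_factory=list)  # operations applied

    def describe(self) -> str:
        # Render the record as a single human-readable line.
        ops = ", ".join(self.operations) or "none"
        return (f"source={self.source}; designer={self.designed_by}; "
                f"executor={self.executed_by}; operations={ops}")


record = ProvenanceRecord(
    source="http://lattes.cnpq.br",
    designed_by="designer-a",   # placeholder identity
    executed_by="operator-b",   # placeholder identity
    operations=["extract", "clean", "triplify"],
)
print(record.describe())
```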
6. >Linked Open Provenance (LOP)
Provenance data available according to LOD principles:
1. Use URIs as names for things
2. Use HTTP URIs, so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things
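As a rough illustration of these principles, provenance statements can be written as RDF triples whose names are HTTP URIs. The sketch below builds two N-Triples lines by hand. The base URI under example.org is a placeholder, and attaching dcterms:title directly to the process is a simplification chosen for brevity.

```python
# Placeholder base URI (assumption); OPMV and DCTERMS are the public namespaces.
BASE = "http://example.org/provenance/"
OPMV = "http://purl.org/net/opmv/ns#"
DCTERMS = "http://purl.org/dc/terms/"


def uri(term: str) -> str:
    """Wrap an absolute URI in N-Triples angle brackets."""
    return f"<{term}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


row = uri(BASE + "row/1")
process = uri(BASE + "process/1")
triples = [
    # Principle 4: the row links to the process that generated it.
    triple(row, uri(OPMV + "wasGeneratedBy"), process),
    triple(process, uri(DCTERMS + "title"),
           literal("Merge CNPq Research Groups x Lattes Projects")),
]
print("\n".join(triples))
```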
8. >Related Works
Use of provenance to support quality and reliability
assessment of published data
– Provenance Information in the Web of Data (HARTIG,
2009)
– Managing the life-cycle of linked data with the LOD2 stack (AUER et al, 2012)
– Linked Data Quality Assessment and Fusion
(MENDES et al, 2012)
Focus on metadata about the source and access of the
data
17. >Data Preparation and Transformation Process
[Diagram labels: Heterogeneous Data Sources; Data Preparation and Transformation Process with steps Triplify, Extract, Clean, Conform, Pre-Integrate]
ETL (Extraction-Transformation-Loading) approach:
– Foundation of DW systems
– Its techniques and tools have been developed and refined over many years in challenging BI scenarios
– It is very advantageous to inherit the potential of these techniques and tools to publish LOD and LOP
18. >Data Preparation and Transformation Process
[Diagram labels: Heterogeneous Data Sources; Data Preparation and Transformation Process with steps Triplify, Extract, Clean, Conform, Pre-Integrate]
Use of a workflow to have:
– Systematization of the publishing process
– Monitoring and management of the several tasks
– Facilities for reusing the process
Pentaho Data Integration (a.k.a. Kettle)
– Open source, large community of users, extensible
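To make the workflow idea concrete, here is a minimal sketch of a publishing process expressed as an ordered list of reusable steps, with a log that supports monitoring. It is plain Python, not the Kettle API; the step bodies, the `run_workflow` helper and the sample rows are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def clean(rows: List[Row]) -> List[Row]:
    """Strip whitespace and drop rows that have no name."""
    cleaned = [{k: v.strip() for k, v in r.items()} for r in rows]
    return [r for r in cleaned if r.get("name")]


def conform(rows: List[Row]) -> List[Row]:
    """Normalize names to title case."""
    return [{**r, "name": r["name"].title()} for r in rows]


def run_workflow(rows: List[Row], steps: List[Step]) -> Tuple[List[Row], List[str]]:
    """Run the steps in order and log the row count after each one."""
    log = []
    for step in steps:
        rows = step(rows)
        log.append(f"{step.__name__}: {len(rows)} rows")
    return rows, log


raw = [{"name": "  maria silva "}, {"name": ""}, {"name": "JOAO SOUZA"}]
result, log = run_workflow(raw, [clean, conform])
print(result)
print(log)
```

Because the process is just a list of steps, the same list can be rerun on another source, which is the reuse benefit named above.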
21. >Data Interlinking Process
[Diagram: Data Interlinking Process with components Web Data Access, Schema Mappings, Identity Resolution, Quality Evaluator]
Web Data Access: extracts data from its original sources
22. >Data Interlinking Process
[Diagram: Data Interlinking Process with components Web Data Access, Schema Mappings, Identity Resolution, Quality Evaluator]
Schema Mappings: matches corresponding terms of multiple vocabularies
23. >Data Interlinking Process
[Diagram: Data Interlinking Process with components Web Data Access, Schema Mappings, Identity Resolution, Quality Evaluator]
Identity Resolution: finds and links similar resources on different datasets
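A minimal sketch of finding and linking similar resources, assuming that similarity is measured on labels with a string-matching ratio and that a link is emitted above a threshold. The 0.85 threshold, the `owl:sameAs` link type and the example.org URIs are assumptions; the slides do not say which technique or parameters were used.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two labels (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_resources(left: Dict[str, str], right: Dict[str, str],
                   threshold: float = 0.85) -> List[Tuple[str, str, str]]:
    """Compare every label pair and emit a link when similarity >= threshold."""
    links = []
    for l_uri, l_label in left.items():
        for r_uri, r_label in right.items():
            if similarity(l_label, r_label) >= threshold:
                links.append((l_uri, "owl:sameAs", r_uri))
    return links


left = {"http://example.org/a/1": "Knowledge Engineering Group"}
right = {
    "http://example.org/b/9": "knowledge engineering group",
    "http://example.org/b/10": "Complex Social Networks",
}
links = link_resources(left, right)
print(links)
```

The threshold is exactly the kind of parameter value that would be worth recording as provenance for each generated link.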
24. >Data Interlinking Process
[Diagram: Data Interlinking Process with components Web Data Access, Schema Mappings, Identity Resolution, Quality Evaluator]
Quality Evaluator: evaluates data quality based on a set of rules
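Rule-based quality evaluation can be sketched as a set of named predicates applied to a resource. The two rules, the scoring formula and the sample resource below are assumptions made for illustration, not the rules used in the paper.

```python
from typing import Callable, Dict, List, Tuple

Resource = Dict[str, str]
Rule = Callable[[Resource], bool]


def evaluate_quality(resource: Resource,
                     rules: Dict[str, Rule]) -> Tuple[float, List[str]]:
    """Return the fraction of rules passed and the names of the failed rules."""
    failed = [name for name, rule in rules.items() if not rule(resource)]
    score = 1 - len(failed) / len(rules)
    return score, failed


rules = {
    "has_name": lambda r: bool(r.get("name")),
    "has_http_uri": lambda r: r.get("uri", "").startswith("http://"),
}
resource = {"uri": "http://example.org/group/1", "name": ""}
score, failed = evaluate_quality(resource, rules)
print(score, failed)
```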
25. >Provenance Opportunity
[Diagram: Data Preparation and Transformation Process (Heterogeneous Data Sources; Triplify, Extract, Clean, Conform, Pre-Integrate) and Data Interlinking Process (Web Data Access, Schema Mappings, Identity Resolution, Quality Evaluator)]
All steps need heavy parameterization and produce a lot of results
– Employed parameter values and techniques as well as results obtained are all provenance data
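One way to picture this opportunity is a wrapper that records the parameter values and result sizes of each step as it runs. The sketch below uses a decorator; the log structure, the timestamp field and the `clean` step are assumptions for illustration and are not the Provenance Gathering Agent itself.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # in-memory stand-in for a provenance store (assumption)


def capture_provenance(step):
    """Record step name, parameters and input/output sizes on every call."""
    @functools.wraps(step)
    def wrapper(data, **params):
        started = datetime.now(timezone.utc).isoformat()
        result = step(data, **params)
        PROVENANCE_LOG.append({
            "step": step.__name__,
            "parameters": dict(params),
            "input_count": len(data),
            "output_count": len(result),
            "started_at": started,
        })
        return result
    return wrapper


@capture_provenance
def clean(rows, min_length=1):
    """Keep only values with at least min_length non-blank characters."""
    return [r for r in rows if len(r.strip()) >= min_length]


cleaned = clean(["GRECO", " ", "ab"], min_length=3)
print(cleaned)
print(PROVENANCE_LOG[0]["parameters"])
```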
28. >Implementation of PGA
[Diagram: Provenance Gathering Agent, RDF Triple, Triple Store, Provenance Data, Staging Database]
29. >Implementation of PGA
[Diagram: RDF Triple, Triple Store, Provenance Data, Staging Database]
The PGA wraps the ETL process and stores provenance in data staging tables to be further extracted, triplified and loaded to the triple store by other specific steps, developed through Kettle API and Linked Open Data frameworks
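The staging-then-triplify flow can be sketched with an in-memory SQLite table standing in for the staging database. The table layout, the example.org base URI and the predicate naming are assumptions; the actual implementation relies on Kettle steps and Linked Open Data frameworks, which are not shown here.

```python
import sqlite3

BASE = "http://example.org/provenance/"  # placeholder base URI (assumption)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_provenance ("
    "id INTEGER PRIMARY KEY, step TEXT NOT NULL, "
    "parameter TEXT NOT NULL, value TEXT NOT NULL)"
)


def stage(step: str, parameter: str, value: str) -> None:
    """Store one captured parameter value in the staging table."""
    conn.execute(
        "INSERT INTO staging_provenance (step, parameter, value) VALUES (?, ?, ?)",
        (step, parameter, value),
    )


def triplify() -> list:
    """Read staged rows and emit N-Triples-style lines.

    Values are assumed to contain no quotes or backslashes.
    """
    rows = conn.execute(
        "SELECT id, step, parameter, value FROM staging_provenance ORDER BY id"
    )
    triples = []
    for row_id, step, parameter, value in rows:
        subject = f"<{BASE}execution/{row_id}>"
        triples.append(f'{subject} <{BASE}step> "{step}" .')
        triples.append(f'{subject} <{BASE}{parameter}> "{value}" .')
    return triples


stage("identity_resolution", "threshold", "0.85")
triples = triplify()
print("\n".join(triples))
```

Loading the resulting lines into a triple store is left out, since it depends on the store being used.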
30. >Implementation of PGA
[Diagram: Web Data Access, Schema Mappings, Identity Resolution]
The Provenance Gathering Agent was implemented as a web service written in Scala (www.scala-lang.org)
32. >Use Case Scenario
CNPq = Brazilian governmental organization
responsible for fostering scientific research
RNP = Brazilian governmental organization
that finances research projects
35. >Querying Linked Open Provenance
Gets research groups, projects and researchers from the data graphs of the domain dataset
SELECT ?group_name ?project_name ?researcher_uri ?process_name
FROM NAMED <http://linkgraph.provenance.br>
FROM NAMED <http://datagraph.provenance.br>
FROM NAMED <http://www.cnpq.br>
FROM NAMED <http://lattes.cnpq.br>
WHERE
{
  GRAPH <http://linkgraph.provenance.br> {
    ?row_uri provprop:cnpqResearchGroup ?group_uri .
    ?row_uri provprop:lattesProject ?project_uri .
    ?row_uri provprop:lattesResearcher ?researcher_uri . }
  GRAPH <http://datagraph.provenance.br> {
    ?row_uri opmv:wasGeneratedBy ?process_uri .
    ?process_uri provprop:composition ?process_def_uri .
    ?process_def_uri dcterms:title ?process_name . }
  GRAPH <http://www.cnpq.br> {
    ?group_uri cnpq:project ?project_uri .
    ?group_uri foaf:name ?group_name . }
  GRAPH <http://lattes.cnpq.br> {
    ?project_uri foaf:name ?project_name .
    ?researcher_uri foaf:name ?researcher_name . }
}
Data that were in different data sources of the CNPq organization are now integrated in the Web of Data.
36. >Querying Linked Open Provenance
Also gets the integration process from the provenance graphs of the Linked Open Provenance dataset
SELECT ?group_name ?project_name ?researcher_uri ?process_name
FROM NAMED <http://linkgraph.provenance.br>
FROM NAMED <http://datagraph.provenance.br>
FROM NAMED <http://www.cnpq.br>
FROM NAMED <http://lattes.cnpq.br>
WHERE
{
  GRAPH <http://linkgraph.provenance.br> {
    ?row_uri provprop:cnpqResearchGroup ?group_uri .
    ?row_uri provprop:lattesProject ?project_uri .
    ?row_uri provprop:lattesResearcher ?researcher_uri . }
  GRAPH <http://datagraph.provenance.br> {
    ?row_uri opmv:wasGeneratedBy ?process_uri .
    ?process_uri provprop:composition ?process_def_uri .
    ?process_def_uri dcterms:title ?process_name . }
  GRAPH <http://www.cnpq.br> {
    ?group_uri cnpq:project ?project_uri .
    ?group_uri foaf:name ?group_name . }
  GRAPH <http://lattes.cnpq.br> {
    ?project_uri foaf:name ?project_name .
    ?researcher_uri foaf:name ?researcher_name . }
}
37. >Querying Linked Open Provenance
Query result (columns: group_name, project_name, researcher_uri, process_name)
Row 1
– group_name: "GRECO - Grupo Engenharia do Conhecimento"@pt
– project_name: "LinkedDataBR - Exposição, compartilhamento e conexão de recursos de dados abertos na Web (Linked Open Data)"@pt
– researcher_uri: http://lattes.cnpq.br/resource/Researcher/K4781460T3
– process_name: "Merge CNPq Research Groups x Lattes Projects"
Row 2
– group_name: "GRECO - Grupo Engenharia do Conhecimento"@pt
– project_name: "Núcleo de Pesquisa de Sistemas Computacionais Complexos para a Gestão de Emergências"@pt
– researcher_uri: http://lattes.cnpq.br/resource/Researcher/K4717449A7
– process_name: "Merge CNPq Research Groups x Lattes Projects"
Row 3
– group_name: "GRECO - Grupo Engenharia do Conhecimento"@pt
– project_name: "Identificação e Análise de Redes Sociais Complexas"@pt
– researcher_uri: http://lattes.cnpq.br/resource/Researcher/K4761314U5
– process_name: "Merge CNPq Research Groups x Lattes Projects"
40. >Use Case Scenario – Provenance Evaluation
At the end of the execution of both processes, a SPARQL query could be used to ask: “On which projects does a researcher work?”
The result would include projects declared in the CNPq dataset and in the RNP dataset
If the projects returned by CNPq diverge from those returned by RNP, it is possible to investigate the cause by querying and evaluating LOP data
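The divergence check itself amounts to a set comparison between the projects each dataset returns for the same researcher. The sketch below assumes the two result sets have already been fetched; the function name and the project identifiers are placeholders, not real data.

```python
from typing import Dict, List, Set


def diverging_projects(cnpq_projects: Set[str],
                       rnp_projects: Set[str]) -> Dict[str, List[str]]:
    """List the projects that appear in only one of the two datasets."""
    return {
        "only_in_cnpq": sorted(cnpq_projects - rnp_projects),
        "only_in_rnp": sorted(rnp_projects - cnpq_projects),
    }


# Placeholder project identifiers for one researcher (assumption).
cnpq = {"project-a", "project-b"}
rnp = {"project-b", "project-c"}
diff = diverging_projects(cnpq, rnp)
print(diff)
```

Each project that shows up on only one side is a candidate for a follow-up provenance query, for example to see which process and parameter values produced the corresponding row.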
41. >Conclusion - Contributions
New strategy to provide provenance for data and links of the Web of Data
The LOD cycle is extended with a systematic data preparation and transformation process, supported by an ETL workflow framework
Provenance data is available according to LOD principles (Linked Open Provenance)
42. >Conclusion – Future works
Development of provenance query interface
– Take advantage of LOP and support its exploration
Development / evolution of a provenance ontology
– Today, we are using a combination of vocabularies
Investigation in the area of Big Data
– Fine-grained provenance generates large volumes of
data
43. >Thank You !
LOP – Capturing and Linking Open
Provenance on LOD Cycle
Rogers R. de Mendonça 1
rogers@ufrj.br
Jonas F. S. M. De La Cerda 2
jonas.ferreira@uniriotec.br
Kelli F. de Cordeiro 1
kelli@ufrj.br
Sérgio M. S. da Cruz 3
serra@ufrrj.br
Maria Cláudia Cavalcanti 2
yoko@ime.eb.br
Maria Luiza M. Campos 1
mluiza@ppgi.ufrj.br
1 Federal University of
Rio de Janeiro - UFRJ
2 Military Institute of
Engineering - IME
3 Federal Rural University
of Rio de Janeiro - UFRRJ