Provenance is broadly defined as the origin or source from which something comes and the history of its subsequent owners. In data-, process-, and computation-intensive disciplines, provenance focuses on describing and understanding where and how data is produced, the actors involved in its production, and the processes applied to it. Provenance has been a hot topic in scientific disciplines in recent years, with a strong emphasis on eScience, where technologies and means for representing provenance have been proposed with varying degrees of expressivity. As the amount of data involved has grown across domains, provenance models have evolved into semantic overlays, which describe provenance at different levels of granularity and facilitate user understanding. Nowadays, the need for provenance analysis has expanded beyond scientific domains into the Web of Data arena. The abundance of data is encouraging organizations and governments to publish and expose their data so that it is available to the public and can be reused for a number of purposes through the Linked Data initiative. However, while a significant number of large, interlinked data sets, such as those of the UK government and the BBC web sites, are now starting to become publicly available, important challenges still need to be addressed before this vision can be achieved. Among them, provenance is one of the most outstanding issues in guaranteeing data quality, trustworthiness, and reliability in the Web of Data. In this talk, we will provide an insight into provenance, from eScience to the Web of Data, describing old problems and new challenges that need to be addressed in the upcoming years.
3. Provenance is…
Records of
Origin or source from
which something comes
History of subsequent
owners (change of
custody)
Adapted from James Cheney’s Principles of Provenance
4. Provenance is…
Evidence of authenticity, integrity,
and quality
Certifies products of good process
Adapted from James Cheney’s Principles of Provenance
5. Provenance is…
Valuable
Hard to collect and verify
Necessary to assign credit
…and blame
i.e. establish
Trust
Adapted from James Cheney’s Principles of Provenance
6. Why provenance of electronic data is difficult
Paper data                                    Electronic data
Creation process leaves a paper trail         Often, there is no bit trail
Easier to detect modification, copy, forgery  Easy to forge, plagiarize, and modify data
Usually, one can judge a book by the cover    There is no cover to judge by
Addressing this requires explicitly representing the provenance of data, storing it, keeping it secure, and reasoning with it.
Adapted from James Cheney’s Principles of Provenance
7. Provenance in eScience
One of the most active fields in Provenance development
Curated scientific biological databases
- Ensure database quality
- Need provenance for data quality control and accountability
- Currently done manually by curators
Scientific workflows – grid computing
- Abstract process execution complexity
- Need provenance for process reproducibility, efficiency
- Currently supported by ad-hoc systems
11. Semantic overlays for provenance analysis
Objective: To support domain experts in understanding process executions
Problem Solving Methods (PSMs) (McDermott 1988)
• Provide reusable guidelines to formulate process knowledge
• Support reasoning
• Describe the main rationale behind a process
[Figure labels: Semantic Overlays; How, What, Whom; PROVENANCE; SMEs]
12. PSM perspectives
Task-method decomposition
- Hierarchically defines how tasks decompose into simpler (sub)tasks
- Describes tasks at several levels of detail
- Provides alternative ways to achieve a task
Interaction
- Black-box perspective
- Knowledge transformation within the PSM
Knowledge flow
- PSM establishes and controls the sequence of actions required to perform a task
- Defines knowledge required at each task step
[Figure legend: Task, Method, Role]
13. Towards knowledge provenance
PSMs as semantic overlays on top
of existing process documentation
Task: What is going to be
achieved by executing a process
PSM: HOW
Provenance, from a knowledge perspective
- How recorded provenance relates to the execution of a
process
- Simpler process analysis by proposing decompositions into
simpler subprocesses
- Visualize provenance at different levels of detail
Supporting domain experts in two main ways
- Validation of process executions
- Identification of reasoning patterns in process executions
Source: myGrid
14. The twig join function
Based on XML pattern matching algorithms on Directed Acyclic
Graphs (Bruno et al., 2002)
twig_join detects the occurrence of a pattern in an XML DAG
Given
- P, a process
- T, a task potentially describing P
- M, a PSM providing a strategy on how to achieve T
- i(T), the set of input roles of T
- o(T), the set of output roles of T
- D, the DAG resulting from documenting the execution of P
twig_join(D,i(T),o(T)) checks whether a twig exists for M that
connects i(T) with o(T) in D
In this case, PSM M is the pattern to be identified in the process
documentation DAG D
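The check described on this slide can be sketched as follows. This is a minimal illustration, not the holistic twig join algorithm of Bruno et al.: it assumes that D is given as an adjacency map, that each node carries a single role label, and that the PSM pattern is a small tree whose edges stand for ancestor-descendant relationships. The helper names (`descendants`, `matches_at`), the extra `roles` argument, and the example node names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Dag = Dict[str, List[str]]        # node -> direct successors in D
Twig = Tuple[str, List["Twig"]]   # (role label, child twigs)


def descendants(dag: Dag, node: str) -> Set[str]:
    """Return every node reachable from `node` via one or more edges."""
    seen: Set[str] = set()
    stack = list(dag.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dag.get(current, []))
    return seen


def matches_at(dag: Dag, roles: Dict[str, str], node: str, twig: Twig) -> bool:
    """True if `twig` can be embedded in the DAG with its root at `node`."""
    label, children = twig
    if roles.get(node) != label:
        return False
    below = descendants(dag, node)
    # Every child twig must match at some descendant of this node.
    return all(
        any(matches_at(dag, roles, d, child) for d in below)
        for child in children
    )


def twig_join(dag: Dag, roles: Dict[str, str], twig: Twig) -> bool:
    """True if the pattern occurs anywhere in the documentation DAG."""
    return any(matches_at(dag, roles, node, twig) for node in roles)


# Hypothetical log: an input is processed by a step that yields two outputs.
D = {"seq": ["align"], "align": ["scores", "report"], "scores": [], "report": []}
roles = {"seq": "input", "align": "step", "scores": "output", "report": "output"}
pattern = ("input", [("step", [("output", [])])])   # input role ... output role
print(twig_join(D, roles, pattern))
```

Here the root of the pattern plays the part of i(T) and its leaves the part of o(T); a production implementation would also need to handle several input roles and avoid recomputing reachability for every node.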
15. A twig join example in provenance analysis
Domain Bridges PSM entities
entities (mapping)
twig join!
16. The matching algorithm
• twig_join recursively applied at each decomposition level
• Each task decomposed by one or several PSMs (task-method decomposition view)
• Knowledge flow defines the sequence of evaluation
Backtracking possible at PSM and role levels
[Figure: twig_join(Ti, D), then decompose(Ti), then twig_join(T11, D), twig_join(T12, D), twig_join(T13, D), twig_join(T14, D); panels: Task-method decomposition, Knowledge flow, Interaction]
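The recursion on this slide might look roughly like the sketch below. It assumes a hypothetical `decompositions` map from each task to its alternative PSMs, each PSM being the list of its subtasks in knowledge-flow order, and a hypothetical `task_matches` callable standing in for twig_join(T, D). Only backtracking at PSM level is shown; backtracking at role level is left out.

```python
from typing import Callable, Dict, List

# task -> alternative PSMs; each PSM lists its subtasks in knowledge-flow order
Decompositions = Dict[str, List[List[str]]]


def match_task(task: str,
               decompositions: Decompositions,
               task_matches: Callable[[str], bool]) -> bool:
    """Check a task against the log, then recurse into its decomposition."""
    if not task_matches(task):      # stands in for twig_join(T, D)
        return False
    psms = decompositions.get(task, [])
    if not psms:                    # leaf task: nothing left to decompose
        return True
    # Try each alternative PSM in turn. A PSM whose subtasks do not all match
    # is abandoned and the next one is tried (backtracking at PSM level).
    return any(
        all(match_task(sub, decompositions, task_matches) for sub in subtasks)
        for subtasks in psms
    )


# Hypothetical hierarchy: T1 can be achieved by either of two PSMs.
decompositions = {"T1": [["T11", "T12"], ["T13", "T14"]]}
found_in_log = {"T1", "T11", "T13", "T14"}   # T12 left no trace in the log

print(match_task("T1", decompositions, found_in_log.__contains__))
```

In this example the first PSM fails on T12, so the second PSM is tried and succeeds.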
20. KOPE evaluation (II)
Focus on precision and recall metrics
Identified at three different layered contexts
- Method
- Task
- Decomposition-level
Goal 1: identify the main rationale behind process executions by detecting occurrences of semantic overlays in their logs
Goal 2: To exploit the structure of semantic overlays to describe process executions at different levels of detail
[Chart: Precision and Recall (0% to 120%) for Level1, Level2, Level3, Level4; legend: Perfect match, Partial match, No match]
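For reference, the two metrics can be computed as in the sketch below, assuming the overlays detected in a log and the overlays actually present are available as sets. The identifiers and numbers are made up for illustration and are not results from the KOPE evaluation.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Precision = correct detections / all detections;
    recall = correct detections / all expected occurrences."""
    correct = len(detected & expected)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Made-up overlay identifiers, for illustration only.
detected = {"psm_a", "psm_b", "psm_c", "psm_d"}
expected = {"psm_a", "psm_b", "psm_c", "psm_e", "psm_f"}
precision, recall = precision_recall(detected, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```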
23. While the economy contracts, the digital universe expands…
Source: IDC
In 2006, the size of the digital universe
was estimated at 161 exabytes
3 million times the information in all
books ever written
By 2010, expected to reach 988
exabytes
…and all this data is potentially
exposed online
25. The Linked Data paradigm
How can we exploit all the available data?
Data reuse and remix
Common flexible and usable APIs
Standard vocabularies to describe interlinked datasets
Tools
Realize the Semantic Web vision
Tim Berners-Lee, 2006 (Design Issues)
1. Use URIs to identify things
- Anything, not just documents
2. Use HTTP URIs for people to look up such names
- Globally unique names
- Distributed ownership
3. Provide useful information in RDF upon URI resolution
4. Include RDF links to other URIs
- Enable discovery of related information
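As a small illustration of the four principles, the sketch below describes one thing with an HTTP URI and serializes two statements about it in N-Triples form. The `http://example.org/...` URI is a placeholder, the helper names `term` and `to_ntriples` are hypothetical, and the serializer is deliberately simplistic (it only distinguishes HTTP URIs from plain literals).

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def term(value: str) -> str:
    """Serialize an HTTP URI as <...> and anything else as a plain literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples: List[Triple]) -> str:
    """Render one statement per line, each terminated by ' .'."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


# Principles 1 and 2: an HTTP URI names a thing (placeholder URI).
city = "http://example.org/id/madrid"
triples = [
    (city, RDFS_LABEL, "Madrid"),                               # principle 3
    (city, OWL_SAME_AS, "http://dbpedia.org/resource/Madrid"),  # principle 4
]
print(to_ntriples(triples))
```

In practice an RDF library would take care of serialization and escaping; the point here is only that links to other URIs are ordinary statements.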
31. The Web of Data
Apply the Linked Data principles to expose open datasets in
RDF
Define RDF links between data items from different datasets
Over 7.5 billion triples, 5 million links (as of November 2009)
34. A real-life example
Linking and exploiting distributed data sets without the
means to check their provenance can be harmful,
especially in sensitive domains.
Two fake web sites
A fake Wikipedia entry
Fake California public safety phone
numbers
The hoax caused a 1000-word tome in
Frankfurter Allgemeine Zeitung… and
public apologies from DPA
Trust in Wikipedia misled DPA
In a provenance-aware world, DPA
would have had means based on data
provenance to automatically check that
- The town does not exist
- The Berlin Boys do not exist
- The reporting local TV station does not exist
35. The Linked Data flow
[Diagram: the Linked Data flow]
Sources: Web documents, multimedia, legacy resources (e.g. DBs, XML repositories)
Publish Linked Data (RDF, HTTP, URIs)
Linked Data
Exploit Linked Data (SPARQL EPRs)
Linked Data applications
Provenance: data lineage, data quality, data trustworthiness
36. Provenance and Linked Data
Linked Data is largely about reuse. However, reusing data from third
parties requires knowing its provenance!
Is the data reliable? Is the quality of the data good?
Provenance shall provide the ability to
- Trace the sources of data
- Enable the exploration of relationships between datasets, their authors and
affiliations
Provenance analysis shall provide an insight into how data is produced
and exploited
Provenance should create a notion of information quality
- is a certain dataset consistent and up to date?
- is the connection between two interlinked datasets meaningful?
- is a given dataset relevant for a particular domain?
Provenance to establish information trustworthiness
Provenance to provide data views following some criteria
37. Provenance challenges in the Web of Data
Provenance information needs to be
Represented
Captured and recorded
Stored and secured, queried, and reasoned about
Visualized and browsed
38. A Provenance architecture for the Web of Data
Authoritative
agencies required
to certify and keep
data provenance
secure!!!
39. Semantics in support of provenance in the Web of Data
Semantic Web stack
Provenance stack
This, we still
need to define!
Information quality
inference
Trust inference
Reasoning with provenance
Provenance querying
Provenance capture
Provenance access policy definition
Provenance encryption
40. Towards a model of Web Data provenance
Adapted from Olaf Hartig’s Provenance
Information in the Web of data
Provenance represented as a graph
- Nodes: provenance elements (pieces of provenance information)
- Edges: relate provenance elements to each other
- Subgraphs for related data items possible
Provenance models define
- Types of provenance elements (roles)
- Relationships between them
Actor
Execution
Artifact
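The graph model on this slide can be sketched with typed nodes and labeled edges. This is a simplified illustration rather than the model itself: the class name, the relation names (`generated_by`, `used`, `performed_by`), and the example elements are all hypothetical. `trace` follows edges from a data item back to everything it depends on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ProvenanceGraph:
    kinds: Dict[str, str] = field(default_factory=dict)              # node -> element type
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node: str, kind: str) -> None:
        if kind not in {"Actor", "Execution", "Artifact"}:
            raise ValueError(f"unknown provenance element type: {kind}")
        self.kinds[node] = kind

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def trace(self, node: str) -> Set[str]:
        """Return every element the given node depends on, directly or not."""
        seen: Set[str] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for source, _, target in self.edges:
                if source == current and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


g = ProvenanceGraph()
g.add_node("report.rdf", "Artifact")
g.add_node("conversion-run", "Execution")
g.add_node("census.csv", "Artifact")
g.add_node("alice", "Actor")
g.add_edge("report.rdf", "generated_by", "conversion-run")
g.add_edge("conversion-run", "used", "census.csv")
g.add_edge("conversion-run", "performed_by", "alice")

print(sorted(g.trace("report.rdf")))  # ['alice', 'census.csv', 'conversion-run']
```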
41. Provenance-related vocabularies
DC – Dublin Core Metadata Terms
FOAF – Friend of a Friend
SIOC – Semantically-Interlinked Online
Communities
SWP – Semantic Web Publishing vocabulary
WOT – Web Of Trust schema
VOiD – VOcabulary of Interlinked Datasets
However, general lack of
provenance-related
metadata on the Web of
Data!
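One way to make this lack of metadata visible is to check dataset descriptions for a few provenance-related properties. The sketch below uses three Dublin Core terms (`creator`, `source`, `created`); treating exactly these three as the expected set is an assumption made for illustration, and the dataset description and function name are hypothetical.

```python
from typing import Dict, List

DCTERMS = "http://purl.org/dc/terms/"

# Assumed minimal set of provenance properties; the choice is illustrative.
EXPECTED = [DCTERMS + "creator", DCTERMS + "source", DCTERMS + "created"]


def missing_provenance(description: Dict[str, str]) -> List[str]:
    """List the expected provenance properties absent from a dataset description."""
    return [prop for prop in EXPECTED if not description.get(prop)]


# Hypothetical dataset description: property URI -> value.
dataset = {
    DCTERMS + "title": "Example census dataset",
    DCTERMS + "creator": "http://example.org/id/statistics-office",
}
print(missing_provenance(dataset))
```

For this description the check reports that `source` and `created` are missing.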
42. Action points
Provenance vocabularies
Awareness of data providers
Tools for data providers
- W3C Provenance IG
- Represent and reason with trust and information quality
- Generation of provenance metadata
- Extend emerging Linked Data vocabularies
- Provenance authoritative agencies
- Linked Data standards (VOiD, VOiD again)
- Provenance visualization
Thanks for your attention!
José Manuel Gómez-Pérez
R&D Director
iSOCO
T +34913349778
M +34609077103
jmgomez@isoco.com