NII,Tokyo,July2014–PaoloMissier
The W3C PROV standard:
data model for the provenance of information,
and enabler for trustworthy publication
and exchange of open data
Paolo Missier, PhD
School of Computing Science
Newcastle University
Newcastle upon Tyne, UK
NII, Tokyo, July, 2014
Motivation: generating and publishing genomics data
• Next Generation Sequencing at the forefront of genomics
• the number of DNA base pairs that can be sequenced per $ doubles every five
months (2010)
• In the UK, the cost of sequencing a single patient sample is currently just under
$1.5K and decreasing
• Genetic testing: from research method to clinical diagnostic tool
• Key technology: Whole-exome / Whole-genome processing
pipelines (WEP/WGP)
• Key problem: assessing the reliability of the results
Goal of data processing and interpretation:
to rapidly identify genetic mutations across the entire genome, which:
• Have known associations to genetic diseases
• Are unknown but potentially deleterious
Specifically important in the study of rare diseases
Data publication and reuse in science/biology/genomics
Public, genome-wide gene expression data is potentially highly
reusable
Rung, Johan, and Alvis Brazma. “Reuse of Public Genome-Wide Gene Expression Data.”
Nature Reviews. Genetics 14, no. 2 (March 2013): 89–99. doi:10.1038/nrg3394.
But:
• Published data must be provably correct, trustworthy
Approximately half of the studies that use public gene expression data rely
solely on existing data without adding newly generated data, and half of
them use the public data in combination with new data.
Problem:
• A large WEP/WES space, many experimental
configurations, many possible results
Workflow for programming pipelines
Multiple Workflow systems for implementing pipelines…
[1] Torri, Federica, Ivo D Dinov, Alen Zamanyan, Sam Hobel, Alex Genco, Petros Petrosyan, Andrew
P Clark, et al. “Next Generation Sequence Analysis and Computational Genomics Using
Graphical Pipeline Workflows.” Genes 3, no. 3 (August 30, 2012): 545–575.
doi:10.3390/genes3030545.
[2] Goecks, Jeremy, Anton Nekrutenko, and James Taylor. “Galaxy: A Comprehensive Approach for
Supporting Accessible, Reproducible, and Transparent Computational Research in the Life
Sciences.” Genome Biology 11, no. 8 (January 2010): R86. doi:10.1186/gb-2010-11-8-r86.
[3] Reid, Jeffrey, Andrew Carroll, Narayanan Veeraraghavan, and Mahmoud Dahdouli. “Launching Genomics
into the Cloud: Deployment of Mercury, a next Generation Sequence Analysis Pipeline.” BMC
Bioinformatics (2014).
LONI Pipeline (UCLA, USA) [1]
Galaxy [2]
Mercury (Baylor College of Medicine, Houston, TX, USA) [3]
e-Science Central (Newcastle, UK) [4]
[4] Watson, Paul, Hugo Hiden, and Simon Woodman. “E-Science Central for CARMEN:
Science as a Service.” Concurrency and Computation: Practice and Experience 22, no. 17
(2010): 2369–2380. doi:10.1002/cpe.1611.
Multiple Pipeline configurations
Many tools to choose from, multiple ways to configure each tool
From: Pabinger, Stephan, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana
Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, and Zlatko Trajanoski. “A
Survey of Tools for Variant Analysis of next-Generation Genome Sequencing Data.” Briefings in
Bioinformatics (January 21, 2013): bbs086–. doi:10.1093/bib/bbs086.
… and different configurations yield very different results
Outcomes are very sensitive to pipeline configuration
False positives, false negatives
The set of genetic mutations identified in one individual may vary
greatly depending on the tools used
Also: tools evolve over time → longitudinal variations in results
The Cloud-e-Genome project
Goal 1:
provide mechanisms to rapidly and flexibly
create new WEP pipelines, and to deploy them
in a scalable way;
Goal 2:
provide clinicians with a tool for analysis
and interpretation of human variants
• 2 year pilot project
• Funded by UK’s National Institute for Health Research (NIHR)
through the Biomedical Research Centre (BRC)
Challenge:
to deliver the benefits of WES/WGS technology to clinical practice
NGS data processing
Human variant
interpretation for
clinical diagnosis
Implementing the pipeline using workflow technology
Pipeline evolution
Pipeline:
set C = { c1 … cn } of components (tool wrappers)
Each ci has a configuration conf(ci) and a version v(ci)
What can change
1 - Tool version:
v(ci) → v'(ci)
2 - Tool replacement / addition / removal:
ci → c'i
3 - Configuration parameters:
conf(ci) → conf'(ci)
…and why
• Technology / algorithm evolution
• Traditional GATK variant caller → GATK haplotype caller
• Does the interface change?
• Do the operational assumptions change?
E.g. GATK Variant Recalibrator requires large input data. Not suitable for targeted sequencing
For sequence alignment alone, Pabinger et al. list 17 aligners in their survey (*), while
for variant annotation they refer to over 70 tools
(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J.
Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.”
Briefings in bioinformatics, pp. bbs086–, Jan. 2013
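The three kinds of change listed on this slide (tool version, tool replacement / addition / removal, configuration parameters) can be detected mechanically by comparing two pipeline descriptions. The following is a minimal sketch under stated assumptions: the `Component` class, the step names, the version strings and the `buildver` parameter are all hypothetical and chosen only for illustration; they are not taken from the Cloud-e-Genome implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A tool wrapper c_i with its version v(c_i) and configuration conf(c_i)."""
    name: str
    version: str
    conf: tuple = ()  # (parameter, value) pairs; a tuple keeps the object hashable


def diff_pipelines(old, new):
    """Classify the changes between two pipelines, keyed by pipeline step."""
    changes = []
    for step in sorted(set(old) | set(new)):
        a, b = old.get(step), new.get(step)
        if a is None:
            changes.append((step, "added"))
        elif b is None:
            changes.append((step, "removed"))
        elif a.name != b.name:
            changes.append((step, "replaced"))
        else:
            if a.version != b.version:
                changes.append((step, "version"))
            if a.conf != b.conf:
                changes.append((step, "configuration"))
    return changes


# Hypothetical pipeline versions: version strings and parameters are made up.
v1 = {
    "align": Component("BWA", "0.6.2"),
    "annotate": Component("Annovar", "2013", (("buildver", "hg19"),)),
}
v2 = {
    "align": Component("BWA", "0.7.5"),
    "annotate": Component("Annovar", "2013", (("buildver", "hg38"),)),
}
changes = diff_pipelines(v1, v2)
print(changes)
```

Keying the comparison by pipeline step rather than by tool name is a design choice of this sketch: it lets a replaced tool be reported as a replacement instead of as an unrelated removal plus addition.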
How do you know published results are sound?
Mechanisms for data dissemination exist
Data journals
Data repositories
Data structures: Research Objects
(from ResearchObject.org)
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P.
Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer
Systems (2011). doi:10.1016/j.future.2011.08.004.
… but they are not enough to meet two key requirements:
• Attribution of published data to its producers
• Verifiability and reproducibility of scientific results
Role of provenance
Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is evidence in support of clinical diagnosis
1. Why do these variants appear in the output list?
2. Why have you concluded they are disease-causing?
• Requires ability to trace variants through workflow execution
• Workflow managers provide this
“Why are these variants included in the results?”
“Why do these two results differ?”
Why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable process analysis for debugging, improvement,
evolution
The W3C Working Group on Provenance: timeline
2009-2010: W3C Incubator Group on provenance
Chair: Yolanda Gil, ISI, USC
Main output: “Provenance XG Final Report”
http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group
April 2011: W3C Working Group approved
Chairs: Luc Moreau, Paul Groth
April 2013: Proposed Recommendations finalised
prov-dm: Data Model
prov-o: OWL ontology, RDF encoding
prov-n: prov notation
prov-constraints
...plus a number of non-prescriptive Notes
http://www.w3.org/2011/prov/wiki/
PROV: scope and structure
source: http://www.w3.org/TR/prov-overview/
Recommendation
track
PROV Core Elements (graph depiction)
An entity is a physical, digital, conceptual, or other kind of thing with some fixed
aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it
may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place,
for the existence of an entity, or for another agent's activity.
Generation, Usage
Generation is the completion of production of a new entity by an activity. This entity did not
exist before generation and becomes available for usage after this generation.
Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had
not begun to utilize this entity
PROV is based on a notion of instantaneous events, that mark transitions in the world
- generation, usage (and others)
Ordering constraints amongst events:
“generation of e must precede each of its usages”
“a can only use / generate e after it has started and before it has ended”
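The two ordering constraints quoted above can be checked directly once every event carries a timestamp. This is a minimal sketch, assuming all times are comparable numbers on a single clock (a simplification: PROV itself does not assume a global timeline). The function name and the data layout are hypothetical.

```python
def check_event_ordering(started, ended, generated, used):
    """Return a list of violated ordering constraints.

    started, ended: activity -> time
    generated:      entity -> (activity, time)
    used:           list of (activity, entity, time)
    """
    violations = []
    for e, (a, t) in generated.items():
        # an activity can only generate e after it has started and before it has ended
        if not started[a] <= t <= ended[a]:
            violations.append(f"generation of {e} outside {a}")
    for a, e, t in used:
        # an activity can only use e after it has started and before it has ended
        if not started[a] <= t <= ended[a]:
            violations.append(f"usage of {e} outside {a}")
        # generation of e must precede each of its usages
        if e in generated and t < generated[e][1]:
            violations.append(f"usage of {e} by {a} precedes its generation")
    return violations


started = {"drafting": 1, "commenting": 5}
ended = {"drafting": 4, "commenting": 8}
generated = {"draftV1": ("drafting", 3)}

ok = check_event_ordering(started, ended, generated, [("commenting", "draftV1", 6)])
bad = check_event_ordering(started, ended, generated, [("commenting", "draftV1", 2)])
print(ok)
print(bad)
```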
Concepts and relations
Generation of “draft v1” expressed as relation:
wasGeneratedBy(“draft v1”, ...)
Usage of “draft v1” by “commenting” expressed as relation:
used(“commenting”, “draft v1”, ...)
PROV notation
document
prefix prov <http://www.w3.org/ns/prov#>
prefix ex <http://www.example.com/>
entity(ex:draftComments)
entity(ex:draftV1, [ ex:distr="internal", ex:status="draft" ])
entity(ex:paper1)
entity(ex:paper2)
activity(ex:commenting)
activity(ex:drafting)
wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
used(ex:commenting, ex:draftV1, -)
wasGeneratedBy(ex:draftV1, ex:drafting, -)
used(ex:drafting, ex:paper1, -)
used(ex:drafting, ex:paper2, -)
endDocument
Same example: PROV-O notation (RDF/N3)
:draftComments a prov:Entity ;
    :distr "internal"^^xsd:string ;
    prov:wasGeneratedBy :commenting .
:commenting a prov:Activity ;
    prov:used :draftV1 .
:draftV1 a prov:Entity ;
    :distr "internal"^^xsd:string ;
    :status "draft"^^xsd:string ;
    :version "0.1"^^xsd:string ;
    prov:wasGeneratedBy :drafting .
:drafting a prov:Activity ;
    prov:used :paper1,
        :paper2 .
:paper1 a prov:Entity,
    "reference"^^xsd:string .
:paper2 a prov:Entity,
    "reference"^^xsd:string .
Association, Attribution, Delegation: who did what?
An activity association is an assignment of responsibility to an agent for an activity,
indicating that the agent had a role in the activity.
Attribution is the ascribing of an entity to an agent.
entity(ex:draftComments, [ ex:distr="internal" ])
activity(ex:commenting)
agent(ex:Bob, [prov:type = "mainEditor"])
agent(ex:Alice, [prov:type = "srEditor"])
wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)
Same example: PROV-O notation (RDF/N3)
:Alice a prov:Agent,
        "ex:chiefEditor" ;
    :firstName "Alice" ;
    :lastName "Cooper" .
:Bob a prov:Agent,
        "ex:seniorEditor" ;
    :firstName "Robert" ;
    :lastName "Thompson" ;
    prov:actedOnBehalfOf :Alice .
:draftComments prov:wasAttributedTo :Bob .
:drafting a prov:Activity ;
    prov:wasAssociatedWith :Bob .
Association and Attribution
Q.: what is the relationship between attribution and association?
This is defined as an inference rule in the PROV-CONSTR document
entity(e)
agent(Ag)
activity(a)
wasAttributedTo(e, Ag)
wasGeneratedBy(e, a)
wasAssociatedWith(a, Ag)
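One way to read the statements above, following the attribution inference in PROV-CONSTRAINTS: if e is attributed to Ag, then there is some activity a that generated e and with which Ag was associated. The sketch below looks for such a witness activity in a set of explicit statements; the function name and the pair-based encoding of statements are assumptions made for illustration.

```python
def attribution_witnesses(e, ag, generated_by, associated_with):
    """Activities a with wasGeneratedBy(e, a) and wasAssociatedWith(a, ag)."""
    return sorted(
        a for (entity, a) in generated_by
        if entity == e and (a, ag) in associated_with
    )


# Statements encoded as pairs: (entity, activity) and (activity, agent).
generated_by = {("ex:draftComments", "ex:commenting")}
associated_with = {("ex:commenting", "ex:Bob")}

witnesses = attribution_witnesses("ex:draftComments", "ex:Bob",
                                  generated_by, associated_with)
print(witnesses)
```

An empty result does not make the attribution invalid: the inference only says that such an activity exists, not that it has been recorded.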
Communication amongst activities
Communication is the exchange of some unspecified entity by two
activities, one activity using some entity generated by the other.
activity(ex:commenting)
activity(ex:drafting)
wasInformedBy(ex:commenting, ex:drafting)
:drafting a prov:Activity .
:commenting a prov:Activity ;
prov:wasInformedBy :drafting .
Communication, generation, usage
activity(ex:commenting)
activity(ex:drafting)
entity(e)
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e,ex:drafting)
used(ex:commenting, e)
Q.: what is the relationship between communication, generation, and usage?
These are inference rules 5 and 6 in the PROV-CONSTR document
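The direction of the rule that derives communication from generation and usage can be sketched in a few lines: whenever an activity a2 used an entity that a1 generated, wasInformedBy(a2, a1) follows. The function name and the pair-based encoding are assumptions for illustration only.

```python
def infer_communication(generated_by, used):
    """From wasGeneratedBy(e, a1) and used(a2, e), infer wasInformedBy(a2, a1)."""
    producers = {}
    for e, a1 in generated_by:
        producers.setdefault(e, set()).add(a1)
    return {(a2, a1) for a2, e in used for a1 in producers.get(e, ())}


generated_by = [("e", "ex:drafting")]   # wasGeneratedBy(e, ex:drafting)
used = [("ex:commenting", "e")]         # used(ex:commenting, e)

informed_by = infer_communication(generated_by, used)
print(informed_by)
```

The converse direction (from wasInformedBy to the existence of some exchanged entity) only asserts that an entity exists, so it cannot be computed by lookup in the same way.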
Summary of the PROV Core model
Derivation amongst entities
A derivation is a transformation of an entity into another, an update of an entity
resulting in a new one, or the construction of a new entity based on a pre-existing
entity.
entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ;
prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
Relations may be given identifiers
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)
gen1 denotes a generation event
use1 denotes a usage event
General derivation relation:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Relation IDs make it possible to refer to relations in other relations
Rendering N-ary relations in PROV-O
RDF is for binary relations; N-ary relations require reification
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments,
ex:commenting,
2013-03-18T10:00:01)
used(use1; ex:commenting, ex:draftV1, -)
:draftComments a prov:Entity ;
    prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ;
    prov:activity :commenting ;
    prov:atTime "2013-03-18T10:00:01+09:00" .
:commenting a prov:Activity ;
    prov:qualifiedUsage :use1 .
:use1 a prov:Usage ;
    :note "found comments useful" ;
    prov:atTime "2013-03-21T10:00:01+09:00" ;
    prov:entity :draftV1 .
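The reification step shown above can be written as a small function that expands one n-ary generation statement into binary triples, using an intermediate node for the generation event. This is a sketch only: triples are plain Python tuples, and `qualify_generation` is a hypothetical helper, not part of any PROV library.

```python
def qualify_generation(gen_id, entity, activity, time=None):
    """Expand wasGeneratedBy(gen_id; entity, activity, time) into binary triples."""
    triples = [
        (entity, "prov:qualifiedGeneration", gen_id),
        (gen_id, "rdf:type", "prov:Generation"),
        (gen_id, "prov:activity", activity),
    ]
    if time is not None:
        # the timestamp hangs off the generation node, not off the entity
        triples.append((gen_id, "prov:atTime", time))
    return triples


triples = qualify_generation(":gen1", ":draftComments", ":commenting",
                             "2013-03-18T10:00:01+09:00")
for t in triples:
    print(t)
```

The same shape applies to usage, with prov:qualifiedUsage, prov:Usage and prov:entity in place of the generation terms.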
“Qualified relation” RDF pattern
:draftComments a prov:Entity ;
    prov:qualifiedGeneration :gen1 .
:gen1 a prov:Generation ;
    prov:activity :commenting ;
    prov:atTime "2013-03-18T10:00:01+09:00" .
:commenting a prov:Activity ;
    prov:qualifiedUsage :use1 .
:use1 a prov:Usage ;
    :note "found comments useful" ;
    prov:atTime "2013-03-21T10:00:01+09:00" ;
    prov:entity :draftV1 .
Plans: why was something done?
Most relation types have two arguments, each of which is an Entity, an Activity, or an Agent
Derivation is one exception:
wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
Two other notable exceptions:
- Associations with a plan
- Delegation with an activity scope
wasAssociatedWith(id; a, ag, pl, attrs)
A plan is an entity that represents a set of actions or steps
intended by one or more agents to achieve some goal
Association with a plan
A plan plays a role in an association
Plans are typed entities
activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = "JVM-6.0"])
entity(ex:myCleverProgram,
    [prov:type='prov:Plan', ex:label="Program 1"])
wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM,
    ex:myCleverProgram,
    [prov:role="defaultRuntime",
     ex:accessPath="webapp"])
A plan is an entity having prov:type = 'prov:Plan'
Plan pattern as PROV-O
:_aProgramExecution a prov:Activity ;
    :execTime "22.5sec" ;
    prov:qualifiedAssociation [ a prov:Association ;
        :accessPath "webapp" ;
        prov:agent :_aJVM ;
        prov:hadPlan :myCleverProgram ;
        prov:hadRole "defaultRuntime" ] .
:_aJVM a prov:Agent, "Java-6.0" .
:myCleverProgram a prov:Entity, prov:Plan .
activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
agent(ex:_aJVM, [prov:type = "JVM-6.0"])
entity(ex:myCleverProgram,
    [prov:type='prov:Plan', ex:label="Program 1"])
wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM,
    ex:myCleverProgram,
    [prov:role="defaultRuntime",
     ex:accessPath="webapp"])
Delegation within an activity scope
Real-world artifacts vs provenance entities
ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance
“What do I know about the car I see in this Cambridge street today?”
• It was bought by Joe in 2011
• Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it
• Joe drove it to Cambridge on March 18th, 2013.
“Same” car, but different provenance at
each stage of its evolution
Alternate-specialization pattern
Two alternate entities present aspects of the same thing. These aspects may be the same or
different, and the alternate entities may or may not overlap in time.
An entity that is a specialization of another shares all aspects of the latter, and additionally
presents more specific aspects of the same thing as the latter.
...But, this is still that car!
Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1)
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
[Figure annotations: “differing in their location”; “same owner, added location”]
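The three semantic notes on this slide can be applied mechanically to a set of specializationOf statements. The sketch below computes the transitive closure (note 3), then derives alternateOf pairs (notes 1 and 2). The entity names are hypothetical labels for the car example, and the function names are assumptions for illustration.

```python
def close_specialization(spec):
    """Transitive closure of specializationOf pairs (note 3)."""
    closure = set(spec)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def infer_alternates(spec, alt=()):
    """Notes 1 and 2: specialization implies alternate; alternate is symmetric."""
    pairs = set(alt) | close_specialization(spec)
    return pairs | {(b, a) for a, b in pairs}


# Hypothetical identifiers for the car example.
spec = {("car_in_cambridge", "joes_car"), ("joes_car", "car")}

closure = close_specialization(spec)
alternates = infer_alternates(spec)
print(sorted(closure))
print(sorted(alternates))
```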
Reserved attributes and types
A small set of reserved attributes, with some usage restrictions
Bundles, provenance of provenance
A bundle is a named set of provenance descriptions, and is itself an entity,
so allowing provenance of provenance to be expressed.
bundle pm:bundle1
entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(ex:draftComments, ex:commenting,-)
used(ex:commenting, ex:draftV1, -)
endBundle
...
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)
Bundles in PROV-O
Bundle definition (an RDF named graph):
ex:bundle1 {
    :draftComments a prov:Entity ;
        :status "blah" ;
        prov:wasGeneratedBy :commenting .
    :commenting a prov:Activity ;
        prov:used :draftV1 .
    :draftV1 a prov:Entity .
}
Bundle usage:
ex:bundle1 a prov:Entity, prov:Bundle ;
    prov:qualifiedGeneration [ a prov:Generation ;
        prov:atTime "2013-03-20T10:30:00+09:00" ] ;
    prov:wasAttributedTo :Bob .
Time, Events
- Provenance statements are combined by different systems
- An application may not be able to align the times involved to a single global timeline
Therefore, PROV minimizes assumptions about time
Instead, the PROV data model is implicitly based on a notion of
instantaneous events, that mark transitions in the world (*)
Events:
- activity start, activity end
- entity generation, entity usage, entity invalidation
wasStartedBy(id; a2, e, a1, t, attrs)
wasEndedBy(id; a2, e, a1, t, attrs)
(*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)
From “scruffy” provenance to “valid” provenance
- Are all possible temporal partial orderings of events equally acceptable?
- How can we specify the set of all valid orderings?
More generally, how do we formally define what it means for a set of
provenance statements to be valid?
PROV defines a set of temporal constraints that ensure consistency
of a provenance graph
Exploiting provenance: why do my results differ from yours?
Run pipeline version V1 → Variant list VL1
V1 → V2:
Replace BWA version
Modify Annovar configuration parameters
Run pipeline version V2 → Variant list VL2
Variant list VL1, Variant list VL2: ??
DDIFF (data differencing)
PDIFF (provenance differencing)
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing
for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience
(2013): doi:10.1002/cpe.3035.
PDIFF - overview
[Figure: workflows WA and WB]
The corresponding provenance traces
[Figure: provenance traces, (i) Trace A and (ii) Trace B]
Delta graph computed by PDIFF
[Figure: delta graph pairing nodes of Trace A and Trace B; annotated differences: S1, S5 (service repl.); S2, S2v2 (version change); S, Sv2 (version change)]
PDIFF helps determine the impact of
variations in the pipeline
Provenance of Linked Open Data resources
Goal: to establish a LD-compliant association between an LD
resource and a description of its provenance
• Where does the provenance of a LD resource live?
• How can it be accessed?
Why?
1. to enable LD search and discovery
• By indexing data by its provenance
• Ex. “Find all resources for which Alice is an author which contain data derived
from dataset D”
2. To enable reasoning about quality/reliability of the LD resource
• Predicates and rules over provenance
• Ex. “if D has been derived from either {A,B,C} and Alice is one of the authors,
then score  X”
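The first example query (“Find all resources for which Alice is an author which contain data derived from dataset D”) can be sketched over a small in-memory index of provenance statements. Assumptions in this sketch: authorship is modelled as attribution, “contains data derived from” is modelled as transitive wasDerivedFrom, and the resource names are invented for illustration.

```python
def derived_from(resource, derivations):
    """All entities reachable from `resource` via wasDerivedFrom edges."""
    seen, stack = set(), [resource]
    while stack:
        for src in derivations.get(stack.pop(), ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


def find_resources(author, dataset, attributions, derivations):
    """Resources attributed to `author` that are derived from `dataset`."""
    return sorted(
        r for r, agents in attributions.items()
        if author in agents and dataset in derived_from(r, derivations)
    )


# Hypothetical index: resource -> agents, and resource -> direct sources.
attributions = {"report1": {"Alice"}, "report2": {"Bob"}, "report3": {"Alice"}}
derivations = {
    "report1": ["table1"],
    "table1": ["D"],
    "report2": ["D"],
    "report3": ["other"],
}

hits = find_resources("Alice", "D", attributions, derivations)
print(hits)
```

Over real Linked Data the same query would more naturally be a SPARQL property path over prov:wasDerivedFrom; the in-memory version is used here only to keep the sketch self-contained.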
Provenance of Linked Open Data resources: how
How: Three mechanisms:
1. Provenance Access and Query (PROV-AQ) – part of the W3C
PROV recommendation suite
2. Embedding provenance statements within the resource itself
• E.g. the “Nanopublication” model
3. Packaging data + provenance as a Research Object
1. Provenance pingback and query service
Image reproduced from:
De Nies, Tom, Robert Meusel, Kai Eckert, Dominique Ritze, and Anastasia Dimou. “A
Lightweight Provenance Pingback and Query Service for Web Publications.” In
Procs. IPAW 2014. Cologne, Germany: Springer, 2014.
Objective: to decouple publishing of content and of its provenance (as LOD)
Scenario:
• Publishers publish content resources, are not responsible for provenance
• Eg. Mendeley, ResearchGate, etc.
• Authors publish provenance, are not responsible for publishing content
2. Provenance Embedding
The nanopublication model is an example of provenance embedding
within a published RDF document
From nanopub.org:
A nanopublication is the smallest unit of publishable information:
an assertion about anything that can be uniquely identified and
attributed to its author.
Individual nanopublications can be cited by others and tracked for
their impact on the community.
Nanopublication: example
Assertion: an “association” between a gene and a genetic disorder.
The strength of this association is given by a statistical p-value.
See nanopub.org for details
{ : a nanopub:Nanopublication ;
nanopub:hasAssertion :NanoPub_1_Assertion ;
nanopub:hasProvenance :NanoPub_1_Provenance .
:NanoPub_1_Provenance nanopub:hasAttribution :NanoPub_1_Attribution ;
nanopub:hasSupporting :NanoPub_1_Supporting .
:NanoPub_1_Assertion a nanopub:Assertion .
:NanoPub_1_Provenance a nanopub:Provenance .
:NanoPub_1_Attribution a nanopub:Attribution .
:NanoPub_1_Supporting a nanopub:Supporting .
}
:NanoPub_1_Assertion {
:Association_1 a sio:statistical-association ;
sio:has-measurement-value :Association_1_p_value ; sio:refers-to ...
}
:NanoPub_1_Attribution {
:pav:authoredBy res_a, reS_b.
:NanoPub_1_Assertion pav:createdBy ...;
}
:NanoPub_1_Supporting { :Association_1
opm:wasDerivedFrom gene_disease_concept_profiles_1980_2010...;
opm:wasGeneratedBy gene_disease_concept_profiles_matching_1980_2010; .
}
3. Research Objects for data and provenance packaging
Research Objects (ROs) are semantically rich aggregations of resources that
bring together data, methods and people in scientific investigations.
A Research Object is a combination of:
• Aggregation (reusing Object Reuse and Exchange [ORE])
• Annotation (reusing the Annotation Ontology [AO])
• RO ontologies
From the Wf4Ever EU project
See also:
Belhajjame K, Corcho O, Garijo D, Zhao J, Missier P, Newman DR, Palma R, Bechhofer S et al.:
Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In
proceedings of the ESWC2012 Workshop on the Future of Scholarly Communication in the
Semantic Web (SePublica2012), Heraklion, Greece, May 2012
Links to resources cited in the talk
• The PROV Data Model (PROV-DM): www.w3.org/TR/prov-dm/
• A primer on PROV with a simple running example:
http://www.w3.org/TR/prov-primer/
• LD and PROV:
• Nanopublications: nanopub.org
• Research Objects: researchobject.org
• The Wf4Ever project: www.wf4ever-project.org
• PROV Access and Query conventions (PROV-AQ):
http://www.w3.org/TR/prov-aq/
• Visualising provenance using PROV-O-Viz: http://provoviz.org/
• PROV-O-Viz video:
• PROV-O-Viz IPAW’14 paper preprint:
http://dare.ubvu.vu.nl/handle/1871/51388
• Reference:
Hoekstra, Rinke, and Paul Groth. “PROV-O-Viz - Understanding the
Role of Activities in Provenance.” In Procs. IPAW 2014. Springer, 2014.
 

The W3C PROV standard: data model for the provenance of information, and enabler for trustworthy publication and exchange of open data

  • 1. NII, Tokyo, July 2014 – Paolo Missier
      The W3C PROV standard: data model for the provenance of information, and enabler for trustworthy publication and exchange of open data
      Paolo Missier, PhD
      School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
      NII, Tokyo, July 2014
  • 2. Motivation: generating and publishing genomics data
      • Next Generation Sequencing is at the forefront of genomics
        • The number of DNA base pairs that can be sequenced per $ doubles every five months (2010)
        • In the UK, the cost of sequencing a single patient sample is currently just under $1.5K and decreasing
      • Genetic testing: from research method to clinical diagnostic tool
      • Key technology: whole-exome / whole-genome processing pipelines (WEP/WGP)
      • Key problem: assessing the reliability of the results
      Goal of data processing and interpretation: to rapidly identify genetic mutations across the entire genome, which:
      • Have known associations to genetic diseases
      • Are unknown but potentially deleterious
      Specifically important in the study of rare diseases
  • 3. Data publication and reuse in science/biology/genomics
      Public, genome-wide gene expression data is potentially highly reusable:
      "Approximately half of the studies that use public gene expression data rely solely on existing data without adding newly generated data, and half of them use the public data in combination with new data."
      Rung, Johan, and Alvis Brazma. “Reuse of Public Genome-Wide Gene Expression Data.” Nature Reviews. Genetics 14, no. 2 (March 2013): 89–99. doi:10.1038/nrg3394.
      But:
      • Published data must be provably correct and trustworthy
      Problem:
      • A large WEP/WES space, many experimental configurations, many possible results
  • 5. Multiple workflow systems for implementing pipelines…
      Systems shown: LONI Pipeline (UCLA, USA) [1]; Galaxy [2]; Mercury (Baylor College of Medicine, Houston, TX, USA) [3]; e-Science Central (Newcastle, UK) [4]
      [1] Torri, Federica, Ivo D Dinov, Alen Zamanyan, Sam Hobel, Alex Genco, Petros Petrosyan, Andrew P Clark, et al. “Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows.” Genes 3, no. 3 (August 30, 2012): 545–575. doi:10.3390/genes3030545.
      [2] Goecks, Jeremy, Anton Nekrutenko, and James Taylor. “Galaxy: A Comprehensive Approach for Supporting Accessible, Reproducible, and Transparent Computational Research in the Life Sciences.” Genome Biology 11, no. 8 (January 2010): R86. doi:10.1186/gb-2010-11-8-r86.
      [3] Reid, Jeffrey, Andrew Carroll, Narayanan Veeraraghavan, and Mahmoud Dahdouli. “Launching Genomics into the Cloud: Deployment of Mercury, a next Generation Sequence Analysis Pipeline.” BMC Bioinformatics (2014).
      [4] Watson, Paul, Hugo Hiden, and Simon Woodman. “E-Science Central for CARMEN: Science as a Service.” Concurrency and Computation: Practice and Experience 22, no. 17 (2010): 2369–2380. doi:10.1002/cpe.1611.
  • 6. Multiple pipeline configurations
      Many tools to choose from, multiple ways to configure each tool
      From: Pabinger, Stephan, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, and Zlatko Trajanoski. “A Survey of Tools for Variant Analysis of next-Generation Genome Sequencing Data.” Briefings in Bioinformatics (January 21, 2013): bbs086–. doi:10.1093/bib/bbs086.
  • 7. … and different configurations yield very different results
      • Outcomes are very sensitive to pipeline configuration: false positives, false negatives
      • The set of genetic mutations identified in one individual may vary greatly depending on the tools used
      • Also: tools evolve over time → longitudinal variations in results
  • 8. The Cloud-e-Genome project
      Challenge: to deliver the benefits of WES/WGS technology to clinical practice
      • Goal 1 (NGS data processing): provide mechanisms to rapidly and flexibly create new WEP pipelines, and to deploy them in a scalable way
      • Goal 2 (human variant interpretation for clinical diagnosis): provide clinicians with a tool for analysis and interpretation of human variants
      • 2-year pilot project
      • Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
  • 10. Pipeline evolution
      Pipeline: a set C = { c1 … cn } of components (tool wrappers). Each ci has a configuration conf(ci) and a version v(ci).
      What can change:
      1. Tool version: v(ci) → v'(ci)
      2. Tool replacement / addition / removal: ci → c'i
      3. Configuration parameters: conf(ci) → conf'(ci)
      …and why:
      • Technology / algorithm evolution
      • Traditional GATK variant caller → GATK haplotype caller
        • Does the interface change?
        • Do the operational assumptions change? E.g. the GATK Variant Recalibrator requires large input data, so it is not suitable for targeted sequencing
      Just for sequence alignment, Pabinger et al. list 17 aligners in their survey (*), while for variant annotation they refer to over 70 tools.
      (*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in Bioinformatics, pp. bbs086–, Jan. 2013
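The three kinds of change listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Component` class, the step names, the tool names, and the version strings are all hypothetical and do not come from the talk or from any real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A tool wrapper c_i with its version v(c_i) and configuration conf(c_i)."""
    name: str
    version: str
    conf: tuple = ()  # (parameter, value) pairs; all values here are made up


def diff_pipelines(old, new):
    """Classify the differences between two pipelines (dicts keyed by step name)."""
    changes = []
    for step in sorted(old.keys() | new.keys()):
        before, after = old.get(step), new.get(step)
        if before is None:
            changes.append((step, "added"))
        elif after is None:
            changes.append((step, "removed"))
        elif before.name != after.name:
            changes.append((step, "tool replaced"))
        else:
            # Same tool: look for a version and/or configuration change.
            if before.version != after.version:
                changes.append((step, "version changed"))
            if before.conf != after.conf:
                changes.append((step, "configuration changed"))
    return changes


# Hypothetical pipeline versions V1 and V2.
v1 = {
    "align": Component("bwa", "A", (("threads", 4),)),
    "call": Component("variant-caller", "A"),
    "annotate": Component("annovar", "A", (("db", "one"),)),
}
v2 = {
    "align": Component("bwa", "B", (("threads", 4),)),
    "call": Component("haplotype-caller", "A"),
    "annotate": Component("annovar", "A", (("db", "two"),)),
}
changes = diff_pipelines(v1, v2)
```

Here `changes` reports one entry per affected step, which is the kind of summary one would want to have at hand before comparing the outputs of the two pipeline versions.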
  • 11. How do you know published results are sound?
      Mechanisms for data dissemination exist:
      • Data journals
      • Data repositories
      • Data structures: Research Objects (from ResearchObject.org)
        Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.
      … but they are not enough to meet two key requirements:
      • Attribution of published data to its producers
      • Verifiability and reproducibility of scientific results
  • 12. Role of provenance
      Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
      Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
      • Provenance is evidence in support of clinical diagnosis
        1. Why do these variants appear in the output list?
        2. Why have you concluded they are disease-causing?
      • Requires the ability to trace variants through workflow execution
        • Workflow managers provide this
      “Why are these variants included in the results?” “Why do these two results differ?”
  • 13. Why does provenance matter?
      • To establish quality, relevance, trust
      • To track information attribution through complex transformations
      • To describe one’s experiment to others, for understanding / reuse
      • To provide evidence in support of scientific claims
      • To enable process analysis for debugging, improvement, evolution
  • 14. The W3C Working Group on Provenance: timeline
      • 2009-2010: W3C Incubator Group on provenance. Chair: Yolanda Gil, ISI, USC
        Main output: “Provenance XG Final Report” http://www.w3.org/2005/Incubator/prov/XGR-prov/
        - provides an overview of the various existing approaches and vocabularies
        - proposes the creation of a dedicated W3C Working Group
      • April 2011: W3C working group approved. Chairs: Luc Moreau, Paul Groth
      • April 2013: Proposed Recommendations finalised
        - prov-dm: Data Model
        - prov-o: OWL ontology, RDF encoding
        - prov-n: PROV notation
        - prov-constraints
        ...plus a number of non-prescriptive Notes
      http://www.w3.org/2011/prov/wiki/
  • 15. PROV: scope and structure
      Recommendation track. Source: http://www.w3.org/TR/prov-overview/
  • 16. PROV Core Elements (graph depiction)
      • An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
      • An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
      • An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
  • 17. Generation, Usage
      • Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.
      • Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity.
      PROV is based on a notion of instantaneous events that mark transitions in the world: generation, usage (and others).
      Ordering constraints amongst events:
      • “generation of e must precede each of its usages”
      • “a can only use / generate e after it has started and before it has ended”
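The two ordering constraints quoted on this slide can be checked mechanically once events carry timestamps. The sketch below assumes, for simplicity, that all events sit on a single comparable timeline (plain integers here); PROV itself only requires a partial order, so this is a simplification and not the PROV-CONSTRAINTS validation procedure. The function name and the data layout are hypothetical.

```python
def check_ordering(started, ended, generated, used):
    """Return a list of human-readable violations of the two ordering rules.

    started / ended: {activity: time}
    generated: list of (entity, activity, time)
    used: list of (activity, entity, time)
    Times are assumed to be mutually comparable (a simplification).
    """
    problems = []
    gen_time = {e: t for e, _, t in generated}
    for e, a, t in generated:
        # An activity can only generate an entity while it is running.
        if not started[a] <= t <= ended[a]:
            problems.append(f"{a} generated {e} outside its lifetime")
    for a, e, t in used:
        # An activity can only use an entity while it is running.
        if not started[a] <= t <= ended[a]:
            problems.append(f"{a} used {e} outside its lifetime")
        # Generation of an entity must precede each of its usages.
        if e in gen_time and t < gen_time[e]:
            problems.append(f"{e} used before it was generated")
    return problems


generated = [("draftV1", "drafting", 4)]
ok = check_ordering(
    {"drafting": 1, "commenting": 6}, {"drafting": 5, "commenting": 9},
    generated, [("commenting", "draftV1", 7)],
)
bad = check_ordering(
    {"drafting": 1, "commenting": 2}, {"drafting": 5, "commenting": 9},
    generated, [("commenting", "draftV1", 3)],
)
```

In the second call the usage at time 3 comes before the generation at time 4, so `bad` contains one violation while `ok` is empty.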
  • 18. Concepts and relations
      • Generation of “draft v1” expressed as a relation: wasGeneratedBy(“draft v1”, ...)
      • Usage of “draft v1” by “commenting” expressed as a relation: used(“commenting”, “draft v1”, ...)
  • 19. PROV notation
      document
        prefix prov <http://www.w3.org/ns/prov#>
        prefix ex <http://www.example.com/>
        entity(ex:draftComments)
        entity(ex:draftV1, [ ex:distr='internal', ex:status = "draft"])
        entity(ex:paper1)
        entity(ex:paper2)
        activity(ex:commenting)
        activity(ex:drafting)
        wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
        used(ex:commenting, ex:draftV1, -)
        wasGeneratedBy(ex:draftV1, ex:drafting, -)
        used(ex:drafting, ex:paper1, -)
        used(ex:drafting, ex:paper2, -)
      endDocument
  • 20. Same example: PROV-O notation (RDF/N3)
      :draftComments a prov:Entity ;
          :distr "internal"^^xsd:string ;
          prov:wasGeneratedBy :commenting .
      :commenting a prov:Activity ;
          prov:used :draftV1 .
      :draftV1 a prov:Entity ;
          :distr "internal"^^xsd:string ;
          :status "draft"^^xsd:string ;
          :version "0.1"^^xsd:string ;
          prov:wasGeneratedBy :drafting .
      :drafting a prov:Activity ;
          prov:used :paper1, :paper2 .
      :paper1 a prov:Entity, "reference"^^xsd:string .
      :paper2 a prov:Entity, "reference"^^xsd:string .
  • 21. Association, Attribution, Delegation: who did what?
      • An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity.
      • Attribution is the ascribing of an entity to an agent.
      entity(ex:draftComments, [ ex:distr='internal' ])
      activity(ex:commenting)
      agent(ex:Bob, [prov:type = "mainEditor"])
      agent(ex:Alice, [prov:type = "srEditor"])
      wasAssociatedWith(ex:commenting, ex:Bob, -, [prov:role = "editor"])
      actedOnBehalfOf(ex:Bob, ex:Alice)
      wasAttributedTo(ex:draftComments, ex:Bob)
  • 22. Same example: PROV-O notation (RDF/N3)
      :Alice a prov:Agent, "ex:chiefEditor" ;
          :firstName "Alice" ;
          :lastName "Cooper" .
      :Bob a prov:Agent, "ex:seniorEditor" ;
          :firstName "Robert" ;
          :lastName "Thompson" ;
          prov:actedOnBehalfOf :Alice .
      :draftComments prov:wasAttributedTo :Bob .
      :drafting a prov:Activity ;
          prov:wasAssociatedWith :Bob .
  • 23. Association and Attribution
      Q.: what is the relationship between attribution and association?
      This is defined as an inference rule in the PROV-CONSTR document:
      entity(e)
      agent(Ag)
      activity(a)
      wasAttributedTo(e, Ag)
      wasGeneratedBy(e, a)
      wasAssociatedWith(a, Ag)
  • 24. Communication amongst activities
      Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other.
      PROV-N:
      activity(ex:commenting)
      activity(ex:drafting)
      wasInformedBy(ex:commenting, ex:drafting)
      PROV-O:
      :drafting a prov:Activity .
      :commenting a prov:Activity ;
          prov:wasInformedBy :drafting .
  • 25. Communication, generation, usage
      Q.: what is the relationship between communication, generation, and usage?
      These are inference rules 5 and 6 in the PROV-CONSTR document:
      activity(ex:commenting)
      activity(ex:drafting)
      entity(e)
      wasInformedBy(ex:commenting, ex:drafting)
      wasGeneratedBy(e, ex:drafting)
      used(ex:commenting, e)
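One direction of this relationship is easy to sketch: whenever one activity generated an entity that another activity used, a communication between the two can be inferred. The Python below is a minimal illustration of that direction only; the function name and the tuple-based representation are assumptions made for the example and are not part of any PROV library.

```python
def infer_communication(generated, used):
    """Derive wasInformedBy pairs from generation and usage statements.

    generated: set of (entity, activity) pairs, read as wasGeneratedBy(entity, activity)
    used: set of (activity, entity) pairs, read as used(activity, entity)
    Returns a set of (informed, informant) pairs, read as wasInformedBy(informed, informant).
    """
    producers = {}
    for entity, activity in generated:
        producers.setdefault(entity, set()).add(activity)
    return {
        (consumer, producer)
        for consumer, entity in used
        for producer in producers.get(entity, ())
    }


generated = {("ex:draftV1", "ex:drafting")}
used = {("ex:commenting", "ex:draftV1"), ("ex:drafting", "ex:paper1")}
informed = infer_communication(generated, used)
```

With the running example, `ex:commenting` used `ex:draftV1`, which `ex:drafting` generated, so the only inferred statement is wasInformedBy(ex:commenting, ex:drafting). The usage of `ex:paper1` yields nothing because no generation of it is recorded.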
  • 27. Derivation amongst entities
      A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.
      PROV-N:
      entity(ex:draftV1)
      entity(ex:draftComments)
      wasDerivedFrom(ex:draftComments, ex:draftV1)
      PROV-O:
      :draftComments a prov:Entity ;
          prov:wasDerivedFrom :draftV1 .
      :draftV1 a prov:Entity .
      Q.: what is the relationship between derivation, generation, and usage?
  • 28. Relations may be given identifiers
      entity(ex:draftComments)
      entity(ex:draftV1)
      activity(ex:commenting)
      wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
      used(use1; ex:commenting, ex:draftV1, -)
      gen1 denotes a generation event; use1 denotes a usage event.
      Relation IDs make it possible to refer to relations in other relations. General derivation relation:
      wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
  • 29. Rendering N-ary relations in PROV-O
      RDF is for binary relations; N-ary relations require reification.
      PROV-N:
      entity(ex:draftComments)
      entity(ex:draftV1)
      activity(ex:commenting)
      wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01)
      used(use1; ex:commenting, ex:draftV1, -)
      PROV-O:
      :draftComments a prov:Entity ;
          prov:qualifiedGeneration :gen1 .
      :gen1 a prov:Generation ;
          prov:activity :commenting ;
          prov:atTime "2013-03-18T10:00:01+09:00" .
      :commenting a prov:Activity ;
          prov:qualifiedUsage :use1 .
      :use1 a prov:Usage ;
          :note "found comments useful" ;
          prov:atTime "2013-03-21T10:00:01+09:00" ;
          prov:entity :draftV1 .
  • 30. “Qualified relation” RDF pattern
      :draftComments a prov:Entity ;
          prov:qualifiedGeneration :gen1 .
      :gen1 a prov:Generation ;
          prov:activity :commenting ;
          prov:atTime "2013-03-18T10:00:01+09:00" .
      :commenting a prov:Activity ;
          prov:qualifiedUsage :use1 .
      :use1 a prov:Usage ;
          :note "found comments useful" ;
          prov:atTime "2013-03-21T10:00:01+09:00" ;
          prov:entity :draftV1 .
  • 31. Plans: why was something done?
      Most relation types have two arguments, each of which is an Entity, an Activity, or an Agent.
      Derivation is one exception: wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)
      Two other notable exceptions:
      - Association with a plan: wasAssociatedWith(id; a, ag, pl, attrs)
      - Delegation with an activity scope
      A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goal.
  • 32. Association with a plan
      A plan plays a role in an association.
  • 33. Plans are typed entities
      activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
      agent(ex:_aJVM, [prov:type = "JVM-6.0"])
      entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label="Program 1"])
      wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role="defaultRuntime", ex:accessPath="webapp"])
      A plan is an entity with prov:type = 'prov:Plan'.
  • 34. Plan pattern as PROV-O
      PROV-N:
      activity(ex:_aProgramExecution, [ex:execTime="22.5sec"])
      agent(ex:_aJVM, [prov:type = "JVM-6.0"])
      entity(ex:myCleverProgram, [prov:type='prov:Plan', ex:label="Program 1"])
      wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [prov:role="defaultRuntime", ex:accessPath="webapp"])
      PROV-O:
      :_aProgramExecution a prov:Activity ;
          :execTime "22.5sec" ;
          prov:qualifiedAssociation [ a prov:Association ;
              :accessPath "webapp" ;
              prov:agent :_aJVM ;
              prov:hadPlan :myCleverProgram ;
              prov:hadRole "defaultRuntime" ] .
      :_aJVM a prov:Agent, "Java-6.0" .
      :myCleverProgram a prov:Entity, prov:Plan .
  • 35. Plan pattern as PROV-O
      :_aProgramExecution a prov:Activity ;
          :execTime "22.5sec" ;
          prov:qualifiedAssociation [ a prov:Association ;
              :accessPath "webapp" ;
              prov:agent :_aJVM ;
              prov:hadPlan :myCleverProgram ;
              prov:hadRole "defaultRuntime" ] .
      :_aJVM a prov:Agent, "Java-6.0" .
      :myCleverProgram a prov:Entity, prov:Plan .
  • 37. Real-world artifacts vs provenance entities
      ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance
      “What do I know about the car I see in this Cambridge street today?”
      • It was bought by Joe in 2011
      • Joe drove it to Boston on March 16th, 2013. The car has now got 10,000 miles on it
      • Joe drove it to Cambridge on March 18th, 2013
      “Same” car, but different provenance at each stage of its evolution
  • 38. Alternate-specialization pattern
      • Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time. (Example: entities differing in their location.)
      • An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter. (Example: same owner, added location.)
      ...But, this is still that car!
      Semantic notes:
      1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
      2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1).
      3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
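The three semantic notes can be applied mechanically to a set of statements. The sketch below implements exactly those three rules and nothing else (PROV-CONSTRAINTS defines further properties that are not modelled here); the entity names for the car example and the function name are hypothetical.

```python
def close_relations(specialization, alternate):
    """Apply the three semantic notes to sets of (e1, e2) pairs.

    Returns (specializationOf pairs, alternateOf pairs) after closure.
    """
    spec = set(specialization)
    # Note 3: specialization is transitive (repeat until nothing new is added).
    changed = True
    while changed:
        changed = False
        for a, b in list(spec):
            for c, d in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True
    # Note 1: specialization implies alternate.
    alt = set(alternate) | spec
    # Note 2: alternate is symmetric.
    alt |= {(b, a) for a, b in alt}
    return spec, alt


spec, alt = close_relations(
    specialization={
        ("ex:carInBoston", "ex:joesCar"),
        ("ex:joesCar", "ex:car"),
    },
    alternate={("ex:carInBoston", "ex:carInCambridge")},
)
```

After closure, `ex:carInBoston` is a specialization of `ex:car` by transitivity, and every alternate pair is present in both directions.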
  • 39. Reserved attributes and types
      A small set of reserved attributes, with some usage restrictions
  • 40. Bundles, provenance of provenance
      A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed.
      bundle pm:bundle1
        entity(ex:draftComments)
        entity(ex:draftV1)
        activity(ex:commenting)
        wasGeneratedBy(ex:draftComments, ex:commenting, -)
        used(ex:commenting, ex:draftV1, -)
      endBundle
      ...
      entity(pm:bundle1, [ prov:type='prov:Bundle' ])
      wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
      wasAttributedTo(pm:bundle1, ex:Bob)
  • 41. Bundles in PROV-O
      Bundle definition (an RDF named graph):
      ex:bundle1 {
          :draftComments a prov:Entity ;
              :status "blah" ;
              prov:wasGeneratedBy :commenting .
          :commenting a prov:Activity ;
              prov:used :draftV1 .
          :draftV1 a prov:Entity .
      }
      Bundle usage:
      ex:bundle1 a prov:Entity, "prov:Bundle" ;
          prov:qualifiedGeneration [ a prov:Generation ;
              prov:atTime "2013-03-20T10:30:00+09:00" ] ;
          prov:wasAttributedTo :Bob .
  • 42. Time, Events
      - Provenance statements are combined by different systems
      - An application may not be able to align the times involved to a single global timeline
      Therefore, PROV minimizes assumptions about time. Instead, the PROV data model is implicitly based on a notion of instantaneous events that mark transitions in the world (*)
      Events:
      - activity start, activity end
      - entity generation, entity usage, entity invalidation
      wasStartedBy(id; a2, e, a1, t, attrs)
      wasEndedBy(id; a2, e, a1, t, attrs)
      (*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)
  • 43. From “scruffy” provenance to “valid” provenance
      - Are all possible temporal partial orderings of events equally acceptable?
      - How can we specify the set of all valid orderings?
      More generally, how do we formally define what it means for a set of provenance statements to be valid?
      PROV defines a set of temporal constraints that ensure consistency of a provenance graph.
  • 44. Exploiting provenance: why do my results differ from yours?
      • Run pipeline version V1: variant list VL1
      • V1 → V2: replace BWA version; modify Annovar configuration parameters
      • Run pipeline version V2: variant list VL2
      • VL1 vs VL2 ??
      • DDIFF (data differencing), PDIFF (provenance differencing)
      Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
  • 46. The corresponding provenance traces
      [Figure: provenance graphs for (i) Trace A and (ii) Trace B]
  • 47. Delta graph computed by PDIFF
      [Figure: delta graph pairing the nodes of the two traces; labelled differences include S1, S5 (service replacement), S2, S2v2 (version change) and S, Sv2 (version change)]
      PDIFF helps determine the impact of variations in the pipeline
  • 48. Provenance of Linked Open Data resources
      Goal: to establish an LD-compliant association between an LD resource and a description of its provenance
      • Where does the provenance of an LD resource live?
      • How can it be accessed?
      Why?
      1. To enable LD search and discovery
         • By indexing data by its provenance
         • Ex. “Find all resources for which Alice is an author which contain data derived from dataset D”
      2. To enable reasoning about quality/reliability of the LD resource
         • Predicates and rules over provenance
         • Ex. “if D has been derived from either {A,B,C} and Alice is one of the authors, then score  X”
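The search example on this slide ("find all resources for which Alice is an author which contain data derived from dataset D") can be sketched over a small in-memory set of provenance statements. The sketch follows wasDerivedFrom links transitively, which is an assumption made for the query and not something PROV mandates; all identifiers and function names are hypothetical.

```python
def derived_from(entity, target, was_derived_from):
    """True if target is reachable from entity along wasDerivedFrom links."""
    seen, stack = set(), [entity]
    while stack:
        current = stack.pop()
        for source in was_derived_from.get(current, ()):
            if source == target:
                return True
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return False


def resources_by_author_and_source(author, dataset, attributed_to, was_derived_from):
    """Resources attributed to `author` that derive, directly or not, from `dataset`."""
    return sorted(
        resource
        for resource, authors in attributed_to.items()
        if author in authors and derived_from(resource, dataset, was_derived_from)
    )


# wasAttributedTo(resource, agent) and wasDerivedFrom(entity, source), as dicts.
attributed_to = {
    "ex:r1": {"ex:Alice"},
    "ex:r2": {"ex:Bob"},
    "ex:r3": {"ex:Alice"},
}
was_derived_from = {
    "ex:r1": ["ex:tmp"],
    "ex:tmp": ["ex:D"],
    "ex:r2": ["ex:D"],
    "ex:r3": ["ex:other"],
}
hits = resources_by_author_and_source("ex:Alice", "ex:D", attributed_to, was_derived_from)
```

Only `ex:r1` satisfies both conditions: `ex:r2` derives from `ex:D` but is Bob's, and `ex:r3` is Alice's but derives from a different source. Over real Linked Data the same question would more naturally be posed as a query against the published provenance.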
  • 49. Provenance of Linked Open Data resources: how
      Three mechanisms:
      1. Provenance Access and Query (PROV-AQ), part of the W3C PROV recommendation suite
      2. Embedding provenance statements within the resource itself
         • E.g. the “Nanopublication” model
      3. Packaging data + provenance as a Research Object
  • 50. 1. Provenance pingback and query service
      Objective: to decouple publishing of content and of its provenance (as LOD)
      Scenario:
      • Publishers publish content resources and are not responsible for provenance
        • E.g. Mendeley, ResearchGate, etc.
      • Authors publish provenance and are not responsible for publishing content
      Image reproduced from: De Nies, Tom, Robert Meusel, Kai Eckert, Dominique Ritze, and Anastasia Dimou. “A Lightweight Provenance Pingback and Query Service for Web Publications.” In Procs. IPAW 2014. Cologne, Germany: Springer, 2014.
  • 51. 2. Provenance Embedding
      The nanopublication model is an example of provenance embedding within a published RDF document.
      From nanopub.org: A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author. Individual nanopublications can be cited by others and tracked for their impact on the community.
  • 52. Nanopublication: example
      Assertion: an “association” between a gene and a genetic disorder. The strength of this association is given by a statistical p-value. See nanopub.org for details.
      {
          : a nanopub:Nanopublication ;
              nanopub:hasAssertion :NanoPub_1_Assertion ;
              nanopub:hasProvenance :NanoPub_1_Provenance .
          :NanoPub_1_Provenance nanopub:hasAttribution :NanoPub_1_Attribution ;
              nanopub:hasSupporting :NanoPub_1_Supporting .
          :NanoPub_1_Assertion a nanopub:Assertion .
          :NanoPub_1_Provenance a nanopub:Provenance .
          :NanoPub_1_Attribution a nanopub:Attribution .
          :NanoPub_1_Supporting a nanopub:Supporting .
      }
      :NanoPub_1_Assertion {
          :Association_1 a sio:statistical-association ;
              sio:has-measurement-value :Association_1_p_value ;
              sio:refers-to ...
      }
      :NanoPub_1_Attribution {
          :pav:authoredBy res_a, reS_b.
          :NanoPub_1_Assertion pav:createdBy ...;
      }
      :NanoPub_1_Supporting {
          :Association_1 opm:wasDerivedFrom gene_disease_concept_profiles_1980_2010...;
              opm:wasGeneratedBy gene_disease_concept_profiles_matching_1980_2010; .
      }
  • 53. 3. Research Objects for data and provenance packaging
      Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.
      A Research Object is a combination of:
      • Aggregation (reusing Object Reuse and Exchange [ORE])
      • Annotation (reusing the Annotation Ontology [AO])
      • RO ontologies
      From the Wf4Ever EU project. See also: Belhajjame K, Corcho O, Garijo D, Zhao J, Missier P, Newman DR, Palma R, Bechhofer S et al.: Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In proceedings of the ESWC2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica2012), Heraklion, Greece, May 2012
  • 54. Links to resources cited in the talk
      • The PROV Data Model (PROV-DM): www.w3.org/TR/prov-dm/
      • A primer on PROV with a simple running example: http://www.w3.org/TR/prov-primer/
      • LD and PROV:
      • Nanopublications: nanopub.org
      • Research Objects: researchobject.org
      • The Wf4Ever project: www.wf4ever-project.org
      • PROV Access and Query conventions (PROV-AQ): http://www.w3.org/TR/prov-aq/
      • Visualising provenance using PROV-O-Viz: http://provoviz.org/
      • PROV-O-Viz video:
      • PROV-O-Viz IPAW’14 paper preprint: http://dare.ubvu.vu.nl/handle/1871/51388
      • Reference: Hoekstra, Rinke, and Paul Groth. “PROV-O-Viz - Understanding the Role of Activities in Provenance.” In Procs. IPAW 2014. Springer, 2014.

Editor's Notes

  1. Implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Azure: 10 L instances, 24h a day; 30 TB/year; 10 GB of SQL Azure space; 30-100 TB
  2. e-Science Central: integrates multiple runtime environments (R, Octave, Java, JavaScript, (Perl))
  3. Traditional variant callers: go through the whole genome to identify locations where a number of non-reference bases appear, to call SNPs; use gapped mapping to identify INDELs; use different algorithms to calculate SNP and INDEL likelihoods. GATK HaplotypeCaller: haplotype-based calling; calls SNPs and INDELs simultaneously by performing a local de-novo assembly; uses the same algorithm for SNP and INDEL likelihoods; artifacts caused by large INDELs are recovered by assembly.
  4. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  5. W3C Recommendation (REC) A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.
  6. remark on PROV-AQ: nothing to do with querying, but a query model can be associated to each of the encodings W3C Recommendation (REC) A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations. Working Group Note A Working Group Note is published by a chartered Working Group to indicate that work has ended on a particular topic. A Working Group may publish a Working Group Note with or without its prior publication as a Working Draft.
  7. Alice, a senior editor, produces draft V1 of a document after reading papers paper1 and paper2. V1 is for internal distribution only. Later Bob, who is the main editor and works for Alice, comments on the draft, producing a new document, draft comments.
  8. duality between elements (generation) and relations (wasGeneratedBy)
  9. baseline-noAgents.provn
  10. Most relations admit optional arguments (e.g. time). First-class arguments may be optional, too: for instance, generation with an implicit activity. Often only some combinations of arguments are legal.
  11. A single (real-world) artifact may correspond to several entities in a provenance model that includes descriptions of that artifact. The choice of mapping is determined by which characteristics of the artifact are relevant for a specific provenance description of it. Whenever one of these attributes changes, a new entity is created. Ex.: the document before and after editing; some characteristics that are relevant for provenance have changed.
  12. These entities are, however, related. These relationships can be expressed in PROV.
  13. ... and I could have bundles that refer to other bundles...
  14. Note: Provenance as publishable Linked Data is trivial…