Invited talk @ DCC09 workshop
1. Research objects, myExperiment, and Open Provenance for collaborative E-science
REPRISE workshop - IDCC’09
[Logos on slide: Scientific Workflow Management System; Janus Provenance]
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble,
e-Labs design group, University of Manchester
IDCC’09, London - P.Missier
4. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
• Ongoing debate in several communities
– Clinical trials [1]
– Earth Sciences: ESIP, data preservation / stewardship, 2009
– Long established in some communities: Atmospheric sciences, 1998 [2]
• Science Commons recommendations for Open Science
– Open Science recommendations from Science Commons (July 2008) [link]
The Toronto group: Toronto International Data Release Workshop Authors, "Prepublication data sharing", Nature 461, 168-170 (10 September 2009) | doi:10.1038/461168a; published online 9 September 2009
http://www.nature.com/news/specials/datasharing/index.html
12. Collaboration through data
What is needed for B to make sense of A’s data?
1. Packaging:
– standards for self-descriptive data + metadata bundles: Research Objects
2. Content:
– data format standardization efforts
– metadata representation
• process provenance
– workflow provenance
3. Container:
– a repository for Research Objects
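The three requirements above (packaging, content, container) can be pictured with a minimal Python sketch. It is an illustration only: `ResearchObject`, `Repository`, their fields and the sample file names are hypothetical, and are not the Research Object model or the myExperiment API.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchObject:
    """Hypothetical bundle: data items plus the metadata needed to read them."""
    title: str
    data: dict = field(default_factory=dict)        # name -> payload or location
    formats: dict = field(default_factory=dict)     # name -> declared data format
    provenance: list = field(default_factory=list)  # (output, process, input) records

    def add(self, name, payload, fmt):
        """Packaging: store a data item together with its format metadata."""
        self.data[name] = payload
        self.formats[name] = fmt

    def undescribed(self):
        """Names of data items with no declared format, which B could not interpret."""
        return sorted(n for n in self.data if not self.formats.get(n))


class Repository:
    """Hypothetical container: keeps Research Objects under an identifier."""

    def __init__(self):
        self._store = {}

    def deposit(self, identifier, ro):
        # Refuse bundles that are not self-descriptive.
        missing = ro.undescribed()
        if missing:
            raise ValueError(f"missing format metadata: {missing}")
        self._store[identifier] = ro

    def fetch(self, identifier):
        return self._store[identifier]


ro = ResearchObject(title="A's experiment")
ro.add("pathways.csv", "gene,pathway\n", fmt="text/csv")
ro.provenance.append(("pathways.csv", "get_pathways", "genes.txt"))

repo = Repository()
repo.deposit("ro-1", ro)
```

B would call `repo.fetch("ro-1")` and find the data, its format and its provenance in one place.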
17. Paul’s Pack
[Figure, built up over several slides: Paul’s Research Object. Items shown: QTL, Workflow 16, Workflow 13, Results, Logs, Slides, Paper, Common pathways. Relation labels between items: produces, included in, feeds into, published in. Layers labelled on successive builds: Representation, Domain Relations, Aggregation, Metadata.]
23. ORE: representing generic aggregations
[Figure labels: Resource Map (descriptor); Data structure]
http://www.openarchives.org/ore/1.0/primer.html, section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations:
Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for
Information Science and Technology (JASIST), to appear, 2009.
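
As a rough sketch of the Resource Map idea, the aggregation can be written as plain triples in Python. The predicate names `describes` and `aggregates` and the class `Aggregation` come from the ORE vocabulary; the `example.org` URIs, the helper functions and the choice of members are assumptions made for illustration.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map(rem_uri, aggregation_uri, members):
    """Return triples for a Resource Map that describes one aggregation."""
    triples = [
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One ore:aggregates triple per aggregated resource.
    triples += [(aggregation_uri, ORE + "aggregates", m) for m in members]
    return triples


def aggregated(triples, aggregation_uri):
    """List the resources that the given aggregation aggregates."""
    return [o for s, p, o in triples
            if s == aggregation_uri and p == ORE + "aggregates"]


# Hypothetical URIs standing in for the items of a Research Object.
pack = "http://example.org/packs/1"
rem = resource_map(
    "http://example.org/packs/1.rdf",
    pack,
    ["http://example.org/workflows/16", "http://example.org/files/results"],
)
```

Domain relations such as "produces" would be added as further triples between the aggregated resources; they are left out here.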
27. Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Figure: example workflow with nodes lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes]
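A minimal sketch of what such a trace could record, in Python. The task names echo the figure, but the functions, the lookup table and the record layout are stand-ins invented for this example; this is not how Taverna captures provenance.

```python
trace = []  # one record per executed task


def run_task(name, func, **inputs):
    """Run a task and record what it consumed and what it produced."""
    output = func(**inputs)
    trace.append({"task": name, "inputs": inputs, "output": output})
    return output


def get_pathways_by_genes(gene_id):
    # Stand-in lookup table; a real workflow would call a pathway service.
    table = {"g1": ["p1", "p2"], "g2": ["p2"]}
    return table.get(gene_id, [])


def merge_pathways(lists):
    # Union of all pathway ids, without duplicates.
    return sorted({p for pathway_list in lists for p in pathway_list})


per_gene = [
    run_task("get_pathways_by_genes", get_pathways_by_genes, gene_id=g)
    for g in ["g1", "g2"]
]
merged = run_task("merge_pathways", merge_pathways, lists=per_gene)
```

After the run, `trace` holds three records: two invocations of the lookup task and one merge.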
28. Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
29. What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
[Figure: example workflow with nodes lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes]
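Two of these questions can be sketched over a flat list of trace records in Python. The record layout `(output, process, inputs)` and all values are assumptions for illustration, not an actual provenance schema.

```python
from collections import Counter

# Each record: (output, process, inputs). All values are made-up examples.
records = [
    ("pathways_g1", "get_pathways_by_genes", ["g1"]),
    ("pathways_g2", "get_pathways_by_genes", ["g2"]),
    ("merged", "merge_pathways", ["pathways_g1", "pathways_g2"]),
]


def upstream(item, records):
    """Causal question: all data items that `item` transitively depends on."""
    produced_from = {out: ins for out, _, ins in records}
    seen, stack = set(), [item]
    while stack:
        for parent in produced_from.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def executions(records):
    """Analytics question: how often has each process been executed?"""
    return Counter(process for _, process, _ in records)
```

`upstream("merged", records)` answers "which genes does this result come from", and `executions(records)` answers "how often has my favourite service been executed".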
30. Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments
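Goal: standardize causal dependencies to enable provenance metadata exchange
[Figure: the two edge types, artifact A wasGeneratedBy(R) process P and process P used(R) artifact A, with R a role; followed by an example graph over artifacts A1-A4 and processes P1-P3 with edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6)]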
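The two edge types can be sketched as a small Python data structure. The `Edge` type, the helper function and the graph topology below are assumptions for illustration; the topology is not the graph drawn on the slide.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    kind: str    # "used" or "wasGeneratedBy"
    effect: str  # source of the edge (the effect)
    cause: str   # target of the edge (the cause)
    role: str


# Hypothetical chain: A1 -> P1 -> A2 -> P2 -> A3 (read in execution order).
edges = [
    Edge("used", "P1", "A1", "R1"),
    Edge("wasGeneratedBy", "A2", "P1", "R2"),
    Edge("used", "P2", "A2", "R3"),
    Edge("wasGeneratedBy", "A3", "P2", "R4"),
]


def candidate_sources(artifact, edges):
    """Artifacts used by the process that generated `artifact`.

    used + wasGeneratedBy only shows that a derivation is possible,
    so these are candidates, not confirmed derivations.
    """
    processes = {e.cause for e in edges
                 if e.kind == "wasGeneratedBy" and e.effect == artifact}
    return {e.cause for e in edges
            if e.kind == "used" and e.effect in processes}
```

Edges point from effect to cause, so following them from an artifact walks backwards through its history.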
31. The 3rd provenance challenge
• Chosen workflow from the Pan-STARRS project
– Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal:
– demonstrate “provenance interoperability” at query level
35. OPM and query-interoperability
[Figure, built up over two slides. Team A: encode W as WA, run WA to obtain prov(WA), execute query Q to obtain Q(prov(WA)), export OPM(prov(WA)). Team B: import, giving PWA = import(OPM(prov(WA))), then execute query Q to obtain Q(PWA). On the final build a "?" sits between Team A's and Team B's sides.]
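The round trip in the figure can be sketched in Python as a check that Q gives the same answer on both sides. The JSON document is a stand-in for an OPM serialization, and the two "native" stores, the function names and the sample edges are all hypothetical.

```python
import json


def export_opm(native_a):
    """Team A: turn its native edge list into a neutral exchange document."""
    return json.dumps(
        [{"kind": k, "effect": e, "cause": c} for k, e, c in native_a]
    )


def import_opm(document):
    """Team B: load the exchange document into its own, different store."""
    store = {}
    for edge in json.loads(document):
        store.setdefault((edge["kind"], edge["effect"]), set()).add(edge["cause"])
    return store


def query_a(native_a, process):
    """Q on Team A's store: which artifacts did `process` use?"""
    return {c for k, e, c in native_a if k == "used" and e == process}


def query_b(native_b, process):
    """The same Q, written against Team B's store."""
    return native_b.get(("used", process), set())


prov_wa = [
    ("used", "P1", "A1"),
    ("used", "P1", "A2"),
    ("wasGeneratedBy", "A3", "P1"),
]
pwa = import_opm(export_opm(prov_wa))
interoperable = query_a(prov_wa, "P1") == query_b(pwa, "P1")
```

Interoperability at query level means `interoperable` holds for every query in the agreed set, not just for this one.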
43. Additional requirements
• Artifact values require uniform common identifier scheme
– each group used artifacts to refer to its own data results
– but those results were expressed using proprietary naming conventions
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
44. Query results as OPM graphs
[Figure, built up over several slides: encode W as WA, run WA to obtain prov(WA), execute query Q to obtain Q(prov(WA)). The export step first shows the whole trace, OPM(prov(WA)); on the final builds it shows only the query result, OPM(Q(prov(WA))).]
- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
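
A rough Python sketch of exporting only a query result: Q is taken to be "the lineage of one artifact", and only the edges that Q touches are kept. The edge layout `(kind, effect, cause)` and the sample trace are assumptions; this is not Taverna 2.1's provenance database or its query language.

```python
def lineage_edges(edges, target):
    """Edges reachable from `target` by following effect -> cause links."""
    by_effect = {}
    for edge in edges:
        by_effect.setdefault(edge[1], []).append(edge)
    result, seen, stack = [], set(), [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for edge in by_effect.get(node, []):
            result.append(edge)
            stack.append(edge[2])
    return result


# Hypothetical trace with two unrelated branches.
full_trace = [
    ("wasGeneratedBy", "A3", "P1"),
    ("used", "P1", "A1"),
    ("wasGeneratedBy", "A9", "P7"),
    ("used", "P7", "A8"),
]
exported = lineage_edges(full_trace, "A3")
```

The exported sub-graph is itself a set of causal edges, so it can be shipped in the same form as a full trace while leaving out the branch that the query never visits.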
52. Full-fledged data-mediated collaborations
[Figure: experiment A packages workflow A + input A, result datasets A and provenance A into a Research Object; experiment B packages workflow B + input B, result datasets B and provenance B into another. The two are linked by result A → input B.]
53. Full-fledged data-mediated collaborations
[Figure: a single Research Object holding workflow A + input A, workflow B + input B, the link result A → input B, result datasets for A, B and A+B, and provenance A+B.]
Provenance composition accounts for implicit collaboration
Aligned with focus of upcoming Provenance Challenge 4: “connect my provenance to yours" into a whole OPM provenance graph.
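Provenance composition can be sketched in Python as a union of two edge sets once the shared data item has a common identifier. The edge layout, the node names and the `same_as` mapping are assumptions for illustration.

```python
def compose(prov_a, prov_b, same_as):
    """Union of two edge sets after renaming B's identifiers via `same_as`."""
    def rename(node):
        return same_as.get(node, node)

    merged = set(prov_a)
    merged |= {(kind, rename(effect), rename(cause))
               for kind, effect, cause in prov_b}
    return merged


def ancestors(edges, node):
    """Everything `node` causally depends on, following effect -> cause."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for _, effect, cause in edges:
            if effect == current and cause not in found:
                found.add(cause)
                stack.append(cause)
    return found


prov_a = {
    ("used", "workflowA", "inputA"),
    ("wasGeneratedBy", "resultA", "workflowA"),
}
prov_b = {
    ("used", "workflowB", "inputB"),
    ("wasGeneratedBy", "resultB", "workflowB"),
}
# result A → input B: B's input is the same data item as A's result.
whole = compose(prov_a, prov_b, same_as={"inputB": "resultA"})
```

On B's graph alone the history of `resultB` stops at `inputB`; on the composed graph it reaches back to `inputA`, which is why a common identifier scheme for artifacts matters.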