Provenance models are crucial for describing experimental results in science. The W3C Provenance Working Group has recently released the PROV family of specifications for provenance on the Web. While provenance focuses on what was executed, it is also important in science to publish the general methods that describe scientific processes at a more abstract level. In this paper, we propose P-PLAN, an extension of PROV that represents the plans that guided an execution and their correspondence to the provenance records that describe the execution itself. We motivate and discuss the use of P-PLAN and PROV to publish scientific workflows as Linked Data.
1. Augmenting PROV with Plans in P-PLAN:
Scientific Processes as Linked Data
Daniel Garijo
OEG-DIA, Facultad de Informática
Universidad Politécnica de Madrid
dgarijo@delicias.dia.fi.upm.es

Yolanda Gil
Information Sciences Institute and Department of Computer Science
University of Southern California
http://www.isi.edu/~gil
USC Information Sciences Yolanda Gil gil@isi.edu 1
2. W3C PROV
http://www.w3.org/2011/prov/
3. A Workflow Execution
in PROV
Benefits:
• Makes the work
inspectable
Shortcomings:
• Hard to reproduce
• Not efficient to reuse
5. Replication of Crohn’s Disease Association
Study from [Duerr et al, Science 06]
6. Replication of Early-Onset Parkinson’s Disease
Study from [Bayrakli et al, Human Mutation 07]
7. Reusability
Lower cost
• “Scientists and engineers spend more than
60% of their time just preparing the data
for model input or data-model
comparison” (NASA A40)
Better quality
• “We write QC without thinking about the
best way to do the WC. Such approaches
perpetuate mediocrity. If someone did it
right once, it would benefit many people.”
(EC WF CQ)
More efficient
• “I often see that I’m repeating the work
that 100 other people have been doing to
obtain and process the data.” (EC WF CQ)
8. Access to Data Analytics Expertise [Science 2011]
9. The TB-Drugome [Kinnings et al., PLoS CompBio 2010]
“We report a computational
approach to construct a
drug-target network…
applied to the genome of
tuberculosis…”
“The TB-drugome reveals
that approximately one-
third of the drugs examined
have the potential to… treat
tuberculosis…”
“The methodology can be
applied to other pathogens
of interest …”
10. Executable and Abstract Workflow
Executable workflow: what I actually run
Abstract workflow: the method that I followed
11. The Ontology for Biomedical Investigations
http://obi-ontology.org/
12. Semantic Web Applications in Neuromedicine
(SWAN) Ontology http://www.w3.org/TR/hcls-swan/
14. Executable and Abstract Workflow
Executable workflow: what I actually run
Abstract workflow: the method that I followed
15. Semantic Workflows in Wings
[Gil et al 10][Gil et al 09][Kim & Gil et al 08][Kim et al 06]
Workflows are augmented with
semantic constraints
• Each workflow constituent has a
variable associated with it
– Workflow components, arguments,
datasets
• Constraints are used to restrict
workflow variables
• Can define abstract classes of
components
– Concrete components model exec. codes
Workflow reasoners propagate and
use semantic constraints
Uses semantic web standards:
OWL/RDF, SPARQL, rules
16. Ontologies for Data and Workflow Components
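[Figure: ontology diagrams for data and for workflow components. Node labels recoverable from the slide: Documents, Language, En, Fr, Plain text, Markup, InDoc, htmlDoc, latexDoc, Feature Vector, Model, Size, WSJ-2010, Correlation Scoring, ChiSq, InfoGain, MutInfo, Modeler, DecTree Modeler, Dec Tree, Linear Regression, SVM, C4.5, J48, Weka-C4.5, MatLab_LR, R_LR.]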
17. Semantic Workflows: Abstractions Based on
Ontologies [Gil et al 2011]
[Figure: abstract component classes paired with concrete codes: Term Weighting with the TF-IDF code, and Correlation Scoring with the Chi Squared code.]
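The pairing of abstract component classes with concrete codes on this slide can be illustrated with a minimal sketch. It is not how Wings implements specialization: the `implementations` table, the `specialize` helper, and the two-step workflow order are assumptions made for illustration, and only the two pairings named on the slide are used.

```python
# Abstract component class -> concrete components (codes) that implement it.
# Only the two pairings named on the slide are listed here.
implementations = {
    "Term Weighting": ["TF-IDF"],
    "Correlation Scoring": ["Chi Squared"],
}

# Hypothetical abstract workflow: an ordered list of abstract steps.
abstract_workflow = ["Term Weighting", "Correlation Scoring"]


def specialize(abstract_steps, choices):
    """Replace each abstract step with its first known concrete component."""
    concrete = []
    for step in abstract_steps:
        options = choices.get(step, [])
        if not options:
            raise KeyError(f"no concrete component for {step!r}")
        concrete.append(options[0])
    return concrete


executable_workflow = specialize(abstract_workflow, implementations)
print(executable_workflow)
```

The abstract list plays the role of the method that was followed, while the specialized list plays the role of what was actually run.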
18. Publishing Workflows on the Web with OPMW
http://www.opmw.org
Extension of the Open Provenance Model (OPM)
[Figure: OPMW diagram. Red: OPM model. Black: OPMW profile (extension). Panel labels: Workflow Template, Execution Results. Node labels recoverable from the slide: Workflow template, Abstract component, Specific component, Template Node, Artifact, Process, Agent, Account, OPM Graph, Execution Node, Execution Artifact. Edge labels: used, wasGeneratedBy, wasControlledBy, account, subClassOf, hasArtifact, hasProcess, hasAbstractComponent, hasSpecificComponent, hasArtifactTemplate, hasProcessTemplate, hasWorkflowTemplate.]
19. Published as Linked Data: Executed Workflow
+ Abstract Workflow + Data + Steps + Codes…
20. P-PLAN: Extending PROV to represent
plans
Plan representations can be very complex
• Iteration, conditionals, decomposition, etc.
P-PLAN is a core representation with only:
• Sequences of steps
• Parallel steps
P-PLAN, like PROV, is a DAG
• Simplest representation of plans
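A minimal sketch of what "sequences of steps plus parallel steps in a DAG" means in practice follows. It does not use the P-PLAN vocabulary itself: the step names and the `preceded_by` table are hypothetical, and the level-by-level ordering is just one common way to read such a plan.

```python
from collections import defaultdict

# Hypothetical toy plan: each step lists the steps that must precede it.
# Step names are illustrative only and do not come from the slides.
preceded_by = {
    "load": [],
    "clean": ["load"],
    "stats": ["clean"],   # "stats" and "plot" both follow "clean",
    "plot": ["clean"],    # so they may run in parallel
    "report": ["stats", "plot"],
}


def execution_levels(preceded_by):
    """Group steps into levels; steps in the same level can run in parallel.

    Raises ValueError if the plan has a cycle, i.e. it is not a DAG.
    """
    remaining = {step: len(preds) for step, preds in preceded_by.items()}
    followers = defaultdict(list)
    for step, preds in preceded_by.items():
        for pred in preds:
            followers[pred].append(step)

    ready = sorted(s for s, n in remaining.items() if n == 0)
    levels, done = [], 0
    while ready:
        levels.append(ready)
        done += len(ready)
        next_ready = []
        for step in ready:
            for follower in followers[step]:
                remaining[follower] -= 1
                if remaining[follower] == 0:
                    next_ready.append(follower)
        ready = sorted(next_ready)
    if done != len(preceded_by):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return levels


levels = execution_levels(preceded_by)
print(levels)  # [['load'], ['clean'], ['plot', 'stats'], ['report']]
```

Iteration and conditionals have no place in this structure, which is why a plan limited to sequences and parallel branches stays easy to check and to order.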
22. Queries about Workflows Published as
Linked Data
Find all abstract workflows (?plan) in which a
given entity (?entity) has been used when
executing them
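PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX p-plan: <http://purl.org/net/p-plan#>
# Bind ?entity to the entity of interest before running the query.
SELECT DISTINCT ?plan WHERE {
  ?entity a p-plan:Entity, prov:Entity ;
          p-plan:correspondsTo ?templVariable .
  ?templVariable a p-plan:Variable ;
                 p-plan:isVariableOfPlan ?plan .
}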
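To make the join in this query easier to follow, here is a small sketch that evaluates the same pattern over an in-memory set of triples without a SPARQL engine. The property and class names are the ones used in the query; every `ex:` resource and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical triples; the "ex:" resources are invented for illustration.
triples = {
    ("ex:dataset1", "rdf:type", "p-plan:Entity"),
    ("ex:dataset1", "rdf:type", "prov:Entity"),
    ("ex:dataset1", "p-plan:correspondsTo", "ex:inputVar"),
    ("ex:inputVar", "rdf:type", "p-plan:Variable"),
    ("ex:inputVar", "p-plan:isVariableOfPlan", "ex:planA"),
    # ex:dataset2 is not typed as p-plan:Entity, so the pattern never matches it.
    ("ex:dataset2", "rdf:type", "prov:Entity"),
    ("ex:dataset2", "p-plan:correspondsTo", "ex:otherVar"),
    ("ex:otherVar", "rdf:type", "p-plan:Variable"),
    ("ex:otherVar", "p-plan:isVariableOfPlan", "ex:planB"),
}


def objects(subject, predicate):
    """Return all objects o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def plans_using(entity):
    """Mirror the query: entity -> template variable -> plan."""
    if not {"p-plan:Entity", "prov:Entity"} <= objects(entity, "rdf:type"):
        return set()
    plans = set()
    for variable in objects(entity, "p-plan:correspondsTo"):
        if "p-plan:Variable" in objects(variable, "rdf:type"):
            plans |= objects(variable, "p-plan:isVariableOfPlan")
    return plans


print(plans_using("ex:dataset1"))  # {'ex:planA'}
```

Returning a set plays the role of `SELECT DISTINCT`: a plan reached through several variables is reported once.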
23. Conclusions
Linked data as a vehicle to publish science processes
• Workflows, experiments, …
Important to publish method, not just provenance
• Reproducibility, efficiency, access to expertise
W3C PROV useful to publish execution
P-PLAN is an extension of PROV for publishing methods
• Plan, step, variable
P-PLAN is applicable beyond science