Scripting languages like Python, R, andMATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have been recently proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statement, and do so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on specific parts of provenance relevant for analyses. Toward this goal, we advocate that fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
2. Scientists process and analyze their datasets using
scripts.
Retrospective provenance information can be useful for
scientists to analyze of script execution
E.g., noWorkflow [Murta et al., IPAW’14]
While useful, the amount of information such fine-
grained traces contain can be overwhelming for end
users.
There is a need for abstraction techniques that focus the attention
of users on the provenance information relevant for their analyses.
TaPP'15 2
3. Workflow-Oriented Proposals
Zoom*UserViews [Biton et al., ICDE’08]:Workflow views are used for
abstracting the retrospective provenance of (potentially complex) workflows
LabelFlow [Alper et al., IPAW’14]: Both (prospective and retrospective)
workflow provenance are summarized using user annotations and reduction
rules.
Script-Oriented Proposals
YesWorkflow [McPhillips et al.,TaPP’15]:We have seen in the presentation by
B. Ludäscher, how a URI scheme can be used to reconstruct part of the
retrospective provenance, without recording it at run time.
This scheme is useful when the data products used and generated,
including intermediate ones, are stored within the file system.
We adopt a similar approach as Zoom*UserViews and apply it to scripts:
We use the workflow specifications (i.e., prospective provenance) extracted by
YesWorkflow to abstract the retrospective provenance captured by noWorkflow
TaPP'15 3
8. Variable names are not reliable IDs.
Two different variables may have the same name within a
script, e.g., within the scope of different functions
noWorkflow tracks line numbers in the script to identify the
variable
YesWorkflow does not provide this information
However, it provides the line numbers of the start and end of a
block, when requested.
Using the two pieces of information, we are able to connect data
values in noWorkflows to their correspondingYesWorkflow
variables.
TaPP'15 8
9. The variables inYesWorkflows are associated with user
annotations
Such annotations can be used to provide more information
about the data values within noWorkflow retrospective
provenance
We note here that it is possible inYesWorkflow to specify
variables that are not mapped to any variables within the
script
TaPP'15 9
10. -
d
d
e
d
r
y
y
e
-
1 # @begi n model
2 # @i n a @as past Temper at ur eDat a
3 # @i n b @as past Pr eci pi t at i onDat a
4 # @out dat a @as si mul at edWeat her
5 i f ( sum( a) / l en( a) ) < 0:
6 dat a = model 1( a, b)
7 el se:
8 dat a = model 2( a, b)
9 # @end model
Figure3. Annotating anif-then-elsecontrol block using YesWork-
flow
model
pastTemperatureData pastPrecipitationData
simulatedWeather
e
t
d
e
)
n
-
-
-
-
d
s
s
n
-
a
d
-
e
t
s
s
24 ## @begi n mai n
25 # @i n dat a1. dat @as t emper at ur eDat aFi l e @URI
f i l e: t emp. dat
26 # @i n dat a2. dat @as pr eci pi t at i onDat aFi l e @URI
f i l e: pr eci p. dat
27 # @out out put . png @as pl ot
28
29 ## @begi n r ead_f i l e_1
30 # @i n dat a1. dat @as t emper at ur eDat aFi l e @URI
f i l e: t emp. dat
31 # @out a @as past Temper at ur eDat a
32 a = r ead_f i l e( " dat a1. dat " )
33 ## @end r ead_f i l e_1
34
35 ## @begi n r ead_f i l e_2
36 # @i n dat a2. dat @as pr eci pi t at i onDat aFi l e @URI
f i l e: pr eci p. dat
37 # @out b @as past Pr eci pi t at i onDat a
38 b = r ead_f i l e( " dat a2. dat " )
39 ## @end r ead_f i l e_2
40
41 i f sum( a) / l en( a) < 0:
42 ## @begi n model _1
43 # @i n a @as past Temper at ur eDat a
44 # @i n b @as past Pr eci pi t at i onDat a
45 # @out dat a @as si mul at edWeat her
46 dat a = model 1( a, b)
47 ## @end model _1
48 el se:
49 ## @begi n model _2
50 # @i n a @as past Temper at ur eDat a
51 # @i n b @as past Pr eci pi t at i onDat a
52 # @out dat a @as si mul at edWeat her
53 dat a = model 2( a, b)
54 ## @end model _2
55
56 ## @begi n ext r act _t emper at ur e
model_1 Model_2
simulatedWeather
TaPP'15 10
11. We found gaps in the noWorkflow dependency graph (when
exported to prolog)
Function returns do not always link back to correct
variable.
An object that is modified (e.g via a list.append call) is not
always captured in the dependency graph
We also observed that noWorkflow does not capture
information about the name of the script in the retrospective
provenance
This is required for uniquely identifying variables.
TaPP'15 11
12. An object may be nested, i.e., it may have
other objects as is children.
Any change to the child object is attributed to
the parent object.
noWorkflow attributes any changes to any
child object to the ultimate parent.
TaPP'15 12
16. Q1: find the temperature file used.
fileName(N) :-
map(V,A),
A=“temperatureDataFile”,
iData(V,N).
Q2: show how the temperature file was used by.
fileUsedBy(X,Y) :-
iDepAbs(X,Y),
map(X,A),
A=“temperatureDataFile”.
fileUsedBy(X,Y) :-
fileUsedBy(X,Z).
iDepAbs(Z,Y).
TaPP'15 16
17. We have implemented an approach for linking the retrospective
provenance of script to a more abstracted and user defined
prospective provenance.
This solution is complementary to theYesWorkflow solution for
capturing retrospective provenance of data files [1]
The issues that we came across usingYesWorkflow and
noWorkflow were communicated to the development teams of
these tools to address them
Linking prospective and retrospective provenance for multiple
scripts as opposed to a single script
Designers may organize the implementation their data analyses
into multiple scripts.
[1]T. M. McPhillips et al. Retrospective provenance without a run-time provenance recorder. InTapp, 2015
TaPP'15 17
Despite the popularity of scientific workflows as a means to specify and enact in silico experiments, the majority of scientists still process and analyze their datasets using scripts.
To assist scientists in the analysis of script executions, retrospective provenance information specifying, amongst other things, the lineage of script results and the transformations they underwent, can be useful in, e.g., assessing the validity of the hypothesis the scientist is investigating.
While useful, the amount of information such fine-grained traces contain can be overwhelming for end users. This demonstrates the need for abstraction techniques that focus the attention
of the user on the provenance information that is relevant for her analysis. We stress here that recording fine-grained, statement based provenance can be crucial for understanding the lineage of scripts results. However, we argue that in order for such provenance information to be readily accessible and useful for end users, it needs to be abstracted and filtered to match the questions users want answered. Indeed, end users are often interested in the lineage of a small subset of the data artifacts manipulated and generated by the script. Moreover, by and large, they are not keen on examining the transformations such data artifacts underwent at the statement level.
Linking identifiers. YesWorkflow generates provenance using identifiers from the user-defined annotations, but noWorkflow captures provenance using the script’s variable names. While YesWorkflow does have methods to specify the link between the variable name and the user-defined identifier, it does not check that the annotation references an actual variable.
How about the control flow at the level of the inputs. Should it be supported? --khalid
Data dependencies. There can be gaps in the noWorkflow dependency graph (when exported to prolog). Specifically, function returns do not always link back to correct variable. Furthermore, an object that is modified (e.g via a list.append call) is not always captured in the dependency graph.
No script name. Same variable names may be used in different scripts. The script names along with the line numbers would be required to unambiguously identify a variable. Currently, noWork- flow does not capture the script name.
The change is not captured at the lowest level (finest objects).
Add the reaf_file function script to the slide.
map(X; Y): X is the variable name and Y is the user defined annotation.
iDep(X,Y): Y depends on X in the retrospective provenance.
nodeInt: node of interest
(5) And (6) Transitive reduction
wData(X,V): X is the identifier or the variable name and V is the value of the annotation
Note that: I am mixing variable and identifier here. Not a good thing, but, I had to save space. I can take care of that in the Journal version.
The approach proposed in [1] does not incur any cost and is specific to data files. If the user want to explore the lineage of data artifacts at a granularity that is finer than data files or data artefacts that are not stored within the file system, then the approach presented in this paper can be used.