1. On Answering Why-Not Queries Against Scientific
Workflow Provenance
Khalid Belhajjame
PSL Research University, Paris-Dauphine University, LAMSADE, Paris, 75016, France
khalid.belhajjame@dauphine.fr
July 13, 2018
Khalid Belhajjame (Paris-Dauphine) IRPb Workshop July 13, 2018 1 / 26
2. Context: Scientific Workflows
Scientific workflows have been
shown to facilitate and accelerate
scientific data exploration and
analysis in many areas of science,
including proteomics, metabolomics,
astronomy, and bio-medicine.
The figure on the right side
illustrates an example of a simple
workflow used for identifying the
pathways associated with a given
input metabolite (compound).
Given a compound identifier, the
first module returns a compound
name, which is used to feed the
second module to obtain the
corresponding pathway.
[Figure: example workflow. Workflow input port: compound_id. Modules: get_compound_info, extract_pathway_from_compounds_file. Workflow output port: output_pathways.]
3. Aim: Evaluating Why-Not Queries Against Workflow
Executions
Why-not queries help scientists understand why a given data item,
e.g., their favorite biological pathway, was not returned by the
workflow executions.
While answering such queries has been thoroughly investigated for
relational databases, only a few proposals examined their evaluation
in the context of scientific workflows.
Objective: To develop a solution for evaluating why-not queries
against workflows with black-box modules.
4. Related Work: Database (Querying) Land
Instance-based approaches attempt to find the data items in the inputs that are
responsible for the non-appearance of a given data item in the result.
Consider the example below (taken from Huang et al., VLDB 2008).
The query returns the schools in the state of California that are within the top 4
and have job openings.
The answer returned by the query is Stanford and its rank in the result.
Why-not query: Why does Berkeley not appear in the results?
What change shall I make to the source to obtain (Homer, 25) in the results?
If a potential tuple (berkeley, ca, yes) is inserted into the openings table,
Berkeley will become an answer.
5. Related Work: Database (Querying) Land
Module-based approaches attempt to identify the modules (sub-queries) that
are responsible for the non-appearance of a given data item in the
results.
In the previous example, there is only one join, which is
therefore responsible for the non-appearance of Berkeley in the result
set of the query.
6. Related Work: Workflow Land
The only proposal in this category for workflow provenance is the
Why-Not algorithm proposed by Chapman and Jagadish (2009).
With the Why-Not algorithm,
the user query is expressed as a set of atomic predicates that are
combined using AND and OR.
Chapman and Jagadish assume that the attributes of
the input datasets are preserved by the modules that compose the
workflow.
This assumption does not hold in general, however.
7. Related Work: Workflow Land
For example, the modules in the workflow
illustrated on the right do not preserve the
attribute of the input, viz. compound_id,
in that the outputs of the first and
second modules do not contain information
about the compound identifier.
In the work presented in this talk, we drop
the assumption made by Chapman and
Jagadish, and propose a solution that can
be used for answering why-not queries
for workflows with modules that do not
preserve attributes of the input datasets.
Furthermore, unlike the Why-Not
algorithm, which is module-based, our
proposal is hybrid in that it seeks to
answer both instance- and module-based
why-not queries.
[Figure: example workflow. Workflow input port: compound_id. Modules: get_compound_info, extract_pathway_from_compounds_file. Workflow output port: output_pathways.]
8. Foundations
Why-not query: A user specifies a why-not query by providing a
data item $d_{\text{why-not}}$ that has the same data type as the output of the
last module of the workflow and was not returned by the workflow
executions.
Module pickiness: Central to the evaluation of why-not queries is
the pickiness of the workflow's modules. A module M in a workflow is picky with
respect to a data item d if its inverse $M_{inv}$ does not accept d as
input. More specifically, $M_{inv}$ throws an illegal input exception when
it is fed d.
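The definition of pickiness can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the inverse is built from a small, finite lookup table, and the names `IllegalInputError`, `make_inverse`, `is_picky` and the compound/pathway values are all hypothetical, not part of the original work.

```python
class IllegalInputError(Exception):
    """Raised by an inverse module when it cannot accept a data item."""


def make_inverse(forward_table):
    """Build a toy inverse from a finite input -> output table (assumption)."""
    inverse_table = {}
    for module_input, module_output in forward_table.items():
        inverse_table.setdefault(module_output, []).append(module_input)

    def inverse(data_item):
        # The inverse rejects any data item that no known input produces.
        if data_item not in inverse_table:
            raise IllegalInputError(f"no input yields {data_item!r}")
        return inverse_table[data_item]

    return inverse


def is_picky(inverse, data_item):
    """A module is picky w.r.t. data_item if its inverse rejects data_item."""
    try:
        inverse(data_item)
    except IllegalInputError:
        return True
    return False


# Hypothetical compound-name -> pathway table, for illustration only.
pathway_inverse = make_inverse({"glucose": "glycolysis", "citrate": "TCA cycle"})
```

With this toy table, `is_picky(pathway_inverse, "urea cycle")` holds because no listed compound maps to that pathway, whereas `"glycolysis"` is accepted.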
9. Processing Why-Not Queries
The algorithm for processing why-not queries takes as input a data item
$d_{\text{why-not}}$ specified by the user.
To answer a why-not query, the modules of the workflow are explored from
the sink to the source in a breadth-first fashion. To do so, we group the
workflow modules into levels as illustrated in the figure below.
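One way to picture the grouping into levels is a breadth-first traversal that starts at the sink and follows the data links backwards. The sketch below assumes the workflow is given as a dictionary mapping each module to the modules that feed it directly; the function name `levels_from_sink` and the rule that a module reachable through several paths is placed at its shortest distance from the sink are assumptions made for the example.

```python
from collections import deque


def levels_from_sink(predecessors, sink):
    """Group modules into levels by breadth-first traversal from the sink.

    predecessors maps each module to the modules feeding it directly.
    Level 0 holds the sink; level k holds modules first reached after k hops.
    """
    levels = [[sink]]
    seen = {sink}
    frontier = deque([sink])
    while frontier:
        next_level = []
        # Process exactly the modules of the current level.
        for _ in range(len(frontier)):
            module = frontier.popleft()
            for upstream in predecessors.get(module, []):
                if upstream not in seen:
                    seen.add(upstream)
                    next_level.append(upstream)
                    frontier.append(upstream)
        if next_level:
            levels.append(next_level)
    return levels


# The two-module example workflow from the earlier slides.
example_workflow = {
    "extract_pathway_from_compounds_file": ["get_compound_info"],
    "get_compound_info": [],
}
example_levels = levels_from_sink(
    example_workflow, "extract_pathway_from_compounds_file"
)
```

For the example workflow this yields two levels, the sink module first and `get_compound_info` second.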
10. Processing Why-Not Queries
The modules of each level are examined in turn to determine whether each module is picky.
Specifically, the inverse of the module in question, M, is examined to check
which of the following holds:
1 It does not accept the corresponding data items that were generated
by the inverses of the modules in the previous level. In this case, M is picky.
2 It accepts the corresponding data items that were generated by the
inverses of the modules in the previous level.
In this case, the data items that the inverse of M produces are saved and
used to feed the inverses of the modules in the succeeding levels, if any.
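The level-by-level examination can be sketched as follows. This is a simplified illustration rather than the algorithm itself: it assumes one module per level (a linear chain), assumes that an inverse is available as a Python callable, and treats a module as picky when its inverse rejects every data item handed to it. All names and toy values are hypothetical.

```python
class IllegalInputError(Exception):
    """Raised by an inverse module that rejects a data item."""


def explain_why_not(inverses_sink_to_source, d_why_not):
    """Walk the inverses from sink to source.

    Returns ("picky", module_name) for the first module whose inverse rejects
    every data item handed to it, or ("missing_inputs", items) when the walk
    reaches the workflow inputs.
    """
    items = [d_why_not]
    for name, inverse in inverses_sink_to_source:
        produced = []
        for item in items:
            try:
                produced.extend(inverse(item))
            except IllegalInputError:
                continue  # this item is rejected; try the remaining ones
        if not produced:
            return ("picky", name)
        items = produced  # saved to feed the inverse at the next level
    return ("missing_inputs", items)


def table_inverse(output_to_inputs):
    """Toy inverse backed by a lookup table (assumption for the example)."""
    def inverse(item):
        if item not in output_to_inputs:
            raise IllegalInputError(item)
        return output_to_inputs[item]
    return inverse


# Toy chain mirroring the example workflow; the values are illustrative.
chain = [
    ("extract_pathway_from_compounds_file", table_inverse({"glycolysis": ["glucose"]})),
    ("get_compound_info", table_inverse({"glucose": ["C00031"]})),
]
```

Calling `explain_why_not(chain, "urea cycle")` reports the last module as picky, while `explain_why_not(chain, "glycolysis")` walks back to the workflow input and returns the input items that would have produced it.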
11. Identifying Picky Modules
To identify if a module M is picky, we need to invoke its inverse $M_{inv}$,
and check if it accepts the data items in question.
However, the inverse module rarely exists.
To overcome the non-existence of the inverse module, we can probe
the modules until we have the output we are after, or else fail and
deduce that the module in question is picky.
This is not a reasonable solution because the space of valid input
values of a module can be very large or even infinite. The problem is
exacerbated by the fact that a module may have multiple inputs,
therefore requiring the construction of all possible combinations for
probing.
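To make the cost of naive probing concrete, the sketch below enumerates every combination of input values for a module with several inputs. It assumes finite input domains (already optimistic, given that the space may be infinite); `probe_exhaustively` and `probe_count` are hypothetical helper names.

```python
from itertools import product
from math import prod


def probe_exhaustively(module, input_domains, target_output):
    """Try every combination of input values until target_output is produced.

    Returns the first matching combination, or None, in which case the module
    would be deemed picky with respect to target_output.
    """
    for combination in product(*input_domains):
        if module(*combination) == target_output:
            return combination
    return None


def probe_count(input_domains):
    """Number of invocations needed in the worst case."""
    return prod(len(domain) for domain in input_domains)


# Three inputs with 1000 candidate values each already require 10**9 probes.
worst_case = probe_count([range(1000)] * 3)
```

The worst-case count is the product of the domain sizes, so each additional input multiplies the number of module invocations.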
Is there a more reasonable solution... that at least allows us to probe
the modules using fewer inputs?
16. Identifying Picky Modules by Harvesting the Web
A solution that we explored consists of harvesting what is probably the
biggest source of information, namely the Web, using the information
extraction process illustrated below.
Indeed, many scientific modules provided
by major institutions, such as the EBI and DDBJ, can also be
invoked by users on the Web, and the traces
of those module invocations remain, in a number of cases, accessible
on the Web.
[Figure: the information extraction process for harvesting the Web]
If none of the candidate inputs is
found to be a true positive, then we
conclude that the module is likely to
be picky.
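The decision rule above can be sketched as a small probing loop. This is not the actual extraction pipeline: `search` and `extract_candidates` are hypothetical helpers standing in for the keyword search and the information extraction step, the default `top_k` is arbitrary, and the in-memory pages are toy data used only to make the example runnable.

```python
def likely_picky(module, target_output, search, extract_candidates, top_k=10):
    """Probe module only with candidate inputs harvested from search results.

    search(query, top_k) -> list of page texts (hypothetical helper).
    extract_candidates(page) -> candidate input values found in the page
    (hypothetical helper).
    Returns (False, input) for the first true positive, else (True, None).
    """
    for page in search(str(target_output), top_k):
        for candidate in extract_candidates(page):
            try:
                if module(candidate) == target_output:
                    return (False, candidate)  # true positive found
            except Exception:
                continue  # the module rejected this candidate
    return (True, None)


# Toy stand-ins for the Web and for a module (illustrative values only).
toy_pages = {"glycolysis": ["glucose glycolysis pathway", "fructose metabolism"]}


def toy_search(query, top_k):
    return toy_pages.get(query, [])[:top_k]


toy_module = {"glucose": "glycolysis"}.get
found = likely_picky(toy_module, "glycolysis", toy_search, str.split)
not_found = likely_picky(toy_module, "urea cycle", toy_search, str.split)
```

Here `found` reports a true positive input, whereas `not_found` flags the module as likely picky because no harvested candidate reproduces the requested output. The conclusion is only "likely": absence of a trace on the Web does not prove that no valid input exists.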
19. Feasibility Study
The approach we have just described raises the following question: Is
the proposed algorithm able to identify the reason why a given data
item does not appear in the workflow results? More specifically, how
effective is this solution in identifying picky modules and missing
input data items?
To answer the above questions, we ran a feasibility experiment, in
which we used a sample of 6 real-world workflows from the
myExperiment repository.
We selected workflows that involve deterministic modules, that is,
modules that deliver the same result (if any) given the same input.
We did not consider workflows that include modules performing data
mining operations, for instance.
We also selected workflows for which the inverse modules are
deterministic functions.
20. Feasibility Study
We executed each workflow using example data inputs provided
by the workflow authors.
We then specified two kinds of queries for each workflow:
Instance-based why-not query. To assess the ability of the algorithm to
answer this type of query, we randomly selected an output data
item d that was returned by the workflow executions. Next, we used
our algorithm to see if it is able to reconstruct the lineage of d by
harvesting the Web to identify the input data items that were
responsible for its derivation.
Module-based why-not query. This kind of query is used to assess if the
algorithm is able to identify picky modules.
In total we had 6 queries of the first kind, which we denote by
$\{q_1^+, \dots, q_6^+\}$, and 6 queries of the second kind, which we denote by
$\{q_1^-, \dots, q_6^-\}$.
21. Feasibility Study: Results
Of the queries $\{q_1^+, \dots, q_6^+\}$, our algorithm was able to successfully
construct the provenance of the why-not query up to the workflow
input for 3 queries.
Most of the modules composing these workflows, namely 8 out of 11,
provide information about the input and output datasets on the Web
using tabular formats.
After examination of the three remaining workflows, we found that
one of them utilizes proprietary data sources, the content of which is not
accessible on the surface web.
The last two workflows, on the other hand, contain modules that
manipulate excerpts from HTML web pages. Because of this, our
algorithm was not able to find the inputs and outputs of those modules
on the Web.
22. Feasibility Study: Results
We also measured the number of top-k web pages that needed to be
examined to identify the input data item corresponding to a given
output data item. On average, we needed to examine the content of
the top 4 web pages returned by the keyword search engine (we used
the Google search engine for our experiment).
In several cases, however, the top web page was the right one, in the
sense that it contained the input data item we were after.
23. Feasibility Study: Results
Regarding the queries $\{q_1^-, \dots, q_6^-\}$, our algorithm was more
successful in the sense that it was able to correctly identify 4 picky
modules out of 6.
For the two remaining workflows, the module that was identified as picky
by our algorithm was not the correct one. After examination, it
transpired that for certain modules the corresponding data item could
not be found on the Web.
Again, this issue was due to shim modules, whose input and output data
items are not published on the Web.
24. Conclusions
To sum up, this small feasibility study has shown that our method is
promising.
It has also brought some insights into the way our solution can be
improved.
Our ongoing work includes: (i) tuning our algorithm to deal with
shim modules in a workflow, (ii) exploring new sources of information
for identifying picky modules, and (iii) running an experiment involving a large
number of scientific workflows.
25. References
K. Belhajjame (2018)
On Answering Why-Not Queries Against Scientific Workflow Provenance
Proceedings of EDBT, Open Proceedings, 465–468.
N. Bidoit, M. Herschel, and K. Tzompanaki (2014)
Why not?
Proceedings of EDBT, Open Proceedings, 145–156.
A. Chapman and H. V. Jagadish (2009)
Why not?
Proceedings of SIGMOD, ACM, 523–534.
J. Huang, T. Chen, A. Doan, and J. F. Naughton (2008)
On the provenance of non-answers to queries over extracted data
Proceedings of VLDB, ACM, 736–747.