Irpb workshop

On Answering Why-Not Queries Against Scientific
Workflow Provenance
Khalid Belhajjame
PSL Research University, Paris-Dauphine University, LAMSADE, Paris, 75016, France
khalid.belhajjame@dauphine.fr
July 13, 2018
Khalid Belhajjame (Paris-Dauphine), IRPb Workshop, July 13, 2018

Context: Scientific Workflows
Scientific workflows have been
shown to facilitate and accelerate
scientific data exploration and
analysis in many areas of science,
including proteomics, metabolomics,
astronomy, and biomedicine.
The figure on the right side
illustrates an example of a simple
workflow used for identifying the
pathways associated with a given
input metabolite (compound).
Given a compound identifier, the
first module returns a compound
name, which is used to feed the
second module to obtain the
corresponding pathway.
[Figure: example workflow. Workflow input port: compound_id. Modules: get_compound_info, extract_pathway_from_compounds_file. Workflow output port: output_pathways.]

Aim: Evaluating Why-Not Queries Against Workflow Executions
Why-not queries help scientists understand why a given data item,
e.g., their favorite biological pathway, was not returned by the
workflow executions.
While answering such queries has been thoroughly investigated for
relational databases, only a few proposals examined their evaluation
in the context of scientific workflows.
Objective: To elaborate a solution for evaluating why-not queries
against workflows with black-box modules.

Related Work: Database (Querying) Land
Instance-based approaches attempt to find the data items in the inputs that are
responsible for the non-appearance of a given data item in the result.
Consider the example below (taken from Huang et al. VLDB 2008).
The query returns the schools in the state of California that are within the top 4
and have job openings.
The answer returned by the query is Stanford and its rank in the result.
Why-not query: Why does Berkeley not appear in the results?
What change shall I make to the source to obtain (Homer, 25) in the results?
If a potential tuple (berkeley, ca, yes) is inserted into the openings table,
Berkeley will become an answer.

Related Work: Database (Querying) Land
Module-based approaches attempt to identify the modules (sub-queries) that
are responsible for the non-appearance of a given data item in the
workflow results.
In the case of the previous example, we have only one join, which is
therefore responsible for the non-appearance of Berkeley in the result
set of the query.

Related Work: Workflow Land
The only proposal in this category for workflow provenance is the
Why-Not algorithm proposed by Chapman and Jagadish (2009).
In the Why-Not algorithm, the user query is expressed as a set of
atomic predicates that are combined using AND and OR.
Chapman and Jagadish make the assumption that the attributes of
the input datasets are preserved by the modules that compose the
workflow.
This assumption, however, does not hold in general.

Related Work: Workflow Land
For example, the modules in the workflow
illustrated on the right do not preserve the
attribute of the input, viz. Compound-ID,
in that the outputs of the first and the
second modules do not contain information
about the compound identifier.
In the work presented in this talk, we drop
the assumption made by Chapman and
Jagadish, and propose a solution that can
be utilized for answering why-not queries
for workflows with modules that do not
preserve attributes of the input datasets.
Furthermore, unlike the Why-Not
algorithm, which is module-based, our
proposal is hybrid in that it seeks to
answer both instance-based and module-based
why-not queries.
[Figure: example workflow. Workflow input port: compound_id. Modules: get_compound_info, extract_pathway_from_compounds_file. Workflow output port: output_pathways.]

Foundations
Why-not query: A user specifies a why-not query by providing a
data item $d_{\text{why-not}}$ that has the same data type as the output of the
last module of the workflow and was not returned by the workflow
executions.
Module pickiness: Central to the evaluation of why-not queries is
the pickiness of the workflow's modules. A module M in a workflow is picky with
respect to a data item d if its inverse $M_{\text{inv}}$ does not accept d as
input. More specifically, $M_{\text{inv}}$ throws an illegal input exception when
it is fed d.
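
The pickiness definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exception class IllegalInputError, the helper is_picky, and the toy lookup table standing in for the inverse of get_compound_info are all hypothetical names and values, not part of the talk.

```python
class IllegalInputError(Exception):
    """Hypothetical exception: raised by an inverse module fed an input it cannot invert."""


def is_picky(inverse_module, data_item):
    """Return True if the inverse module rejects data_item, i.e. the module is picky w.r.t. it."""
    try:
        inverse_module(data_item)
    except IllegalInputError:
        return True
    return False


# Toy stand-in for the inverse of get_compound_info (values are made up).
NAME_TO_ID = {"glucose": "cpd-1", "pyruvate": "cpd-2"}


def get_compound_info_inv(name):
    """Map a compound name back to its identifier, or reject the name."""
    if name not in NAME_TO_ID:
        raise IllegalInputError(name)
    return NAME_TO_ID[name]


print(is_picky(get_compound_info_inv, "glucose"))
print(is_picky(get_compound_info_inv, "unobtainium"))
```

In practice the inverse is rarely available, which is the problem the later slides address.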

Processing Why-Not Queries
The algorithm for processing why-not queries takes as input a data item
$d_{\text{why-not}}$ specified by the user.
To answer a why-not query, the modules of the workflow are explored from
the sink to the source in a breadth-first fashion. To do so, we group the
workflow modules into levels as illustrated in the figure below.
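
One possible way to group modules into levels is a breadth-first traversal that starts at the sink modules and follows edges backwards. The sketch below assumes the workflow is given as a dictionary mapping each module to its upstream modules, and assigns each module the shortest distance to a sink; the function name group_into_levels and this representation are assumptions made for illustration.

```python
from collections import deque


def group_into_levels(predecessors, sinks):
    """Group modules by breadth-first distance from the sink modules (level 0 = sinks)."""
    level_of = {module: 0 for module in sinks}
    queue = deque(sinks)
    while queue:
        module = queue.popleft()
        for upstream in predecessors.get(module, []):
            if upstream not in level_of:  # first visit gives the shortest distance
                level_of[upstream] = level_of[module] + 1
                queue.append(upstream)
    grouped = {}
    for module, level in level_of.items():
        grouped.setdefault(level, []).append(module)
    return [grouped[level] for level in sorted(grouped)]


# The two-module example workflow from the earlier slide.
workflow = {
    "extract_pathway_from_compounds_file": ["get_compound_info"],
    "get_compound_info": [],
}
levels = group_into_levels(workflow, ["extract_pathway_from_compounds_file"])
print(levels)
```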

Processing Why-Not Queries
The modules of each level are examined to identify whether each module is picky.
Specifically, the inverse of the module in question, M, is examined to check
which of the following holds:
1. It does not accept the corresponding data items that were generated
by the inverse of the modules in the previous level. In this case, M is picky.
2. It accepts the corresponding data items that were generated by the
inverse of the modules in the previous level. In this case, the data items
the inverse of M produces are saved to be
used to feed the inverse of the modules in the succeeding levels, if any.
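
The level-by-level examination can be sketched as follows. This is a simplified reading, not the algorithm as published: it assumes inverse modules are available as Python callables that return a list of data items or raise a hypothetical IllegalInputError, it feeds every item from the previous level to every module of the current level, and it treats a module as picky when it rejects all of those items. The toy tables are made up.

```python
class IllegalInputError(Exception):
    """Hypothetical exception raised by an inverse module that rejects its input."""


def answer_why_not(levels, inverses, d_why_not):
    """Walk the levels from sink to source; report a picky module or the inputs reached."""
    items = [d_why_not]
    for level in levels:
        produced = []
        for module in level:
            accepted_any = False
            for item in items:
                try:
                    produced.extend(inverses[module](item))
                    accepted_any = True
                except IllegalInputError:
                    pass  # this item is rejected; try the next one
            if not accepted_any:
                return {"picky_module": module, "missing_inputs": None}
        items = produced  # saved to feed the inverses of the next level
    return {"picky_module": None, "missing_inputs": items}


# Toy inverse modules for the example workflow (tables are illustrative only).
PATHWAY_TO_NAMES = {"glycolysis": ["glucose", "pyruvate"]}
NAME_TO_ID = {"glucose": "cpd-1"}


def extract_pathway_inv(pathway):
    if pathway not in PATHWAY_TO_NAMES:
        raise IllegalInputError(pathway)
    return list(PATHWAY_TO_NAMES[pathway])


def get_compound_info_inv(name):
    if name not in NAME_TO_ID:
        raise IllegalInputError(name)
    return [NAME_TO_ID[name]]


levels = [["extract_pathway_from_compounds_file"], ["get_compound_info"]]
inverses = {
    "extract_pathway_from_compounds_file": extract_pathway_inv,
    "get_compound_info": get_compound_info_inv,
}
found = answer_why_not(levels, inverses, "glycolysis")
blocked = answer_why_not(levels, inverses, "citric acid cycle")
print(found)
print(blocked)
```

When the walk reaches the source, the returned items are the workflow inputs that would have been needed (the instance-based answer); otherwise the first rejecting module is reported (the module-based answer).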

Identifying Picky Modules
To identify whether a module M is picky, we need to invoke its inverse $M_{\text{inv}}$
and check whether it accepts the data items in question.

However, the inverse module rarely exists.

To overcome the non-existence of the inverse module, we can probe
the module with candidate inputs until one of them produces the output we are
after; if none does, we deduce that the module in question is picky.

This is not a reasonable solution because the space of valid input
values of a module can be very large or even infinite. The problem is
exacerbated by the fact that a module may have multiple inputs,
therefore requiring the construction of all possible combinations for
probing.
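
To make the cost concrete, the sketch below probes a toy two-input module by enumerating the Cartesian product of its input domains. The function probe_by_enumeration, the budget parameter, and the toy module add are assumptions introduced for illustration; a real module would be a remote or black-box call, and its domains could be unbounded.

```python
from itertools import product
from math import prod


def probe_by_enumeration(module, input_domains, target_output, budget):
    """Try input combinations until one yields target_output.

    Returns (witness, tried, exhausted): witness is the matching combination or None,
    tried is the number of invocations, exhausted tells whether the whole space was covered.
    """
    tried = 0
    for combination in product(*input_domains):
        if tried >= budget:
            return None, tried, False  # gave up: pickiness is undecided
        tried += 1
        if module(*combination) == target_output:
            return combination, tried, True
    return None, tried, True  # whole space covered without a match: module is picky


def add(a, b):
    """Toy two-input module."""
    return a + b


domains = [range(100), range(100)]
print(prod(len(domain) for domain in domains))  # size of the probing space
print(probe_by_enumeration(add, domains, 500, budget=20_000))
print(probe_by_enumeration(add, domains, 500, budget=1_000))
```

With only two inputs of 100 values each, deciding pickiness already takes every one of the combinations; with a smaller budget the probe stops without an answer.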

Is there a more reasonable solution... that at least allows us to probe
the modules using fewer inputs?

Identifying Picky Modules by Harvesting the Web
A solution that we explored consists of harvesting what is probably the
biggest source of information, namely the Web, using the information
extraction process illustrated below.
Indeed, a large number of scientific modules that are provided
by major institutions, such as the EBI and DDBJ, also give
users the means to invoke these modules on the Web, and the traces
of those module invocations remain in a number of cases accessible
on the Web.

If none of the candidate inputs is
found to be a true positive, then we
conclude that the module is likely to
be picky.
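
The final decision step can be sketched as below. The harvesting itself is not shown: the list of candidate inputs is assumed to have already been extracted from web pages, and a candidate is assumed to count as a true positive when invoking the module on it yields the data item in question. The function name classify_with_candidates and the toy module are hypothetical.

```python
def classify_with_candidates(module, candidate_inputs, target_output):
    """Invoke the module on each harvested candidate and keep those that yield target_output."""
    true_positives = []
    for candidate in candidate_inputs:
        try:
            produced = module(candidate)
        except Exception:
            continue  # the module rejected this candidate
        if produced == target_output:
            true_positives.append(candidate)
    return {"likely_picky": not true_positives, "true_positives": true_positives}


# Toy forward module standing in for get_compound_info (values are made up).
ID_TO_NAME = {"cpd-1": "glucose", "cpd-2": "pyruvate"}


def get_compound_info(compound_id):
    return ID_TO_NAME[compound_id]  # raises KeyError for unknown identifiers


harvested = ["cpd-9", "cpd-2", "cpd-1"]
print(classify_with_candidates(get_compound_info, harvested, "glucose"))
print(classify_with_candidates(get_compound_info, harvested, "lactate"))
```

The conclusion stays hedged ("likely picky") because the candidate list covers only what could be found on the Web, not the full input space.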

Feasibility Study
The approach we have just described raises the following question: is
the proposed algorithm able to identify the reason why a given data
item does not appear in the workflow results? More specifically, how
effective is this solution in identifying picky modules and missing
input data items?
To answer the above questions, we ran a feasibility experiment, in
which we used a sample of 6 real-world workflows from the
myExperiment repository.
We selected workflows that involve deterministic modules, that is,
modules that deliver the same result (if any) given the same input.
We did not consider workflows that include modules performing data
mining operations, for instance.
We also selected workflows for which the inverse modules are
deterministic functions.

Feasibility Study
We have executed each workflow using example data inputs provided
by the workflow authors.
We then specified two kinds of queries for each workflow:
Instance-based why-not query. To assess the ability of the algorithm to
answer this type of query, we randomly selected an output data
item d that was returned by the workflow executions. Next, we used
our algorithm to see if it is able to reconstruct the lineage of d by
harvesting the Web to identify the input data items that were
responsible for its derivation.
Module-based why-not query. This kind of query is used to assess whether the
algorithm is able to identify picky modules.
In total we had 6 queries of the first kind, which we denote by
$\{q^{+}_{1}, \ldots, q^{+}_{6}\}$, and 6 queries of the second kind, which we denote by
$\{q^{-}_{1}, \ldots, q^{-}_{6}\}$.

Feasibility Study: Results
Of the queries $\{q^{+}_{1}, \ldots, q^{+}_{6}\}$, our algorithm was able to successfully
construct the provenance of the why-not query up to the workflow
input for 3 queries.
Most of the modules composing these workflows, namely 8 out of 11,
provide information about the input and output datasets on the Web
using tabular formats.
After examination of the three remaining workflows, we found that
one of them utilizes proprietary data sources, the content of which is not
accessible on the surface web.
The last two workflows, on the other hand, contain modules that
manipulate excerpts from HTML web pages. Because of this, our
algorithm was not able to find the inputs and outputs of those
modules on the Web.

We also measured the number of top-k web pages that needed to be
examined to identify the input data item corresponding to a given
output data item. On average, we needed to examine the content of
the 4 top web pages returned by the keyword search engine (we used
the Google search engine for our experiment).
In several cases, however, the top web page was the right one, in the
sense that it contained the input data item we were after.

Regarding the queries $\{q^{-}_{1}, \ldots, q^{-}_{6}\}$, our algorithm was more
successful in the sense that it was able to correctly identify 4 picky
modules out of 6.
For the two remaining workflows, the module that was identified as picky
by our algorithm was not the correct one. After examination, it
transpired that for certain modules the corresponding data item could
not be found on the Web.
Again, this issue was due to shim modules whose input and output data
items are not published on the Web.

Conclusions
To sum up, this small feasibility study has shown that our method is
promising.
It has also brought some insights into the way our solution can be
improved.
Our ongoing work includes: (i) tuning our algorithm to deal with
shim modules in a workflow, (ii) exploring new sources of information
for identifying picky modules, and (iii) running an experiment involving a large
number of scientific workflows.

References
K. Belhajjame (2018)
On Answering Why-Not Queries Against Scientific Workflow Provenance
Proceedings of EDBT, Open Proceedings, 465–468.
N. Bidoit, M. Herschel, and K. Tzompanaki (2014)
Why not?
Proceedings of EDBT, Open Proceedings, 145–156.
A. Chapman and H. V. Jagadish (2009)
Why not?
Proceedings of SIGMOD, ACM, 523–534.
J. Huang, T. Chen, A. Doan, and J. F. Naughton (2008)
On the provenance of non-answers to queries over extracted data
Proceedings of VLDB, ACM, 736–747.

The End
