1. Detecting Duplicate Records in
Scientific Workflow Results
Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1
1University of Manchester
2University of Newcastle
2. Scientific Workflows
Scientific workflows are increasingly
used by scientists as a means for
specifying and enacting their
experiments.
They tend to be data-intensive.
The data sets obtained as a result of
their enactment can be stored in
public repositories to be queried,
analyzed and used to feed the
execution of other workflows.
2 IPAW 2012
3. Duplicates in Workflow Results
The datasets obtained as a result of workflow execution often
contain duplicates.
As a result:
The analysis and interpretation of workflow results may become
tedious.
The presence of duplicates also unnecessarily increases the size
of workflow results.
4. Duplicate Record Detection
Research in duplicate record detection has been active for
more than three decades.
Elmagarmid et al. (2007) conducted a comprehensive survey of
the topic.
We do not aim to design yet another algorithm for
comparing and matching records.
Rather, we investigate how provenance traces produced as a
result of workflow executions can be used to guide the
detection of duplicate records in workflow results.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE
Trans. Knowl. Data Eng., 19(1):1–16, 2007.
5. Outline
Data-Driven Workflows and Provenance Traces
A method for guiding duplicate detection in workflow
results based on provenance traces.
Preliminary validation using real-world workflows.
6. Preliminaries: Data-Driven Workflows
A data-driven workflow can be defined as a directed graph:
$wf = (N, E)$
Each node represents an analysis operation, which has a set of
input and output parameters:
$(op, I_{op}, O_{op}) \in N$
The edges are dataflow dependencies:
$((op, o), (op', i)) \in E$
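A minimal sketch of this workflow model in Python (the class and field names are ours, chosen for illustration; the paper only gives the abstract definition):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Operation:
    """A node (op, I_op, O_op): an analysis operation with named parameters."""
    name: str
    inputs: tuple   # input parameter names, I_op
    outputs: tuple  # output parameter names, O_op

@dataclass
class Workflow:
    """A data-driven workflow wf = (N, E)."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # ((op, o), (op2, i)) dataflow links

    def connect(self, src: Operation, o: str, dst: Operation, i: str):
        # An edge links an output parameter of one operation to an input of another.
        assert o in src.outputs and i in dst.inputs
        self.edges.add(((src, o), (dst, i)))

# The two-step workflow used in the paper's running example:
identify = Operation("IdentifyProtein", inputs=("i",), outputs=("o",))
get_go = Operation("GetGOTerm", inputs=("i2",), outputs=("o2",))
wf = Workflow(nodes={identify, get_go})
wf.connect(identify, "o", get_go, "i2")
```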
7. Preliminaries: Provenance Trace
The execution of workflows gives rise to a provenance trace,
which we capture using two relations.
Transformation: specifies that the execution of an
operation took as input a given ordered set of records and
generated another ordered set of records:
$transformation(\underbrace{\langle (op, o_1, r_{o_1}), \ldots, (op, o_m, r_{o_m}) \rangle}_{OutB_{op}},\ \underbrace{\langle (op, i_1, r_{i_1}), \ldots, (op, i_n, r_{i_n}) \rangle}_{InB_{op}})$
Transfer: specifies the transfer of records along the edges of
the workflow: $transfer((op', i, r), (op, o, r))$
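A sketch of the two trace relations as plain Python data (field layout and helper names are ours, not the paper's):

```python
# transformation(OutB_op, InB_op): one execution of op produced the output
# binding OutB_op from the input binding InB_op.
transformations = []
# transfer((op', i, r), (op, o, r)): record r moved along a dataflow edge.
transfers = set()

def record_transformation(op, in_binding, out_binding):
    """Bindings are dicts {parameter_name: record}; store them as ordered tuples."""
    out_b = tuple((op, p, r) for p, r in sorted(out_binding.items()))
    in_b = tuple((op, p, r) for p, r in sorted(in_binding.items()))
    transformations.append((out_b, in_b))

def record_transfer(dst_op, i, src_op, o, r):
    transfers.add(((dst_op, i, r), (src_op, o, r)))

# One execution of IdentifyProtein, then its output flowing to GetGOTerm:
record_transformation("IdentifyProtein", {"i": "seq-001"}, {"o": "P12345"})
record_transfer("GetGOTerm", "i2", "IdentifyProtein", "o", "P12345")
```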
8. Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow
results based on provenance traces.
Preliminary validation using real-world workflows.
9. Provenance-Guided Detection of
Duplicates: Approach
To guide the detection of duplicates in workflow results we
exploit the following fact:
An operation that is known to be deterministic produces
identical output bindings given identical input bindings.
$deterministic(op) \wedge (OutB_{op}, InB_{op}) \in T \wedge (OutB'_{op}, InB'_{op}) \in T \wedge id(InB_{op}, InB'_{op}) \Rightarrow id(OutB_{op}, OutB'_{op})$
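This property can be checked over a recorded trace. A sketch, assuming the trace T is a list of (output binding, input binding) pairs as in the transformation relation:

```python
# If two executions of the same operation have identical input bindings but
# different output bindings, the determinism property is violated.
def consistent_with_determinism(T):
    outputs_for = {}
    for out_b, in_b in T:
        if in_b in outputs_for and outputs_for[in_b] != out_b:
            return False
        outputs_for[in_b] = out_b
    return True

T = [(("o", "Q9XYZ1"), ("i", "P12345")),
     (("o", "Q9XYZ1"), ("i", "P12345"))]   # same input binding, same output binding
```

A trace containing a pair with the same input binding but a different output binding would make `consistent_with_determinism` return `False`.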
10. Provenance-Guided Detection of
Duplicates: Example
[Figure: a two-step workflow, IdentifyProtein followed by GetGOTerm; input i is bound to records Ri, output o to Ro, input i' to R'i, and output o' to R'o.]
1. The set of records Ri that are bound to the input parameter
of the starting operation are compared to identify duplicate
records.
The result of this phase is a partition of disjoint sets of
identical records.
$R_i = R_i^1 \cup \ldots \cup R_i^n$
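Phase 1 can be sketched as follows. The `key` function stands in for whatever record-matching algorithm is used to decide identity; the paper deliberately does not fix one:

```python
# Partition the input records Ri into disjoint classes of identical records.
def partition(records, key=lambda r: r):
    classes = {}
    for r in records:
        classes.setdefault(key(r), []).append(r)
    return list(classes.values())

Ri = ["P12345", "P12345", "Q00001"]
# -> [["P12345", "P12345"], ["Q00001"]]
```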
11. Provenance-Guided Detection of
Duplicates: Example
[Figure: a two-step workflow, IdentifyProtein followed by GetGOTerm; input i is bound to records Ri, output o to Ro, input i' to R'i, and output o' to R'o.]
2. The sets of records Ro, R'i and R'o are partitioned into sets
of identical records based on the partitioning of Ri. For
example:
$R_o = R_o^1 \cup \ldots \cup R_o^n$, where
$R_o^j = \{ r_o \in R_o \mid \exists\, r_i \in R_i^j:\ transformation(\langle (IdentifyProtein, o, r_o) \rangle, \langle (IdentifyProtein, i, r_i) \rangle) \}$
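Phase 2 can be sketched as a propagation step: the output records are never compared directly, only looked up through the transformation relation. Here `produced` is an illustrative stand-in for that relation, mapping each input record to the output record derived from it:

```python
# Partition Ro by following the transformation relation from the partition of Ri.
def propagate_partition(ri_classes, produced):
    return [[produced[r] for r in cls] for cls in ri_classes]

produced = {"r1": "go:0005524", "r2": "go:0005524", "r3": "go:0003677"}
ri_classes = [["r1", "r2"], ["r3"]]   # r1 and r2 were found identical in phase 1
ro_classes = propagate_partition(ri_classes, produced)
# -> [["go:0005524", "go:0005524"], ["go:0003677"]]
```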
12. Provenance-Guided Detection of
Duplicates: Example
In the example just described, the operations that compose
the workflow have exactly one input and one output
parameter.
However, the algorithm presented in the paper supports
operations with multiple input and output parameters.
Notice that we assume that the analysis operations that
compose the workflow are deterministic. This is not always
the case.
This raises the question of how to determine that a given
operation is deterministic.
13. Verifying The Determinism of Analysis
Operations
To verify the determinism of operations, we use an approach
whereby operations are probed.
1. Given an operation op, we select example values that can
be used as the inputs of op, and invoke op using those
values multiple times.
2. If op produces identical output values given identical input
values, then it is likely to be deterministic; otherwise, it is
not deterministic.
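The probing step can be sketched as below. Note the asymmetry: identical outputs over finitely many runs only suggest determinism, while a single pair of differing outputs proves non-determinism:

```python
# Invoke op several times on the same example inputs and compare the outputs.
def probe(op, example_inputs, runs=3):
    outputs = [op(*example_inputs) for _ in range(runs)]
    return all(o == outputs[0] for o in outputs[1:])

assert probe(str.upper, ("p12345",))     # same output every run: likely deterministic

counter = {"n": 0}
def stateful(_x):                        # output changes across calls
    counter["n"] += 1
    return counter["n"]
assert not probe(stateful, ("p12345",))  # classified as non-deterministic
```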
14. Collection-Based Workflows
To support duplicate detection in collection-based workflows we
need to be able to:
Identify when two collections are identical
Two collections Ri and Rj are identical if they are of the same size and
there is a bijective mapping
$map : R_i \rightarrow R_j$
that maps each record ri in Ri to a record rj in Rj such that ri and rj are
identical.
Identify duplicate records between two collections that
are known to be identical
Identify a bijective mapping that maps every ri in Ri to an identical
rj in Rj.
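The identity test above can be sketched as a multiset comparison: a bijection pairing identical records exists exactly when the two collections are equal as multisets under the chosen identity key (`key` is our stand-in for the record-matching function):

```python
from collections import Counter

# Ri and Rj are identical collections iff same size and equal as multisets.
def identical_collections(Ri, Rj, key=lambda r: r):
    return len(Ri) == len(Rj) and Counter(map(key, Ri)) == Counter(map(key, Rj))

assert identical_collections(["a", "b", "b"], ["b", "a", "b"])   # a bijection exists
assert not identical_collections(["a", "b"], ["a", "a"])         # multisets differ
```

A concrete bijection can then be built by pairing off, within each key class, the i-th record of Ri with the i-th record of Rj.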
15. Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow
results based on provenance traces.
Preliminary validation using real-world workflows.
16. Validation
The method that we presented in this paper can be applied when the operations
are deterministic.
To gain insight into the degree to which the operations that compose the
workflows are deterministic, we ran an experiment.
Datasets: 15 bioinformatics workflows that cover a wide range of analyses,
namely biological pathway analysis, sequence alignment, and molecular interaction
analysis.
Process: To identify which of these operations are deterministic, we ran each
of them 3 times using example values that were found either within
myExperiment or BioCatalogue.
17. Validation
After manual analysis of the results, it transpires that 5 of the 151
operations that compose the workflows are not deterministic.
Note that many of the operations that we analyzed access and use
underlying data sources in their computation. Therefore updates to such
sources may break the determinism assumption (Chirigati and Freire,
2012).
This suggests that determinism holds within a window of time
during which the underlying sources remain the same, and that there is a
need for monitoring techniques to identify such windows.
Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical
Approach. IPAW, 2012.
18. Conclusions and Future Work
We described a method that can be used to guide duplicate
detection in workflow results.
Future work:
Monitoring the determinism of analysis operations.
Extending the method to support duplicate detection across
the results of different workflows.