1. Detecting Duplicate Records in
Scientific Workflow Results
Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1
1University of Manchester
2University of Newcastle
2. Scientific Workflows
Scientific workflows are increasingly
used by scientists as a means for
specifying and enacting their
experiments.
They tend to be data-intensive.
The data sets obtained as a result of
their enactment can be stored in
public repositories to be queried,
analyzed and used to feed the
execution of other workflows.
2 IPAW 2012
3. Duplicates in Workflow Results
The datasets obtained as a result of workflow execution often
contain duplicates.
As a result:
The analysis and interpretation of workflow results may become
tedious.
The presence of duplicates also unnecessarily increases the size
of workflow results.
4. Duplicate Record Detection
Research in duplicate record detection has been active for
more than three decades.
Elmagarmid et al. (2007) conducted a comprehensive survey of
the topic.
We do not aim to design yet another algorithm for
comparing and matching records.
Rather, we investigate how provenance traces produced as a
result of workflow executions can be used to guide the
detection of duplicate records in workflow results.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE
Trans. Knowl. Data Eng., 19(1):1–16, 2007.
5. Outline
Data-Driven Workflows and Provenance Traces
A method for guiding duplicate detection in workflow
results based on provenance traces.
Preliminary validation using real-world workflows.
6. Preliminaries: Data-Driven Workflows
A data-driven workflow can be defined as a directed graph:
$wf = (N, E)$
Each node represents an analysis operation, which has a set of
input and output parameters:
$(op, I_{op}, O_{op}) \in N$
The edges are dataflow dependencies:
$((op, o), (op', i)) \in E$
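A minimal sketch of this workflow model in Python (the class and field names are ours, chosen for illustration; the paper only gives the abstract definition):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Operation:
    """A node (op, I_op, O_op): an analysis operation with named parameters."""
    name: str
    inputs: tuple   # input parameter names, I_op
    outputs: tuple  # output parameter names, O_op

@dataclass
class Workflow:
    """A data-driven workflow wf = (N, E)."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # ((op, o), (op2, i)) dataflow links

    def connect(self, src: Operation, o: str, dst: Operation, i: str):
        # An edge links an output parameter of one operation to an input of another.
        assert o in src.outputs and i in dst.inputs
        self.edges.add(((src, o), (dst, i)))

# The two-step workflow used in the paper's running example:
identify = Operation("IdentifyProtein", inputs=("i",), outputs=("o",))
get_go = Operation("GetGOTerm", inputs=("i2",), outputs=("o2",))
wf = Workflow(nodes={identify, get_go})
wf.connect(identify, "o", get_go, "i2")
```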
7. Preliminaries: Provenance Trace
The execution of workflows gives rise to a provenance trace,
which we capture using two relations.
Transformation: specifies that the execution of an
operation took as input a given ordered set of records and
generated another ordered set of records:
$transformation(\underbrace{\langle (op, o_1, r_{o_1}), \ldots, (op, o_m, r_{o_m}) \rangle}_{OutB_{op}},\ \underbrace{\langle (op, i_1, r_{i_1}), \ldots, (op, i_n, r_{i_n}) \rangle}_{InB_{op}})$
Transfer: specifies the transfer of records along the edges of
the workflow: $transfer((op', i, r), (op, o, r))$
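A sketch of the two trace relations as plain Python data (field layout and helper names are ours, not the paper's):

```python
# transformation(OutB_op, InB_op): one execution of op produced the output
# binding OutB_op from the input binding InB_op.
transformations = []
# transfer((op', i, r), (op, o, r)): record r moved along a dataflow edge.
transfers = set()

def record_transformation(op, in_binding, out_binding):
    """Bindings are dicts {parameter_name: record}; store them as ordered tuples."""
    out_b = tuple((op, p, r) for p, r in sorted(out_binding.items()))
    in_b = tuple((op, p, r) for p, r in sorted(in_binding.items()))
    transformations.append((out_b, in_b))

def record_transfer(dst_op, i, src_op, o, r):
    transfers.add(((dst_op, i, r), (src_op, o, r)))

# One execution of IdentifyProtein, then its output flowing to GetGOTerm:
record_transformation("IdentifyProtein", {"i": "seq-001"}, {"o": "P12345"})
record_transfer("GetGOTerm", "i2", "IdentifyProtein", "o", "P12345")
```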
8. Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow
results based on provenance traces.
Preliminary validation using real-world workflows.
9. Provenance-Guided Detection of
Duplicates: Approach
To guide the detection of duplicates in workflow results we
exploit the following fact:
An operation that is known to be deterministic produces
identical output bindings given identical input bindings.
$deterministic(op) \wedge (OutB_{op}, InB_{op}) \in T \wedge (OutB'_{op}, InB'_{op}) \in T \wedge id(InB_{op}, InB'_{op}) \Rightarrow id(OutB_{op}, OutB'_{op})$
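This property can be checked over a recorded trace. A sketch, assuming the trace T is a list of (output binding, input binding) pairs as in the transformation relation:

```python
# If two executions of the same operation have identical input bindings but
# different output bindings, the determinism property is violated.
def consistent_with_determinism(T):
    outputs_for = {}
    for out_b, in_b in T:
        if in_b in outputs_for and outputs_for[in_b] != out_b:
            return False
        outputs_for[in_b] = out_b
    return True

T = [(("o", "Q9XYZ1"), ("i", "P12345")),
     (("o", "Q9XYZ1"), ("i", "P12345"))]   # same input binding, same output binding
```

A trace containing a pair with the same input binding but a different output binding would make `consistent_with_determinism` return `False`.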
10. Provenance-Guided Detection of
Duplicates: Example
[Figure: a two-step workflow, IdentifyProtein followed by GetGOTerm; input i is bound to records Ri, output o to Ro, input i' to R'i, and output o' to R'o.]
1. The set of records Ri that are bound to the input parameter
of the starting operation are compared to identify duplicate
records.
The result of this phase is a partition of disjoint sets of
identical records.
$R_i = R_i^1 \cup \ldots \cup R_i^n$
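Phase 1 can be sketched as follows. The `key` function stands in for whatever record-matching algorithm is used to decide identity; the paper deliberately does not fix one:

```python
# Partition the input records Ri into disjoint classes of identical records.
def partition(records, key=lambda r: r):
    classes = {}
    for r in records:
        classes.setdefault(key(r), []).append(r)
    return list(classes.values())

Ri = ["P12345", "P12345", "Q00001"]
# -> [["P12345", "P12345"], ["Q00001"]]
```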
11. Provenance-Guided Detection of
Duplicates: Example
[Figure: a two-step workflow, IdentifyProtein followed by GetGOTerm; input i is bound to records Ri, output o to Ro, input i' to R'i, and output o' to R'o.]
2. The sets of records Ro, R'i and R'o are partitioned into sets
of identical records based on the partitioning of Ri. For
example:
$R_o = R_o^1 \cup \ldots \cup R_o^n$, where
$R_o^j = \{ r_o \in R_o \mid \exists\, r_i \in R_i^j:\ transformation(\langle (IdentifyProtein, o, r_o) \rangle, \langle (IdentifyProtein, i, r_i) \rangle) \}$
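Phase 2 can be sketched as a propagation step: the output records are never compared directly, only looked up through the transformation relation. Here `produced` is an illustrative stand-in for that relation, mapping each input record to the output record derived from it:

```python
# Partition Ro by following the transformation relation from the partition of Ri.
def propagate_partition(ri_classes, produced):
    return [[produced[r] for r in cls] for cls in ri_classes]

produced = {"r1": "go:0005524", "r2": "go:0005524", "r3": "go:0003677"}
ri_classes = [["r1", "r2"], ["r3"]]   # r1 and r2 were found identical in phase 1
ro_classes = propagate_partition(ri_classes, produced)
# -> [["go:0005524", "go:0005524"], ["go:0003677"]]
```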
12. Provenance-Guided Detection of
Duplicates: Example
In the example just described, the operations that compose
the workflow have exactly one input and one output
parameter.
However, the algorithm presented in the paper supports
operations with multiple input and output parameters.
Notice that we assume that the analysis operations that
compose the workflow are deterministic. This is not always
the case.
This raises the question of how to determine that a given
operation is deterministic.
13. Verifying The Determinism of Analysis
Operations
To verify the determinism of operations, we use an approach
whereby operations are probed.
1. Given an operation op, we select example values that can
be used as the inputs of op, and invoke op using those
values multiple times.
2. If op produces identical output values given identical input
values, then it is likely to be deterministic; otherwise, it is
not deterministic.
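The probing step can be sketched as below. Note the asymmetry: identical outputs over finitely many runs only suggest determinism, while a single pair of differing outputs proves non-determinism:

```python
# Invoke op several times on the same example inputs and compare the outputs.
def probe(op, example_inputs, runs=3):
    outputs = [op(*example_inputs) for _ in range(runs)]
    return all(o == outputs[0] for o in outputs[1:])

assert probe(str.upper, ("p12345",))     # same output every run: likely deterministic

counter = {"n": 0}
def stateful(_x):                        # output changes across calls
    counter["n"] += 1
    return counter["n"]
assert not probe(stateful, ("p12345",))  # classified as non-deterministic
```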
14. Collection-Based Workflows
To support duplicate detection in collection-based workflows we
need to be able to:
Identify when two collections are identical
Two collections Ri and Rj are identical if they are of the same size and
there is a bijective mapping
$map : R_i \rightarrow R_j$
that maps each record ri in Ri to a record rj in Rj such that ri and rj are
identical.
Identify duplicate records between two collections that
are known to be identical
Identify a bijective mapping that maps every ri in Ri to an identical
rj in Rj.
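The identity test above can be sketched as a multiset comparison: a bijection pairing identical records exists exactly when the two collections are equal as multisets under the chosen identity key (`key` is our stand-in for the record-matching function):

```python
from collections import Counter

# Ri and Rj are identical collections iff same size and equal as multisets.
def identical_collections(Ri, Rj, key=lambda r: r):
    return len(Ri) == len(Rj) and Counter(map(key, Ri)) == Counter(map(key, Rj))

assert identical_collections(["a", "b", "b"], ["b", "a", "b"])   # a bijection exists
assert not identical_collections(["a", "b"], ["a", "a"])         # multisets differ
```

A concrete bijection can then be built by pairing off, within each key class, the i-th record of Ri with the i-th record of Rj.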
15. Outline
Data-Driven Workflows and Provenance Trace
A method for guiding duplicate detection in workflow
results based on provenance traces.
Preliminary validation using real-world workflows.
16. Validation
The method that we presented in this paper can be applied when the operations
are deterministic.
To gain insight into the degree to which the operations that compose the
workflows are deterministic, we ran an experiment.
Datasets: 15 bioinformatics workflows that cover a wide range of analyses,
namely biological pathway analysis, sequence alignment, and molecular interaction
analysis.
Process: To identify which of these operations are deterministic, we ran each
of them 3 times using example values that were found either within
myExperiment or BioCatalogue.
17. Validation
After manual analysis of the results, it transpires that 5 of the 151
operations that compose the workflows are not deterministic.
Note that many of the operations that we analyzed access and use
underlying data sources in their computation. Therefore updates to such
sources may break the determinism assumption (Chirigati and Freire,
2012).
This suggests that determinism holds within a window of time
during which the underlying sources remain the same, and that there is a
need for monitoring techniques to identify such windows.
Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical
Approach. IPAW, 2012.
18. Conclusions and Future Work
We described a method that can be used to guide duplicate
detection in workflow results.
Future work:
Monitoring the determinism of analysis operations.
Extending the method to support duplicate detection across
the results of different workflows.