SlideShare uma empresa Scribd logo
1 de 19
Baixar para ler offline
Detecting Duplicate Records in
 Scientific Workflow Results


Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1
               1University of Manchester

                2University of Newcastle
Scientific Workflows
                  Scientific workflows are increasingly
                  used by scientists as a means for
                  specifying and enacting their
                  experiments.
                  They tend to be data intensive

                  The data sets obtained as a result of
                  their enactment can be stored in
                  public repositories to be queried,
                  analyzed and used to feed the
                  execution of other workflows.
2   IPAW 2012
Duplicates in Workflow Results

      The datasets obtained as a result of workflow execution often
       contain duplicates.
      As a result:
         The analysis and interpretation of workflow results may become
          tedious.
         The presence of duplicates also unnecessarily increases the size
          of workflow results.



3   IPAW 2012
Duplicate Record Detection
      Research in duplicate record detection has been active for
        more than three decades.
          Elmagarmid et al., 2007 conducted a comprehensive survey of
            the topics.

      We do not aim to design yet another algorithm for
        comparing and matching records.
      Rather, we investigate how provenance traces produced as a
        result of workflow executions can be used to guide the
        detection of duplicate records in workflow results.
    Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Du-plicate record detection: A survey. IEEE
    Trans. Knowl. Data Eng., 19(1):1–16,2007.
4   IPAW 2012
Outline

      Data-Driven Workflows and Provenance Trace


      A method for guiding duplicates detection in workflow
       results based on provenance traces.

      Preliminary validation using real-world workflows.




5   IPAW 2012
Preliminaries: Data-Driven Workflows
      A data driven workflow can be defined as a directed graph:

                         wf = N, E
      A node represent an analysis operation, which has a set of
       input and output parameters.
                      op, Iop , Oop  ∈ N
      The edges are dataflow dependencies:

                  op, o, op , i ∈ E
6   IPAW 2012
Preliminaries: Provenance Trace
    The execution of workflows gives rise to provenance trace,
    which we capture using two relations.
      Transformation: to specify that the execution of an
    operation took as input a given ordered set of records and
    generated another ordered set of records.
      op, o1 , ro1 , . . . , op, om , rom     op, i1 , ri1 , . . . , op, in , rin
                                   OutBop     InBop

      Transfer: to specify transfer of records along the edges of
    the workflow.                op , i , r   op, o, r


7   IPAW 2012
Outline

      Data-Driven Workflows and Provenance Trace


      A method for guiding duplicates detection in workflow
       results based on provenance traces.

      Preliminary validation using real-world workflows.




8   IPAW 2012
Provenance-Guided Detection of
    Duplicates: Approach
    To guide the detection of duplicates in workflow results we
      explore the following fact:

      An operation that is known to be deterministic produces
       identical output bindings given the same input binding.

    deterministic op      OutBop    InBop   T    OutBop    InBop    T
                                                    id OutBop , OutBop




9   IPAW 2012
Provenance-Guided Detection of
     Duplicates: Example
     i                                  o        i’                     o’
                 IdentifyProtein                       GetGOTerm

     Ri                                 Ro       R’i                    R’o
     1.  The set of records Ri that are bound to the input parameter
         of the starting operation are compared to identify duplicate
         records.

          The result of this phase is a partition of disjoint sets of
          identical records.
                                   Ri       R1
                                             i         Rn
                                                        i
10   IPAW 2012
Provenance-Guided Detection of
      Duplicates: Example
     i                                     o          i’                             o’
                  IdentifyProtein                               GetGOTerm

     Ri                                   Ro         R’i                             R’o
      2.  The sets of records Ro, R’i and R’o are partitioned into sets
          of identical records based on the partitioning of Ri. For
          example:               1          n
                                    Ro     Ro              Ro
     Ri
      o   ro Ro s.t. ri Ri ,
                         i          IdentifyProtein, o, ro        IdentifyProtein, i, ri



11    IPAW 2012
Provenance-Guided Detection of
     Duplicates: Example
       In the example just described, the operations that compose
        the workflow have exactly one input and one output
        parameter.
          However, the algorithm presented in the paper supports
           operations with multiple input and output parameters.

       Notice that we assumes that the analysis operations that
        compose the workflow are deterministic. This is not always
        the case.
          This raises the question as to how to determine that a given
           operation is deterministic.
12   IPAW 2012
Verifying The Determinism of Analysis
     Operations
       To verify the determinism of operations, we use an approach
       whereby operations are probed.
     1.  Given an operation op, we select examples values that can
           be used by the inputs of op, and invoke op using those
           values multiple times.
     2.     If op produces identical output values given identical input
           values, then it is likely to be deterministic, otherwise, it is
           not deterministic.



13   IPAW 2012
Collection-Based Workflows
     To support duplicates detection in collection based workflows we
     need to be able to:
       Identify when two collections are identical
        Two collections Ri and Rj are identical if they are of the same size and
        there is a bijective mapping:
                                map : Ri          Rj
        that maps each record ri in Ri to a record rj in Rj such that ri and rj are
        identical
       Identify duplicates records between two collections that
        are known to be identical
        Identify a bijective mapping that maps every ri in Ri to an identical
        rj in Rj.
14   IPAW 2012
Outline

       Data-Driven Workflows and Provenance Trace


       A method for guiding duplicates detection in workflow
        results based on provenance traces.

       Preliminary validation using real-world workflows.




15   IPAW 2012
Validation
       The method that we presented in this paper can be applied when the operations
        are deterministic.

       To have an insight on the degree to which the operations that compose the
        workflows are deterministic, we run en experiments

       Datasets: 15 bioinformatics workflows that cover a wide range of analyzes,
        namely biological pathway analysis, sequence alignment, molecular interaction
        analysis

       Process: To identify which of these operations are deterministic, we run each
        of them 3 times using example values that were found either within
        myExperiment or Biocatalogue


16   IPAW 2012
Validation
       After manual analysis of the results, it transpires that 5 operations out of
         the 151 operations that compose the wokflows are not deterministic.

       Note that many of the operations that we analyzed access and use
         underlying data sources in their computation. Therefore updates to such
         sources may break the determinism assumption (Chirigati and Freire,
         2012).

       This suggests that the determinism holds within a window of time
         during which the underlying sources remain the same, and that there is a
         need for monitoring techniques to identify such windows.
     Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical
     Approach . IPAW, 2012.
17   IPAW 2012
Conclusions and Future Work

      we described a method that can be used to guide duplicate
       detection in workflow results.

       Monitoring the determinism of analysis operations


       Extending the method to support duplicate detection across
        the results of different workflows.



18   IPAW 2012
Detecting Duplicate Records in
 Scientific Workflow Results


Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1
               1University of Manchester

                2University of Newcastle

Mais conteúdo relacionado

Semelhante a Detecting Duplicate Records in Scientific Workflow Results

An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...Daniele Dell'Aglio
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScienceKhalid Belhajjame
 
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs Lisong Guo
 
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...ESEM 2014
 
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsSherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsDacong (Tony) Yan
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Anubhav Jain
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and modelsmyGrid team
 
2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)Stian Soiland-Reyes
 
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)Stian Soiland-Reyes
 
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...Jan Aerts
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholdermhaendel
 
HyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologiesHyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologiesMichel Dumontier
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 

Semelhante a Detecting Duplicate Records in Scientific Workflow Results (20)

2013-01-17 Research Object
2013-01-17 Research Object2013-01-17 Research Object
2013-01-17 Research Object
 
An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScience
 
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
Ikc 2015
Ikc 2015Ikc 2015
Ikc 2015
 
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
 
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsSherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...
 
Reproducibility 1
Reproducibility 1Reproducibility 1
Reproducibility 1
 
NETTAB 2013
NETTAB 2013NETTAB 2013
NETTAB 2013
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and models
 
2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)
 
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
 
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholder
 
HyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologiesHyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologies
 
icpr_2012
icpr_2012icpr_2012
icpr_2012
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 

Mais de Khalid Belhajjame

Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsKhalid Belhajjame
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsKhalid Belhajjame
 
A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsA Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsKhalid Belhajjame
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsKhalid Belhajjame
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Khalid Belhajjame
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...Khalid Belhajjame
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in SepublicaKhalid Belhajjame
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenanceKhalid Belhajjame
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Khalid Belhajjame
 

Mais de Khalid Belhajjame (18)

Provenance witha purpose
Provenance witha purposeProvenance witha purpose
Provenance witha purpose
 
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
 
Irpb workshop
Irpb workshopIrpb workshop
Irpb workshop
 
Aussois bda-mdd-2018
Aussois bda-mdd-2018Aussois bda-mdd-2018
Aussois bda-mdd-2018
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objects
 
A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsA Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its Extensions
 
Anr cair meeting feb 2016
Anr cair meeting feb 2016Anr cair meeting feb 2016
Anr cair meeting feb 2016
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scripts
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014
 
Edbt2014 talk
Edbt2014 talkEdbt2014 talk
Edbt2014 talk
 
Credible workshop
Credible workshopCredible workshop
Credible workshop
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
 
Why Workflows Break
Why Workflows BreakWhy Workflows Break
Why Workflows Break
 
D-prov use-case
D-prov use-caseD-prov use-case
D-prov use-case
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in Sepublica
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenance
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)
 
Edbt 2010, Belhajjame
Edbt 2010, BelhajjameEdbt 2010, Belhajjame
Edbt 2010, Belhajjame
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Detecting Duplicate Records in Scientific Workflow Results

  • 1. Detecting Duplicate Records in Scientific Workflow Results Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1 1University of Manchester 2University of Newcastle
  • 2. Scientific Workflows   Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments.   They tend to be data intensive   The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows. 2 IPAW 2012
  • 3. Duplicates in Workflow Results   The datasets obtained as a result of workflow execution often contain duplicates.   As a result:   The analysis and interpretation of workflow results may become tedious.   The presence of duplicates also unnecessarily increases the size of workflow results. 3 IPAW 2012
  • 4. Duplicate Record Detection   Research in duplicate record detection has been active for more than three decades.   Elmagarmid et al., 2007 conducted a comprehensive survey of the topics.   We do not aim to design yet another algorithm for comparing and matching records.   Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Du-plicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16,2007. 4 IPAW 2012
  • 5. Outline   Data-Driven Workflows and Provenance Trace   A method for guiding duplicates detection in workflow results based on provenance traces.   Preliminary validation using real-world workflows. 5 IPAW 2012
  • 6. Preliminaries: Data-Driven Workflows   A data driven workflow can be defined as a directed graph: wf = N, E   A node represent an analysis operation, which has a set of input and output parameters. op, Iop , Oop ∈ N   The edges are dataflow dependencies: op, o, op , i ∈ E 6 IPAW 2012
  • 7. Preliminaries: Provenance Trace The execution of workflows gives rise to provenance trace, which we capture using two relations.   Transformation: to specify that the execution of an operation took as input a given ordered set of records and generated another ordered set of records. op, o1 , ro1 , . . . , op, om , rom op, i1 , ri1 , . . . , op, in , rin OutBop InBop   Transfer: to specify transfer of records along the edges of the workflow. op , i , r op, o, r 7 IPAW 2012
  • 8. Outline   Data-Driven Workflows and Provenance Trace   A method for guiding duplicates detection in workflow results based on provenance traces.   Preliminary validation using real-world workflows. 8 IPAW 2012
  • 9. Provenance-Guided Detection of Duplicates: Approach To guide the detection of duplicates in workflow results we explore the following fact:   An operation that is known to be deterministic produces identical output bindings given the same input binding. deterministic op OutBop InBop T OutBop InBop T id OutBop , OutBop 9 IPAW 2012
  • 10. Provenance-Guided Detection of Duplicates: Example i o i’ o’ IdentifyProtein GetGOTerm Ri Ro R’i R’o 1.  The set of records Ri that are bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of disjoint sets of identical records. Ri R1 i Rn i 10 IPAW 2012
  • 11. Provenance-Guided Detection of Duplicates: Example i o i’ o’ IdentifyProtein GetGOTerm Ri Ro R’i R’o 2.  The sets of records Ro, R’i and R’o are partitioned into sets of identical records based on the partitioning of Ri. For example: 1 n Ro Ro Ro Ri o ro Ro s.t. ri Ri , i IdentifyProtein, o, ro IdentifyProtein, i, ri 11 IPAW 2012
  • 12. Provenance-Guided Detection of Duplicates: Example   In the example just described, the operations that compose the workflow have exactly one input and one output parameter.   However, the algorithm presented in the paper supports operations with multiple input and output parameters.   Notice that we assumes that the analysis operations that compose the workflow are deterministic. This is not always the case.   This raises the question as to how to determine that a given operation is deterministic. 12 IPAW 2012
  • 13. Verifying The Determinism of Analysis Operations To verify the determinism of operations, we use an approach whereby operations are probed. 1.  Given an operation op, we select examples values that can be used by the inputs of op, and invoke op using those values multiple times. 2.  If op produces identical output values given identical input values, then it is likely to be deterministic, otherwise, it is not deterministic. 13 IPAW 2012
  • 14. Collection-Based Workflows To support duplicates detection in collection based workflows we need to be able to:   Identify when two collections are identical Two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping: map : Ri Rj that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical   Identify duplicates records between two collections that are known to be identical Identify a bijective mapping that maps every ri in Ri to an identical rj in Rj. 14 IPAW 2012
  • 15. Outline   Data-Driven Workflows and Provenance Trace   A method for guiding duplicates detection in workflow results based on provenance traces.   Preliminary validation using real-world workflows. 15 IPAW 2012
  • 16. Validation   The method that we presented in this paper can be applied when the operations are deterministic.   To have an insight on the degree to which the operations that compose the workflows are deterministic, we run en experiments   Datasets: 15 bioinformatics workflows that cover a wide range of analyzes, namely biological pathway analysis, sequence alignment, molecular interaction analysis   Process: To identify which of these operations are deterministic, we run each of them 3 times using example values that were found either within myExperiment or Biocatalogue 16 IPAW 2012
  • 17. Validation   After manual analysis of the results, it transpires that 5 operations out of the 151 operations that compose the wokflows are not deterministic.   Note that many of the operations that we analyzed access and use underlying data sources in their computation. Therefore updates to such sources may break the determinism assumption (Chirigati and Freire, 2012).   This suggests that the determinism holds within a window of time during which the underlying sources remain the same, and that there is a need for monitoring techniques to identify such windows. Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical Approach . IPAW, 2012. 17 IPAW 2012
  • 18. Conclusions and Future Work  we described a method that can be used to guide duplicate detection in workflow results.   Monitoring the determinism of analysis operations   Extending the method to support duplicate detection across the results of different workflows. 18 IPAW 2012
  • 19. Detecting Duplicate Records in Scientific Workflow Results Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1 1University of Manchester 2University of Newcastle