Paolo Missier and Jacek Cala
Newcastle University, UK
IEEE Big Data Congress
Milan, Italy
July 8th, 2019
Efficient Re-computation of Big Data Analytics Processes
in the Presence of Changes
In collaboration with
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University
2
Context
[Figure: Big Data → The Big Analytics Machine → Actionable Knowledge]
Data Science over time: versions V1, V2, V3 along a time axis t
Meta-knowledge:
• Algorithms
• Tools
• Libraries
• Reference datasets
3
What changes?
• Genomics
• Reference databases
• Algorithms and libraries
• Simulation
• Large parameter space
• Input conditions
• Machine Learning
• Evolving ground truth datasets
• Model re-training
4
Genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 mins / GB
@ 13GB / exome: about 10 hours
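The "about 10 hours" figure follows directly from the two numbers quoted on this slide. A minimal sketch of the arithmetic (the constant and function names are illustrative, not taken from the talk):

```python
# Figures quoted on the slide; the names are illustrative only.
MINUTES_PER_GB = 45   # Spark GATK tools on Azure
EXOME_SIZE_GB = 13    # exome size quoted on the slide


def exome_runtime_hours(size_gb: float = EXOME_SIZE_GB,
                        minutes_per_gb: float = MINUTES_PER_GB) -> float:
    """Estimated processing time in hours for one exome."""
    return size_gb * minutes_per_gb / 60


print(exome_runtime_hours())  # 9.75, i.e. "about 10 hours"
```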
5
Genomics: WES / WGS, Variant calling → Variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
SVI: Simple Variant Interpretation
Variant classification : pathogenic, benign and unknown/uncertain
6
Blind reaction to change: a game of battleship
Sparsity issue:
• About 500 executions
• 33 patients
• total runtime about 60 hours
• Only 14 relevant output changes detected
4.2 hours of computation per change
Should we care about updates?
Evolving knowledge about
gene variations
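The sparsity figures on this slide can be checked with a few lines of arithmetic. This is only a sketch using the rounded numbers from the slide; the names are illustrative. With these rounded inputs the cost per change comes out near 4.3 hours, so the 4.2 quoted on the slide presumably reflects the unrounded totals.

```python
# Rounded figures from the slide; names are illustrative.
EXECUTIONS = 500        # "about 500 executions"
TOTAL_HOURS = 60        # "about 60 hours"
RELEVANT_CHANGES = 14   # executions whose output actually changed


def hit_rate(relevant: int, executions: int) -> float:
    """Fraction of blind re-executions that produced a relevant change."""
    return relevant / executions


def hours_per_change(total_hours: float, relevant: int) -> float:
    """Compute time spent per relevant change detected."""
    return total_hours / relevant


print(f"hit rate: {hit_rate(RELEVANT_CHANGES, EXECUTIONS):.1%}")
print(f"hours per change: {hours_per_change(TOTAL_HOURS, RELEVANT_CHANGES):.1f}")
```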
7
ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent changes +
not all changes significant
Challenge:
Make re-computation efficient in response to changes
Assumptions:
• Processes are observable
• Processes are reproducible
• Impact estimates are cheap
Insight: replace re-computation with change impact estimation
Using history of past executions
8
Reproducibility
How
Selective:
- Across a cohort of past executions → which subset of individuals?
- Within a single re-execution → which process fragments?
[Figure labels: Change in ClinVar, Change in GeneMap]
→ Why, when, to what extent
9
The rest of the talk
• Approach
• Architecture
• Evaluation (case study)
10
The ReComp meta-process
[Figure: the ReComp meta-process loop]
• Change events: changes to reference datasets and inputs
• Detect and quantify changes: data diff(d, d’)
• For each past instance, estimate the impact of changes: impact(d→d’, o), using impact estimation functions (scope)
• Select relevant sub-processes (optimisation)
• Partially re-execute: P(D) → P(D’)
• Record execution history: analytics process P, log / provenance, History DB
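The loop above can be sketched as a single function that takes the diff, impact and re-execution steps as plug-ins. This is a minimal sketch under stated assumptions: `PastExecution`, `recomp_react` and the callback signatures are hypothetical names chosen for illustration, not the ReComp API, and the sub-process selection step is folded into the `reexecute` callback.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class PastExecution:
    """One entry in the history DB (hypothetical shape)."""
    case_id: str
    output: Any


def recomp_react(history: List[PastExecution], d_old: Any, d_new: Any,
                 diff: Callable, impact: Callable,
                 reexecute: Callable) -> List[str]:
    """React to one change event d_old -> d_new.

    Returns the ids of the past executions that were refreshed.
    """
    delta = diff(d_old, d_new)          # detect and quantify the change
    if not delta:
        return []                        # nothing changed: nothing to do
    refreshed = []
    for run in history:                  # scope: which past instances?
        if impact(delta, run.output):    # cheap estimate, not a re-run
            run.output = reexecute(run, d_new)
            refreshed.append(run.case_id)
    return refreshed
```

The point of the design is that `impact` must be much cheaper than `reexecute`; otherwise the loop saves nothing over blind re-computation.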
11
How much do we know about P?
Less known ← → more known (process structure, execution trace)
Black box
• I/O only; I/O provenance
• Typical process: monolithic process, legacy → a complex simulator
• All-or-nothing
White box
• Step-by-step provenance
• Typical process: workflows, R / Python code → genomics analytics
• Fine-grained impact
• Partial re-execution → restart trees (*)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018. London: Springer; 2018.
12
SVI: data-diff and impact functions
- Data-specific
- Process-specific
[Figure labels: OMIM, ClinVar, overall impact]
13
Diff functions for SVI
[Figure: diff between ClinVar 1/2016 and ClinVar 1/2017, with the unchanged part marked]
Relational data → simple set difference
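For relational data, the diff between two versions reduces to two set differences. A minimal sketch, assuming a simplified two-column schema (variant id, status) and made-up rows; neither is the real ClinVar layout:

```python
from typing import Dict, Iterable, Set, Tuple

Row = Tuple[str, str]  # (variant_id, status): an assumed, simplified schema


def tabular_diff(old_rows: Iterable[Row],
                 new_rows: Iterable[Row]) -> Dict[str, Set[Row]]:
    """Rows that disappeared from, or appeared in, the new version."""
    old, new = set(old_rows), set(new_rows)
    return {"removed": old - new, "added": new - old}


# Made-up rows, for illustration only.
v2016 = [("var1", "benign"), ("var2", "pathogenic"), ("var3", "uncertain")]
v2017 = [("var1", "benign"), ("var2", "benign"), ("var4", "pathogenic")]
delta = tabular_diff(v2016, v2017)
```

An updated record shows up as one removed row plus one added row (here `var2`), while unchanged rows (`var1`) drop out of the diff entirely.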
14
Example impact functions: SVI
Returns True iff:
- Known variants have moved in/out of Red status
- New Red variants have appeared
- Known Red variants have been retracted
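The three conditions above collapse into one test: did the Red status of any variant change between the two versions? A minimal sketch, assuming each ClinVar version is available as a mapping from variant id to status, that "red" is the status label, and that the check is scoped to the variants of one patient. All three are assumptions made for illustration.

```python
from typing import Dict, Iterable

RED = "red"  # assumed label for the Red status


def svi_impact(old: Dict[str, str], new: Dict[str, str],
               patient_variants: Iterable[str]) -> bool:
    """True iff some patient variant gained or lost Red status.

    Covers all three cases: a known variant moving in or out of Red,
    a new Red variant appearing, and a known Red variant being retracted.
    """
    for variant in patient_variants:
        was_red = old.get(variant) == RED   # a missing variant is not Red
        is_red = new.get(variant) == RED
        if was_red != is_red:
            return True
    return False
```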
15
ReComp decision matrix for SVI
Impact: yes / no / not assessed
delta functions: data diff detected?
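One plausible reading of the matrix: each cell records, for one patient and one reference dataset, whether a diff was detected and, if so, what the impact function returned. The sketch below encodes that reading; the enum, the function names and the "re-execute if any cell says yes" rule are assumptions, not taken from the slide.

```python
from enum import Enum
from typing import Callable, Iterable


class Impact(Enum):
    YES = "yes"
    NO = "no"
    NOT_ASSESSED = "not assessed"


def assess(diff_detected: bool, impact_fn: Callable[[], bool]) -> Impact:
    """Fill one cell: impact is only assessed when a diff was detected."""
    if not diff_detected:
        return Impact.NOT_ASSESSED
    return Impact.YES if impact_fn() else Impact.NO


def should_reexecute(cells: Iterable[Impact]) -> bool:
    """Assumed rule: re-execute a case if any dataset reports impact."""
    return any(cell is Impact.YES for cell in cells)
```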
16
Empirical validation
Re-executions: 495 → 71 (ideal: 14)
But: no false negatives
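These numbers can be turned into the usual selection metrics. The sketch reads 495 as the blind re-executions, 71 as those selected by ReComp, and 14 as those that were actually needed; that reading, and the function name, are assumptions.

```python
def selection_metrics(blind: int, selected: int, needed: int,
                      false_negatives: int = 0) -> dict:
    """Savings and accuracy of selective re-execution.

    Assumes every needed re-execution that was not missed is among
    the selected ones.
    """
    found = needed - false_negatives
    return {
        "avoided": 1 - selected / blind,   # share of blind runs skipped
        "precision": found / selected,     # selected runs that mattered
        "recall": found / needed,          # needed runs that were selected
    }


metrics = selection_metrics(blind=495, selected=71, needed=14)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under this reading roughly 86% of the blind re-executions are avoided, precision is about 20%, and recall is 100% because there are no false negatives.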
17
SVI implemented as a workflow
[Figure: SVI workflow]
Steps: Phenotype to genes, Variant selection, Variant classification
Inputs: Phenotype, Patient variants
Reference datasets: GeneMap, ClinVar
Output: Classified variants
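Because the process is a workflow, a change in one reference dataset only invalidates the steps downstream of it. The sketch below wires the three steps in the way their names suggest (GeneMap feeds "Phenotype to genes", ClinVar feeds "Variant classification"); this wiring is an assumption, since the slide only lists the labels.

```python
from typing import Dict, List

# Each step maps to the data items or upstream steps it consumes.
# The wiring is assumed; dict order doubles as a topological order.
WORKFLOW: Dict[str, List[str]] = {
    "Phenotype to genes": ["Phenotype", "GeneMap"],
    "Variant selection": ["Patient variants", "Phenotype to genes"],
    "Variant classification": ["Variant selection", "ClinVar"],
}


def steps_to_rerun(changed: str) -> List[str]:
    """Steps that depend, directly or transitively, on the changed item."""
    dirty = {changed}
    affected = []
    for step, inputs in WORKFLOW.items():   # relies on insertion order
        if dirty.intersection(inputs):
            dirty.add(step)                  # its output is now stale too
            affected.append(step)
    return affected
```

With this wiring, a ClinVar change touches only the last step, while a GeneMap change forces all three to run again.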
18
Execution trace / Provenance
[Figure: UML class diagram of the provenance model]
Classes: User, Execution, Controller, Program, Workflow, Channel, Port
PROV stereotypes: «Entity», «Collection», «agent», «Association», «Usage», «Generation»
PROV relations: «used», «wasGeneratedBy», «wasAssociatedWith», «wasInformedBy», «wasDerivedFrom», «hadMember», «hadPlan», «qualifiedUsage», «qualifiedGeneration», «qualifiedAssociation»
Other associations: wasPartOf, hasSubProgram, controlledBy, controls, hasInPort, hasOutPort, hadInPort, hadOutPort, connectsTo, hadEntity, hasDefaultParam
19
SVI – restart trees
Overhead: caching intermediate data
Time savings
Change in   Partial re-exec (sec)   Complete re-exec (sec)   Time saving (%)
GeneMap     325                     455                      28.5
ClinVar     287                     455                      37
[Figure: restart trees for a change in ClinVar and for a change in GeneMap]
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
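The time-saving percentages follow from the partial and complete re-execution times as (complete − partial) / complete. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def time_saving_pct(partial_sec: float, complete_sec: float) -> float:
    """Percentage of runtime saved by a partial re-execution."""
    return (complete_sec - partial_sec) / complete_sec * 100


COMPLETE_SEC = 455
for dataset, partial_sec in [("GeneMap", 325), ("ClinVar", 287)]:
    print(f"{dataset}: {time_saving_pct(partial_sec, COMPLETE_SEC):.2f}%")
```

The computed values, about 28.57% and 36.92%, agree with the 28.5 and 37 quoted on the slide up to rounding.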
20
Architecture
[Figure: ReComp architecture]
• ReComp Core: runs the ReComp loop
  - React to change events
  - Construct restart trees, which are then executed
• HDB («ProvONE store»): process and data provenance, stored and retrieved as Prolog facts
• Interfaces to external services (REST API):
  - DiffService interface: Tabular-Diff Service (difference function)
  - ImpactService interface: Impact Service B (impact function)
  - ReExecService interface: ReExecution Service A (re-execution function)
• User process: inputs and outputs, running in its runtime environment
21
Customising ReComp in practice
Enable provenance capture / Map to PROV
22
Summary
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Generic framework
Fine-grained provenance + control → max savings
Tested on two case studies
- Genomics
- Simulation (flood modelling) → see paper

Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparency
 
Provenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-ComputationProvenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-Computation
 
Transparency in ML and AI (humble views from a concerned academic)
Transparency in ML and AI (humble views from a concerned academic)Transparency in ML and AI (humble views from a concerned academic)
Transparency in ML and AI (humble views from a concerned academic)
 
Mind My Value: A decentralised infrastructure for fair and trusted IoT data ...
Mind My Value:  A decentralised infrastructure for fair and trusted IoT data ...Mind My Value:  A decentralised infrastructure for fair and trusted IoT data ...
Mind My Value: A decentralised infrastructure for fair and trusted IoT data ...
 
Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...
 

Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

  • 1. Paolo Missier and Jacek Cala, Newcastle University, UK. IEEE Big Data Congress, Milan, Italy, July 8th, 2019. Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes. In collaboration with: Institute of Genetic Medicine, Newcastle University; School of GeoSciences, Newcastle University.
  • 2. Context. Big Data; The Big Analytics Machine; Actionable Knowledge. Analytics and Data Science over time (V1, V2, V3). Meta-knowledge: algorithms, tools, libraries, reference datasets.
  • 3. What changes? Genomics: reference databases; algorithms and libraries. Simulation: large parameter space; input conditions. Machine Learning: evolving ground truth datasets; model re-training.
  • 4. Genomics. Image credits: Broad Institute https://software.broadinstitute.org/gatk/ and https://www.genomicsengland.co.uk/the-100000-genomes-project/ Spark GATK tools on Azure: 45 mins / GB at 13 GB per exome, about 10 hours.
  • 5. Genomics: WES / WGS, Variant calling → Variant interpretation. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer. SVI: Simple Variant Interpretation. Variant classification: pathogenic, benign and unknown/uncertain.
  • 6. Blind reaction to change: a game of battleship. Sparsity issue: about 500 executions; 33 patients; total runtime about 60 hours; only 14 relevant output changes detected; 4.2 hours of computation per change. Should we care about updates? Evolving knowledge about gene variations.
  • 7. ReComp, http://recomp.org.uk/ Outcome: a framework for selective re-computation that is generic and customisable. Scope: expensive analysis + frequent changes + not all changes significant. Challenge: make re-computation efficient in response to changes. Assumptions: processes are observable and reproducible; estimates are cheap. Insight: replace re-computation with change impact estimation, using the history of past executions.
  • 8. Reproducibility. How selective? Across a cohort of past executions → which subset of individuals? Within a single re-execution → which process fragments? Change in ClinVar; change in GeneMap → why, when, to what extent.
  • 9. The rest of the talk: approach; architecture; evaluation (case study).
  • 10. The ReComp meta-process. Record execution history: the analytics process P writes its log / provenance to the History DB. Detect and quantify changes: data diff(d, d’) over change events; changes affect reference datasets and inputs. For each past instance, estimate the impact of changes: Impact(d→d’, o), using impact estimation functions (scope). Select relevant sub-processes (optimisation). Partially re-execute P: P(D) → P(D’).
  • 11. How much do we know about P? Two dimensions, from less to more knowledge: process structure and execution trace; each level determines what impact estimation and re-execution are possible. Black box: I/O provenance, I/O only; all-or-nothing re-execution; typical process: monolithic, legacy → a complex simulator. White box: step-by-step provenance; typical process: workflows, R / Python code → genomics analytics; fine-grained impact; partial re-execution → restart trees (*). (*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018. London: Springer; 2018.
  • 12. SVI: data-diff and impact functions. Data-specific; process-specific. OMIM; ClinVar; overall impact.
  • 13. Diff functions for SVI. ClinVar 1/2016; ClinVar 1/2017; diff; (unchanged). Relational data → simple set difference.
  • 14. Example impact functions: SVI. Returns True iff: known variants have moved in/out of Red status; new Red variants have appeared; known Red variants have been retracted.
  • 15. ReComp decision matrix for SVI. Impact: yes / no / not assessed. Delta functions: data diff detected?
  • 17. SVI implemented using workflow. Steps: Phenotype to genes; Variant selection; Variant classification. Data: Phenotype; Patient variants; GeneMap; ClinVar; Classified variants.
  • 18. Execution trace / Provenance. [Figure: class diagram of the provenance model. Classes: User, Execution, Controller, Program, Workflow, Channel, Port. PROV stereotypes: «Entity», «Collection», «Usage», «Generation», «Association», «used», «wasGeneratedBy», «wasDerivedFrom», «wasInformedBy», «wasAssociatedWith», «hadMember», «hadPlan», «agent». Relations: wasPartOf, hasSubProgram, hasInPort, hasOutPort, hadInPort, hadOutPort, hadEntity, connectsTo, controlledBy, controls, hasDefaultParam.]
  • 19. SVI – restart trees. Overhead: caching intermediate data. Time savings, partial vs complete re-execution: change in GeneMap, 325 s partial vs 455 s complete (28.5% saving); change in ClinVar, 287 s partial vs 455 s complete (37% saving). Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
  • 20. Architecture. ReComp Core: runs the ReComp loop; reacts to change events; constructs restart trees. HDB («ProvONE store»): process and data provenance held as Prolog facts, behind a store/retrieve REST API. External services, reached through a REST API: Tabular-Diff Service (difference function, Diff Service interface); Impact Service B (impact function, Impact Service interface); ReExecution Service A (re-execution function, ReExec Service interface), which executes restart trees. User process: runtime environment, inputs, outputs.
  • 21. Customising ReComp in practice. Enable provenance capture / map to PROV.
  • 22. Summary. Generic framework. Fine-grained provenance + control → max savings. Evaluation on a case-by-case basis: cost savings; ease of customisation. Tested on two case studies: genomics; simulation (flood modelling) → see paper.
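
The set-difference diff (slide 13) and the binary impact function (slide 14) described for SVI can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Variant` record, the status labels, and the function names `diff`, `is_red` and `impact` are all assumptions made for illustration, and "Red" is taken here to mean a pathogenic classification. The sketch also leaves out the prior output of the execution, which the impact function in the talk takes as an additional argument.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical record for one row of a reference database version."""
    variant_id: str
    status: str  # assumed labels: "pathogenic" (Red), "benign", "uncertain"


def diff(old: Iterable[Variant], new: Iterable[Variant]):
    """Set difference between two database versions, keyed by variant_id.

    Returns (added, removed, changed); 'changed' holds (old, new) pairs
    for variants present in both versions whose record differs.
    """
    old_by_id = {v.variant_id: v for v in old}
    new_by_id = {v.variant_id: v for v in new}
    added = {new_by_id[k] for k in new_by_id.keys() - old_by_id.keys()}
    removed = {old_by_id[k] for k in old_by_id.keys() - new_by_id.keys()}
    changed = {
        (old_by_id[k], new_by_id[k])
        for k in old_by_id.keys() & new_by_id.keys()
        if old_by_id[k] != new_by_id[k]
    }
    return added, removed, changed


def is_red(v: Variant) -> bool:
    # Assumption: "Red" status corresponds to a pathogenic classification.
    return v.status == "pathogenic"


def impact(added, removed, changed) -> bool:
    """Binary impact: True iff a new Red variant appeared, a known Red
    variant was retracted, or a known variant moved in or out of Red."""
    new_red = any(is_red(v) for v in added)
    retracted_red = any(is_red(v) for v in removed)
    moved = any(is_red(before) != is_red(after) for before, after in changed)
    return new_red or retracted_red or moved
```

Calling `impact(*diff(old_version, new_version))` gives a yes/no answer for one pair of database versions; a change that only adds or removes non-Red variants returns False, so no re-computation would be triggered under this rule.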

Editor's Notes

  1. We are going to ignore BDA in this talk, and also simulation, although it is a case study.
  2. We are going to use this smaller process as a testbed. Changes in the reference databases have an impact on the classification.
  3. Threats: will any of the changes invalidate prior findings? Opportunities: can the findings be improved over time? Can we do better in a generic way? We need to control re-computation on two dimensions: across a population, and within a single process.
  4. Success criteria: performance, but this is assessed on a case-by-case basis, and ease of customization, which is the focus of this paper.
  5. The framework is a meta-process… Changes can also occur to the OS, libraries and other dependencies, but these are out of scope.
  6. The black box case is illustrated here and is less interesting. The more interesting SVI case is on the next slide.
  7. Impact functions are currently only binary.
  8. $\delta_4(\text{CV}^t, \text{CV}^{t'}) = \langle \delta_4^+, \delta_4^-, \delta_4^{\pm} \rangle$
  9. $\phi_5(\delta_4^+, \delta_4^-, \delta_4^{\pm}, \mathit{val}(o_5)) \in \{\text{True}, \text{False}\}$. Diff functions: $\delta_1(\text{GM}^t, \text{GM}^{t'}) = \langle \delta_1^+, \delta_1^-, \delta_1^{\pm} \rangle$ and $\delta_4(\text{CV}^t, \text{CV}^{t'}) = \langle \delta_4^+, \delta_4^-, \delta_4^{\pm} \rangle$. Impact functions: $\phi_1(\delta_1^+, \delta_1^-, \delta_1^{\pm})$ and $\phi_5(\delta_4^+, \delta_4^-, \delta_4^{\pm}, \mathit{val}(o_5))$. $\phi_5$ returns True iff: $\delta_4^-$ or $\delta_4^+$ includes a Red variant; or pathogenic status changed for any variant in $\delta_4^{\pm}$.
  10. $\delta_1, \delta_4$; $\phi_1, \phi_5$
  11. This shows the good case of a “grey box” workflow and box-level provenance: the SVI workflow with automated provenance recording. Cohort of about 100 exomes (neurological disorders). Changes in ClinVar and OMIM GeneMap.
  12. Shows the essential ProvONE fragment used by ReComp.
  13. How these two restart trees are discovered is explained in the two papers (IPAW, BDC).
  14. ReComp uses the difference and impact services to analyse the impact of the changes on past executions, and submits a subset of affected executions to rerun. HDB will have been discussed earlier. Facts are stored and queried using Prolog, through a store/retrieve REST API, with canned queries or ad hoc queries (advanced interface). Impact functions are realized as external services reachable through a REST API. The reExec function takes restart trees and executes them; this may not always be possible, and is in fact a major limitation for current systems. The ReComp loop produces recomp/no-recomp decisions at the level of each restart tree. Data diff is an additional external service.
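
The selection step described in note 14 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ReComp code: `PastExecution`, `recomp_loop` and `overlaps` are hypothetical names, the delta is assumed to be a triple of (added, removed, changed) identifier sets, and the impact and re-execution functions are passed in as plain callables, whereas the talk describes them as external services behind REST APIs. Decisions are made here per past execution rather than per restart tree.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Delta = Tuple[frozenset, frozenset, frozenset]  # (added, removed, changed)


@dataclass(frozen=True)
class PastExecution:
    """Hypothetical history record: an execution id and its prior output."""
    execution_id: str
    output: frozenset  # identifiers that appeared in the prior result


def recomp_loop(
    history: Iterable[PastExecution],
    delta: Delta,
    impact: Callable[[Delta, frozenset], bool],
    reexec: Callable[[PastExecution], None],
) -> List[str]:
    """Re-run only the past executions that the change is estimated to affect.

    Returns the ids of the executions submitted for re-execution.
    """
    if not any(delta):  # empty diff: nothing changed, nothing to assess
        return []
    selected = []
    for execution in history:
        # Binary recomp / no-recomp decision for this execution.
        if impact(delta, execution.output):
            reexec(execution)
            selected.append(execution.execution_id)
    return selected


def overlaps(delta: Delta, output: frozenset) -> bool:
    """Toy impact function: any changed identifier occurs in the output."""
    added, removed, changed = delta
    return bool((added | removed | changed) & output)
```

The point of the design is that `impact` is assumed to be much cheaper than `reexec`, so scanning the whole history with `impact` and re-running only the selected subset is where the savings would come from.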