Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes
1. Paolo Missier and Jacek Cala
Newcastle University, UK
IEEE Big Data Congress
Milan, Italy
July 8th, 2019
In collaboration with
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University
3. What changes?
• Genomics
• Reference databases
• Algorithms and libraries
• Simulation
• Large parameter space
• Input conditions
• Machine Learning
• Evolving ground truth datasets
• Model re-training
4. Genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 min / GB; at 13 GB per exome, about 10 hours
5. Genomics: WES / WGS, variant calling → variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
SVI: Simple Variant Interpretation
Variant classification: pathogenic, benign, and unknown/uncertain
6. Blind reaction to change: a game of battleship
Sparsity issue:
• about 500 executions over 33 patients, total runtime about 60 hours
• only 14 relevant output changes detected
• 4.2 hours of computation per detected change
Should we care about updates? Knowledge about gene variations keeps evolving.
7. ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent changes +
not all changes significant
Challenge: make re-computation efficient in response to changes
Assumptions:
• Processes are observable
• Processes are reproducible
• Impact estimates are cheap
Insight: replace re-computation with change impact estimation, using the history of past executions
8. Reproducibility
How? Selectively:
- Across a cohort of past executions: which subset of individuals?
- Within a single re-execution: which process fragments?
Why, when, and to what extent? (Example triggers: a change in ClinVar, a change in GeneMap.)
9. The rest of the talk
• Approach
• Architecture
• Evaluation (case study)
10. The ReComp meta-process
Changes (change events): reference datasets, inputs.
The meta-process loop:
• Record the execution history of analytics process P (logs / provenance) in the History DB
• Detect and quantify changes: data diff(d, d')
• For each past instance, estimate the impact of changes: Impact(d→d', o), via impact estimation functions
• Scope: select relevant sub-processes (optimisation)
• Partially re-execute: P(D) → P(D')
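The loop above can be sketched in a few lines of Python. This is a hypothetical illustration, not the ReComp API: the helper names (`diff`, `estimate_impact`, `partial_reexec`) and the set-based toy data are made up for the example.

```python
# Hypothetical sketch of the ReComp meta-process loop; every name here
# (diff, estimate_impact, partial_reexec) is illustrative, not the
# actual ReComp API.

def recomp_loop(change_event, past_executions, diff, estimate_impact, partial_reexec):
    """React to one change event (d, d') over a reference dataset or input."""
    d, d_new = change_event
    delta = diff(d, d_new)                      # detect and quantify the change
    if not delta:                               # no change: nothing to re-run
        return []
    rerun = []
    for outputs in past_executions:             # each past instance of P
        if estimate_impact(delta, outputs):     # scope: only impacted instances
            rerun.append(partial_reexec(outputs, delta))
    return rerun

# Toy instantiation: a reference set gains record 3; only executions
# whose outputs touch an affected record are re-executed.
decisions = recomp_loop(
    change_event=({1, 2}, {1, 2, 3}),
    past_executions=[{3, 5}, {4}],
    diff=lambda old, new: new ^ old,            # symmetric set difference
    estimate_impact=lambda delta, outputs: bool(delta & outputs),
    partial_reexec=lambda outputs, delta: ("re-run", outputs),
)
# decisions == [("re-run", {3, 5})]
```

The point of the sketch is the shape of the loop: the expensive step (re-execution) only runs for instances that pass the cheap impact estimate.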
11. How much do we know about P?
The less we know about the process structure and execution trace, the more we must re-execute; the more we know, the finer-grained the impact estimation:
• Black box (I/O only): monolithic or legacy process, e.g. a complex simulator → all-or-nothing re-execution
• White box (step-by-step provenance): workflows, R / Python code, e.g. genomics analytics (the typical process here) → fine-grained impact, partial restart trees (*)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs.
IPAW 2018. London: Springer; 2018.
13. Diff functions for SVI
diff(ClinVar 1/2016, ClinVar 1/2017): for relational data this is a simple set difference, separating changed records from unchanged ones.
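A set-difference diff over keyed relational records might look like the following. The record shapes (a map from variant id to clinical significance) are invented for the example; any keyed relational snapshot works the same way.

```python
def record_diff(old, new):
    """Set-difference diff for relational data keyed by record id.

    `old` and `new` map a record key (e.g. a variant id) to its row.
    Returns the added, removed, and changed record keys; everything
    else is unchanged.
    """
    old_keys, new_keys = set(old), set(new)
    added   = new_keys - old_keys
    removed = old_keys - new_keys
    changed = {k for k in old_keys & new_keys if old[k] != new[k]}
    return added, removed, changed

# e.g. two ClinVar snapshots reduced to {variant_id: clinical_significance}
clinvar_2016 = {"v1": "benign", "v2": "pathogenic", "v3": "uncertain"}
clinvar_2017 = {"v1": "benign", "v2": "benign", "v4": "pathogenic"}
added, removed, changed = record_diff(clinvar_2016, clinvar_2017)
# added == {"v4"}, removed == {"v3"}, changed == {"v2"}
```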
14. Example impact functions: SVI
Returns True iff:
- Known variants have moved in/out of Red status
- New Red variants have appeared
- Known Red variants have been retracted
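The three conditions above can be sketched as a boolean impact function. This is a hedged reconstruction: the record layout (variant id mapped to a status string, with "red" marking pathogenic variants in SVI's traffic-light classification) is an assumption, not SVI's actual data model.

```python
def svi_impact(old, new, patient_variants):
    """Sketch of the SVI impact test described on the slide.

    `old` / `new` map variant id -> status; "red" marks pathogenic
    variants (assumed encoding, not SVI's actual schema).
    Returns True iff, among the variants relevant to this patient:
    - a known variant moved in or out of "red" status,
    - a new "red" variant appeared, or
    - a known "red" variant was retracted.
    """
    for v in patient_variants:
        was, now = old.get(v), new.get(v)
        if was is None and now == "red":      # new red variant appeared
            return True
        if was == "red" and now is None:      # known red variant retracted
            return True
        if was is not None and now is not None and (was == "red") != (now == "red"):
            return True                       # moved in/out of red status
    return False
```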
15. ReComp decision matrix for SVI
Two signals feed the matrix: whether the delta (data diff) functions detected a change, and the estimated impact (yes / no / not assessed).
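One plausible reading of the matrix as code. The exact cell values are not on the slide, so treat the mapping below as an illustrative guess at the policy, not the published matrix:

```python
def decide(diff_detected, impact):
    """Possible re-computation decision given the two signals.

    `impact` is "yes", "no", or "not assessed". Illustrative guess:
    re-run when a change was detected and its impact is either
    confirmed or could not be assessed.
    """
    if not diff_detected:
        return "skip"            # no data diff detected: nothing to do
    if impact == "no":
        return "skip"            # change detected but deemed irrelevant
    return "re-execute"          # impact "yes" or "not assessed"
```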
19. SVI – restart trees
Overhead: caching intermediate data.
Time savings, restart trees for a change in GeneMap and a change in ClinVar:

Change in   Partial re-exec (s)   Complete re-exec (s)   Time saving (%)
GeneMap     325                   455                    28.5
ClinVar     287                   455                    37
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
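The intuition behind a restart tree is that step-level provenance lets us re-run only the steps downstream of a changed input. A minimal sketch over a dependency DAG; the graph encoding and the toy SVI-like pipeline are mine, not ReComp's ProvONE model:

```python
from collections import deque

def restart_set(deps, changed_inputs):
    """Steps to re-execute: everything reachable downstream of a changed input.

    `deps` maps each node (input or step) to the steps that consume it,
    so edges point downstream. Toy encoding, not ReComp's ProvONE model.
    """
    to_rerun, frontier = set(), deque(changed_inputs)
    while frontier:
        node = frontier.popleft()
        for step in deps.get(node, []):
            if step not in to_rerun:
                to_rerun.add(step)
                frontier.append(step)
    return to_rerun

# Invented SVI-like pipeline: ClinVar feeds classification only, so a
# ClinVar change skips the upstream phenotype-mapping step entirely.
deps = {
    "ClinVar": ["classify"],
    "GeneMap": ["map_phenotype"],
    "map_phenotype": ["classify"],
    "classify": ["report"],
    "report": [],
}
# restart_set(deps, ["ClinVar"]) == {"classify", "report"}
```

This matches the shape of the measured savings above: a GeneMap change forces a longer restart tree (and hence a smaller saving) than a ClinVar change.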
22. Summary
Generic framework: fine-grained provenance + control → max savings.
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Tested on two case studies:
- Genomics
- Simulation (flood modelling): see paper
Editor's Notes
We are going to ignore BDA in this talk
And also simulation although it’s a case study
We are going to use this smaller process as a testbed
Changes in the reference databases have an impact on the classification
Threats: Will any of the changes invalidate prior findings?
Opportunities: Can the findings be improved over time?
Can we do better in a generic way?
We need to control re-computation on two dimensions
Across a population
Within a single process
Success criteria:
performance, but this is on a case-by-case basis
Ease of customization, the focus of this paper
The framework is a meta-process…
Changes can also occur to OS, libraries and other dependencies but these are out of scope
The black box case is illustrated here and is less interesting.
The more interesting SVI case is in the next slide
This shows the good case of a “grey box” workflow and box-level provenance
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
Shows Essential ProvONE fragment used by ReComp
How these two restart trees are discovered is explained in the two papers
IPAW
BDC
ReComp uses difference and impact services to analyse the impact of the changes on past executions, and submits a subset of affected executions to rerun.
The History DB (HDB) will have been discussed earlier
Facts stored and queried using Prolog
store/retrieve REST API. Canned queries or ad hoc queries (advanced interface)
Impact functions realized as external services reachable through a REST API
The reExec function takes restart trees and executes them. This may not always be possible; in fact it is a major limitation for current systems
The ReComp loop produces recomp/no-recomp decisions at the level of each restart tree
Data diff is an additional external service