Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes
1. Paolo Missier and Jacek Cala
Newcastle University, UK
IEEE Big Data Congress
Milan, Italy
July 8th, 2019
In collaboration with
• Institute of Genetic Medicine, Newcastle University
• School of GeoSciences, Newcastle University
3. What changes?
• Genomics
• Reference databases
• Algorithms and libraries
• Simulation
• Large parameter space
• Input conditions
• Machine Learning
• Evolving ground truth datasets
• Model re-training
4. Genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 min / GB; at 13 GB per exome, about 10 hours
5. Genomics: WES / WGS, variant calling → variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
SVI: Simple Variant Interpretation
Variant classification: pathogenic, benign, and unknown/uncertain
6. Blind reaction to change: a game of battleship
Sparsity issue:
• about 500 executions over 33 patients, total runtime about 60 hours
• only 14 relevant output changes detected
• 4.2 hours of computation per detected change
Should we care about updates? Knowledge about gene variations keeps evolving.
7. ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent changes +
not all changes significant
Challenge: make re-computation efficient in response to changes
Assumptions:
• Processes are observable
• Processes are reproducible
• Impact estimates are cheap
Insight: replace re-computation with change impact estimation, using the history of past executions
8. Reproducibility
How? Selectively:
- Across a cohort of past executions: which subset of individuals?
- Within a single re-execution: which process fragments?
Why, when, and to what extent? (Example triggers: a change in ClinVar, a change in GeneMap.)
9. The rest of the talk
• Approach
• Architecture
• Evaluation (case study)
10. The ReComp meta-process
Changes (change events): reference datasets, inputs.
The meta-process loop:
• Record the execution history of analytics process P (logs / provenance) in the History DB
• Detect and quantify changes: data diff(d, d')
• For each past instance, estimate the impact of changes: Impact(d→d', o), via impact estimation functions
• Scope: select relevant sub-processes (optimisation)
• Partially re-execute: P(D) → P(D')
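The loop above can be sketched in a few lines of Python. This is a hypothetical illustration, not the ReComp API: the helper names (`diff`, `estimate_impact`, `partial_reexec`) and the set-based toy data are made up for the example.

```python
# Hypothetical sketch of the ReComp meta-process loop; every name here
# (diff, estimate_impact, partial_reexec) is illustrative, not the
# actual ReComp API.

def recomp_loop(change_event, past_executions, diff, estimate_impact, partial_reexec):
    """React to one change event (d, d') over a reference dataset or input."""
    d, d_new = change_event
    delta = diff(d, d_new)                      # detect and quantify the change
    if not delta:                               # no change: nothing to re-run
        return []
    rerun = []
    for outputs in past_executions:             # each past instance of P
        if estimate_impact(delta, outputs):     # scope: only impacted instances
            rerun.append(partial_reexec(outputs, delta))
    return rerun

# Toy instantiation: a reference set gains record 3; only executions
# whose outputs touch an affected record are re-executed.
decisions = recomp_loop(
    change_event=({1, 2}, {1, 2, 3}),
    past_executions=[{3, 5}, {4}],
    diff=lambda old, new: new ^ old,            # symmetric set difference
    estimate_impact=lambda delta, outputs: bool(delta & outputs),
    partial_reexec=lambda outputs, delta: ("re-run", outputs),
)
# decisions == [("re-run", {3, 5})]
```

The point of the sketch is the shape of the loop: the expensive step (re-execution) only runs for instances that pass the cheap impact estimate.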
11. How much do we know about P?
The less we know about the process structure and execution trace, the more we must re-execute; the more we know, the finer-grained the impact estimation:
• Black box (I/O only): monolithic or legacy process, e.g. a complex simulator → all-or-nothing re-execution
• White box (step-by-step provenance): workflows, R / Python code, e.g. genomics analytics (the typical process here) → fine-grained impact, partial restart trees (*)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs.
IPAW 2018. London: Springer; 2018.
13. Diff functions for SVI
diff(ClinVar 1/2016, ClinVar 1/2017): for relational data this is a simple set difference, separating changed records from unchanged ones.
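A set-difference diff over keyed relational records might look like the following. The record shapes (a map from variant id to clinical significance) are invented for the example; any keyed relational snapshot works the same way.

```python
def record_diff(old, new):
    """Set-difference diff for relational data keyed by record id.

    `old` and `new` map a record key (e.g. a variant id) to its row.
    Returns the added, removed, and changed record keys; everything
    else is unchanged.
    """
    old_keys, new_keys = set(old), set(new)
    added   = new_keys - old_keys
    removed = old_keys - new_keys
    changed = {k for k in old_keys & new_keys if old[k] != new[k]}
    return added, removed, changed

# e.g. two ClinVar snapshots reduced to {variant_id: clinical_significance}
clinvar_2016 = {"v1": "benign", "v2": "pathogenic", "v3": "uncertain"}
clinvar_2017 = {"v1": "benign", "v2": "benign", "v4": "pathogenic"}
added, removed, changed = record_diff(clinvar_2016, clinvar_2017)
# added == {"v4"}, removed == {"v3"}, changed == {"v2"}
```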
14. Example impact functions: SVI
Returns True iff:
- Known variants have moved in/out of Red status
- New Red variants have appeared
- Known Red variants have been retracted
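The three conditions above can be sketched as a boolean impact function. This is a hedged reconstruction: the record layout (variant id mapped to a status string, with "red" marking pathogenic variants in SVI's traffic-light classification) is an assumption, not SVI's actual data model.

```python
def svi_impact(old, new, patient_variants):
    """Sketch of the SVI impact test described on the slide.

    `old` / `new` map variant id -> status; "red" marks pathogenic
    variants (assumed encoding, not SVI's actual schema).
    Returns True iff, among the variants relevant to this patient:
    - a known variant moved in or out of "red" status,
    - a new "red" variant appeared, or
    - a known "red" variant was retracted.
    """
    for v in patient_variants:
        was, now = old.get(v), new.get(v)
        if was is None and now == "red":      # new red variant appeared
            return True
        if was == "red" and now is None:      # known red variant retracted
            return True
        if was is not None and now is not None and (was == "red") != (now == "red"):
            return True                       # moved in/out of red status
    return False
```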
15. ReComp decision matrix for SVI
Two signals feed the matrix: whether the delta (data diff) functions detected a change, and the estimated impact (yes / no / not assessed).
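One plausible reading of the matrix as code. The exact cell values are not on the slide, so treat the mapping below as an illustrative guess at the policy, not the published matrix:

```python
def decide(diff_detected, impact):
    """Possible re-computation decision given the two signals.

    `impact` is "yes", "no", or "not assessed". Illustrative guess:
    re-run when a change was detected and its impact is either
    confirmed or could not be assessed.
    """
    if not diff_detected:
        return "skip"            # no data diff detected: nothing to do
    if impact == "no":
        return "skip"            # change detected but deemed irrelevant
    return "re-execute"          # impact "yes" or "not assessed"
```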
19. SVI – restart trees
Overhead: caching intermediate data.
Time savings, restart trees for a change in GeneMap and a change in ClinVar:

Change in   Partial re-exec (s)   Complete re-exec (s)   Time saving (%)
GeneMap     325                   455                    28.5
ClinVar     287                   455                    37
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
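The intuition behind a restart tree is that step-level provenance lets us re-run only the steps downstream of a changed input. A minimal sketch over a dependency DAG; the graph encoding and the toy SVI-like pipeline are mine, not ReComp's ProvONE model:

```python
from collections import deque

def restart_set(deps, changed_inputs):
    """Steps to re-execute: everything reachable downstream of a changed input.

    `deps` maps each node (input or step) to the steps that consume it,
    so edges point downstream. Toy encoding, not ReComp's ProvONE model.
    """
    to_rerun, frontier = set(), deque(changed_inputs)
    while frontier:
        node = frontier.popleft()
        for step in deps.get(node, []):
            if step not in to_rerun:
                to_rerun.add(step)
                frontier.append(step)
    return to_rerun

# Invented SVI-like pipeline: ClinVar feeds classification only, so a
# ClinVar change skips the upstream phenotype-mapping step entirely.
deps = {
    "ClinVar": ["classify"],
    "GeneMap": ["map_phenotype"],
    "map_phenotype": ["classify"],
    "classify": ["report"],
    "report": [],
}
# restart_set(deps, ["ClinVar"]) == {"classify", "report"}
```

This matches the shape of the measured savings above: a GeneMap change forces a longer restart tree (and hence a smaller saving) than a ClinVar change.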
22. Summary
Generic framework: fine-grained provenance + control → max savings.
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Tested on two case studies:
- Genomics
- Simulation (flood modelling): see paper
Editor's Notes
We are going to ignore BDA in this talk
And also simulation although it’s a case study
We are going to use this smaller process as a testbed
Changes in the reference databases have an impact on the classification
Threats: Will any of the changes invalidate prior findings?
Opportunities: Can the findings be improved over time?
Can we do better in a generic way?
We need to control re-computation on two dimensions
Across a population
Within a single process
Success criteria:
performance, but this is on a case-by-case basis
Ease of customization, the focus of this paper
The framework is a meta-process…
Changes can also occur to OS, libraries and other dependencies but these are out of scope
The black box case is illustrated here and is less interesting.
The more interesting SVI case is in the next slide
This shows the good case of a “grey box” workflow and box-level provenance
SVI workflow with automated provenance recording
Cohort of about 100 exomes (neurological disorders)
Changes in ClinVar and OMIM GeneMap
Shows Essential ProvONE fragment used by ReComp
How these two restart trees are discovered is explained in the two papers
IPAW
BDC
ReComp uses difference and impact services to analyse the impact of the changes on past executions, and submits a subset of affected executions to rerun.
The History DB (HDB) will have been discussed earlier
Facts stored and queried using Prolog
store/retrieve REST API. Canned queries or ad hoc queries (advanced interface)
Impact functions realized as external services reachable through a REST API
The reExec function takes restart trees and executes them. This may not always be possible; in fact it is a major limitation for current systems
The ReComp loop produces recomp/no-recomp decisions at the level of each restart tree
Data diff is an additional external service