Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

Paolo Missier and Jacek Cala
Newcastle University, UK
Universidad La Rioja, Spain
Oct 29th, 2019
Efficient Re-computation of Big Data Analytics Processes
in the Presence of Changes
In collaboration with
• Institute of Genetic Medicine, Newcastle University

2
Context
Big
Data
The Big
Analytics
Machine
Actionable
Knowledge
Analytics
Data Science over time V3
V2
V1
Meta-knowledge
Algorithms
Tools
Libraries
Reference
datasets
t
t
t

3
What changes?
• Genomics
• Reference databases
• Algorithms and libraries
• Simulation
• Large parameter space
• Input conditions
• Machine Learning
• Evolving ground truth datasets
• Model re-training

4
Genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure:
45 mins / GB
@ 13GB / exome: about 10 hours

6
Genomics: WES / WGS, Variant calling  Variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh,
M. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer
SVI: Simple Variant Interpretation
Variant classification : pathogenic, benign and unknown/uncertain

7
Changes that affect variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
Evolution in number of variants that affect patients
(a) with a specific phenotype
(b) Across all phenotypes

10
Blind reaction to change: a game of battleship
Sparsity issue:
• About 500 executions
• 33 patients
• total runtime about 60 hours
• Only 14 relevant output changes
detected
4.2 hours of computation per change
Should we care about updates?
Evolving knowledge about
gene variations

11
ReComp
http://recomp.org.uk/
Outcome:
A framework for selective Re-computation
• Generic, Customisable
Scope:
expensive analysis +
frequent changes +
not all changes significant
Challenge:
Make re-computation efficient in response
to changes
Assumptions:
Processes are
• Observable
• Reproducible
Estimates are cheap
Insight: replace re-computation with change impact estimation
Using history of past executions

12
Reproducibility
How
Selective:
- Across a cohort of past executions.  which subset of individuals?
- Within a single re-execution  which process fragments?
Change in
ClinVar
Change in
GeneMap
 Why, when, to what extent

13
The rest of the talk
• Approach,
• Exemplified on the SVI workflow
• Architecture

14
The ReComp meta-process
History
DB
Detect and
quantify
changes
data diff(d,d’)
Record
execution history
Analytics
Process P
Log / provenance
Partially
Re-exec
P (D) P(D’)
Change
Events
Changes:
• Reference datasets
• Inputs
For each past
instances:
Estimate impact
of changes
Impact(dd’, o) impact estimation functions
Scope
Select relevant
sub-processes
Optimisation

15
How much do we know about P?
Impact estimation
Re-execution
less more
Process structure
Execution trace
black box
I/O provenance
I/O only
All-or-nothing
monolithic process, legacy
 a complex simulator
white box
step-by-step provenance
workflows, R / python code
 genomics analyticsTypical process
Fine-grained Impact
Partial  restart trees (*)
(*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs.
IPAW 2018. London: Springer; 2018.

16
Recomp meta-process flow
PaoloMissier2019
Identify the subset of executions that are
potentially affected by the changes
Determine whether changes may have
had any impact on outputs
Identify and re-execute the
minimal fragments of workflow
that have been affected

17
Change Front
t
C0
{a1 → a0}
CF3
{a3, b1, c2}
CF5
{a3, b2, c2, d1}
C1
{b1 → b0}
C3
{a3 → a2, c2 → c1}
C4
{d1 → d0}
C5
{b2 → b1}
C2
{a2 → a1, c1 → c0}
E(…, [a0, b0, e0])
E(…, [a0, b1, d0])
E(…, [a2, b1, c1])

18
Re-computation Front
We use:
wasInformedBy(..., [prov:type=“recomp:re-execution”])
to denote a ReComp-initiated re-execution.

20
Re-computation Front
…
user-initiated

22
Restart Tree
Re-computation front handles single executions well.
What if the process is more complex than that?
pipeline, workflow, complex hierarchical workflow… cf. the NGS pipeline.

25
Restart Tree
 The provenance trace includes
multiple interrelated executions.
 During re-execution we have to
combine all of them within a single
context – the top-level execution.

26
Execution trace / Provenance
User Execution
«Association » «Usage» «Generation »
«Entity»
«Collection»
Controller Program
Workflow Channel
Port
wasPartOf
«hadMember »
«wasDerivedFrom »
hasSubProgram
«hadPlan »
controlledBy
controls[*]
[*]
[*]
[*] [*] [*]
«wasDerivedFrom »
[*][*]
[0..1]
[0..1]
[0..1]
[*][1]
[*]
[*]
[0..1]
[0..1]
hasOutPort [*][0..1]
[1]
«wasAssociatedWith »
«agent »
[1]
[0..1]
[*]
[*]
[*] [*]
[*] [*]
[*]
[*] [*]
[*]
[*]
[*]
[0..1]
[0..1]
hasInPort [*][0..1]
connectsTo
[*]
[0..1]
«wasInformedBy »
[*][1]
«wasGeneratedBy »
«qualifiedGeneration »
«qualifiedUsage »
«qualifiedAssociation »
hadEntity
«used »
hadOutPorthadInPort
[*][1]
[1] [1]
[1] [1]
hadEntity
hasDefaultParam

27
ProvONE
User Execution
«Association » «Usage» «Generation »
«
Controller Program
Workflow Channel
Port
wasPartOf
«wasDerivedFrom »
hasSubProgram
«hadPlan »
controlledBy
controls[*]
[*]
[*]
[*] [*] [
[0..
[0..1]
[*][1]
[*]
[*]
[0..1]
[0..1]
hasOutPort [*][0..1]
[1]
«wasAssociatedWith »
«agent »
[1]
[0..1]
[*]
[*]
[*] [*]
[*] [*]
[*]
[*] [*]
[*]
[*]
[*]
[0..1]
[0..1]
hasInPort [*][0..1]
connectsTo
[*]
[0..1]
«wasInformedBy »
[*][1]
«wasGeneratedBy »
«qualifiedGeneration »
«qualifiedUsage »
«qualifiedAssociation »
hadEntity
«used »
hadOutPorthadInPort
[*][1]
[1] [1]
[1]
hadEntity
hasDefaultP

28
Restart Tree
To build a restart tree we rely on the provone:wasPartOf statements.
CF = {b2, e1}

29
Restart Tree
Captures the vertical dimension of a single execution
the transitive closure of the wasPartOf relation.
CF = {b2, e1}

30
Recomp meta-process flow
PaoloMissier2019
Determine whether changes may have
had any impact on outputs

31
Changes, data diff, impact
1) Observed change events:
(inputs, dependencies, or both)
3) Impact of change C on output y:
2) Type-specific Diff functions:
Impact is process- and data-specific:

32
Impact
Given P (fixed), a change in one of the inputs to P: C={xx’} affects a single output:
However a change in one of the dependencies: C= {dd’}
affects all outputs yt where version d of D was used

33
SVI: data-diff and impact functions
- Data-specific
- Process-specificomim
clinvar
Overall
impact
impact on ‘p1 select genes’
impact on the SVI output

34
Diff functions for SVI
ClinVar
1/2016
ClinVar
1/2017
diff
(unchanged)
Relational data  simple set difference

35
Binary SVI impact function
Returns True iff:
- Known variants have moved in/out of Red status
- New Red variants have appeared
- Known Red variants have been retracted

36
Impact: “semantic “ example
Scope: which cases are affected?
- Individual variants have an associated phenotype.
- Patient cases also have a phenotype
“a change in variant v can only have impact on a case X if V and X share
the same phenotype”
Importance: “Any variant with status moving from/to Red causes High impact on
any X that is affected by the variant”

37
Change impact analysis algorithm
PaoloMissier2019
Aim:
To identify the minimal subset of observed changes that have an actual effect on past outcomes
This is done by progressively eliminating changes for which impact has been estimated as null
Intuition:
- From the workflow, derive an impact graph
- This is a new type of dataflow where execution semantics is designed to
- Propagate input changes
- Compute diff functions
- Compute impact functions on diffs
- When impact is null, eliminate changes from the inputs
- Input: set of changes, eg
- Output: set of bindings that indicates which changes are relevant and have non-zero impact
on the process

38
Role of provenance
PaoloMissier2019
Impact facts:
- During each execution ReComp records port-data bindings for all the data that flow
through annotated input and output ports
- Each impact function is able to use some of these bindings as its own inputs
- These are the impact facts that the function is evaluated on
- To find these bindings, traverse the dependencies of impact to diff functions

39
ReComp decision matrix for SVI
Impact: yes / no / not assessed
delta functions: data diff detected?

40
Empirical validation
PaoloMissier2019
re-executions 495  71 Ideal: 14
But: no false negatives

41
SVI implemented using workflow
Phenotype to genes
Variant selection
Variant classification
Patient
variants
GeneMap
ClinVar
Classified
variants
Phenotype

42
History DB: Workflow Provenance
Each invocation of an eSC workflow generates a provenance trace
“plan”
“plan
execution”
WF
B1 B2
B1exec B2exec
Data
WFexec
partOf
partOf
usagegeneration
association association
association
db
usage
ProgramWorkflow
Execution
Entity
(ref data)

43
Partial re-execution
1. Change detection: A provenance fact indicates that a new version Dnew of
database d is available wasDerivedFrom(“db”,Dnew)
:- execution(WFexec), wasPartOf(Xexec,WFexec), used(Xexec, “db”)
2.1 Find the entry point(s) into the workflow, where db was used
:- execution(WFexec), execution(B1exec), execution(B2exec),
wasPartOf(B1exec, WFexec), wasPartOf(B1exec, WFexec),
wasGeneratedBy(Data, B1exec), used(B2exec,Data)
2.2 Discover the rest of the sub-workflow graph (execute recursively)
2. Reacting to the change:
Provenance
pattern:
“plan”
“plan
execution”
Ex. db = “ClinVar v.x”
WF
B1 B2
B1exec B2exec
Data
WFexec
partOf
partOf
usagegeneration
association association
association
db
usage

44
SVI – partial re-execution
Overhead:
caching
intermediate data
Time savings Partial re-exec (sec) Complete re-exec Time saving (%)
GeneMap 325 455 28.5
ClinVar 287 455 37
Change in
ClinVar
Change in
GeneMap
Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.

45
Architecture
<eventname>
ReComp Core
HDB
«ProvONE store»
Tabular-Diff
Service
Tabular-Diff
Service
Difference
Function
ReExecution
Service A
ReExecution
Service A
ReExecution
Function
Impact
Service B
Impact
Service B
Impact
Function
ReComp
Loop
User
Process
Runtime Environment
Inputs Outputs
Interface
Di f f Se r vi c e
Interface
I mpa c t Se r vi c e
Interface
Re Exe c Se r vi c e
Process and data provenance Prolog facts
store/retrieve
REST API
External services
REST API
Executes restart
trees
- React to
change events
- Construct
restart trees

46
Customising ReComp in practice
<eventname>
Enable
provenance
capture /
Map to PROV

47
Summary
<eventname>
Evaluation: case-by-case basis
- Cost savings
- Ease of customisation
Generic framework
Black box  gray box
Fine-grained provenance + control  max savings
Tested on two cases studies
- Genomics
- Flood Simulation (not in this talk)

Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

Semelhante a Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes (20)

Mais de Paolo Missier

Mais de Paolo Missier (20)

Último

Último (20)

Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

Notas do Editor