ReComp – Keele University
Dec. 2016 – P. Missier
ReComp:
Preserving the value of large scale data analytics over time
through selective re-computation
recomp.org.uk
Paolo Missier, Jacek Cala, Manisha Rathi
School of Computing Science
Newcastle University
Keele University, Dec. 2016
Panta Rhei (Heraclitus)
(Painting by Johannes Moreelse)
Example: supervised learning
[Figure: supervised learning pipeline — a training set plus background knowledge (prior) feed classification algorithms; model learning produces a predictive classifier (meta-knowledge).]
When the training set is no longer representative of current data, the model loses predictive power.
Example: the training set is a sample from a social media stream (Twitter, Instagram, …).
• Incremental training: established (neural networks [2], SVMs [3], Bayes classifiers, …)
• Incremental unlearning: some established work [1]
[1] Kidera, T., S. Ozawa, and S. Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2006: 6453–59. doi:10.1109/IJCNN.2006.247345.
[2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C 31, no. 4 (2001): 497–508. doi:10.1109/5326.983933.
[3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the International Joint Conference on Neural Networks, 2003: 2685–90. doi:10.1109/IJCNN.2003.1223991.
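As a minimal illustration of the incremental-training idea (a sketch, not the ensemble algorithm of [1]), the classifier below maintains per-class running mean and variance with Welford's online algorithm, so a Gaussian naive-Bayes-style model can absorb one labelled sample at a time without retraining from scratch. The class name, API, and 1-D feature are all invented for illustration.

```python
import math
from collections import defaultdict

class IncrementalGaussianNB:
    """1-D Gaussian naive Bayes updated one labelled sample at a time."""
    def __init__(self):
        # per-class running statistics (Welford's online algorithm)
        self.n = defaultdict(int)       # sample count
        self.mean = defaultdict(float)  # running mean
        self.m2 = defaultdict(float)    # sum of squared deviations

    def partial_fit(self, x, label):
        self.n[label] += 1
        delta = x - self.mean[label]
        self.mean[label] += delta / self.n[label]
        self.m2[label] += delta * (x - self.mean[label])

    def predict(self, x):
        def log_lik(c):
            var = self.m2[c] / max(self.n[c] - 1, 1) + 1e-9
            prior = self.n[c] / sum(self.n.values())
            return math.log(prior) - 0.5 * (math.log(2 * math.pi * var)
                                            + (x - self.mean[c]) ** 2 / var)
        return max(self.n, key=log_lik)

clf = IncrementalGaussianNB()
for x in (0.9, 1.1, 1.0):
    clf.partial_fit(x, "A")
for x in (4.8, 5.2, 5.0):
    clf.partial_fit(x, "B")
print(clf.predict(1.05))  # "A" — closest to class A's running mean
```

The same pattern extends to unlearning by subtracting a sample's contribution from the running statistics.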
Analytics functions and their dependencies can be complex

Y = f(X, D), where:
- X: inputs (vector of arbitrary data structures, “big data”)
- D: vector of dependencies (libraries, reference data)
- Y: outputs (vector of arbitrary data structures, “knowledge”)
Example 1: machine learning — learn a model to recognise activity patterns, using Python and scikit-learn. Dependencies: scikit-learn, NumPy, Pandas, Python 3, Ubuntu on an Azure VM. Inputs: training + testing dataset, configuration. Output: the trained model.
Example 2: a workflow to identify mutations in a patient’s genome. Dependencies: GATK/Picard/BWA, the workflow manager (and its own dependencies), Ubuntu on an Azure Linux VM cluster. Inputs: input genome, configuration, reference genome, variant databases. Output: the identified variants.
Complex NGS pipelines

- Align: aligns the sample sequence to the HG19 reference genome using the BWA aligner.
- Clean: cleaning and duplicate elimination (Picard tools).
- Recalibrate alignments: corrects for systematic bias in the quality scores assigned by the sequencer (GATK).
- Calculate coverage: computes the coverage of each read.
- Call variants: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels.
- Recalibrate variants: attempts to reduce the false positive rate from the caller.
- Filter variants: VCF subsetting by filtering, e.g. non-exomic variants.
- Annotate: Annovar functional annotations (e.g. MAF, synonymity, SNPs, …) followed by in-house annotations.
[Figure: three-stage pipeline. Stage 1 (per sample): raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information. Stage 2 (across samples): call variants → recalibrate variants. Stage 3: filter variants → annotate → annotated variants.]
Problem size: HPC vs Cloud deployment

Configuration:
- HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
- Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory, and 400 GB SSD
[Figure: response time [hh:mm] vs number of samples (0–24), for 3 engines (24 cores), 6 engines (48 cores), and 12 engines (96 cores).]
Big Data:
- raw sequences for Whole Exome Sequencing (WES): 5–20 GB per patient
- processed in cohorts of 20–40 samples, or close to 1 TB per cohort
- the time required to process a 24-sample cohort can easily exceed 2 CPU-months
- WES is about 2% of what Whole Genome Sequencing analyses require
Understanding change: threats and opportunities
[Figure: big data feeds life-sciences analytics, producing “valuable knowledge” (versions V1, V2, V3 over time t), supported by evolving meta-knowledge: algorithms, tools, middleware, reference datasets.]
• Threats: will any of the changes invalidate prior findings?
• Opportunities: can the findings from the pipelines be improved over time?
• Cost: we need to model future costs based on past history and pricing trends for virtual appliances.
• Impact analysis:
  • Which patients/samples are likely to be affected?
  • How do we estimate the potential benefits for affected patients?
  • Can we estimate the impact of these changes without re-computing entire cohorts?
Changes:
• algorithms and tools
• accuracy of input sequences
• reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, …)
ReComp

1. Observe change: in big data, in meta-knowledge
2. Assess and measure: knowledge decay
3. Estimate: cost and benefits of refresh
4. Enact: reproduce the (analytics) processes
A decision support system for selectively re-computing complex analytics in reaction to change:
- generic: not just for the life sciences
- customisable: e.g. for genomics pipelines
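The observe / assess / estimate / enact cycle can be sketched as a decision loop. Everything here (function names, the impact-per-cost ranking, the toy inputs) is illustrative, not the actual ReComp implementation:

```python
def recomp_loop(outcomes, change_event, estimate_impact, estimate_cost,
                budget, rerun):
    """Selective re-computation: react to one change event by refreshing
    only the outcomes whose estimated benefit justifies their cost."""
    # 1. Observe: a change event arrives (new data or meta-knowledge version)
    # 2. Assess: estimate knowledge decay (impact) per outcome
    scored = [(estimate_impact(change_event, o), estimate_cost(o), o)
              for o in outcomes]
    # 3. Estimate: rank by impact per unit cost, spend within the budget
    scored.sort(key=lambda t: t[0] / t[1], reverse=True)
    refreshed = []
    for impact, cost, o in scored:
        if impact > 0 and cost <= budget:
            budget -= cost
            refreshed.append(rerun(o))  # 4. Enact: reproduce the process
    return refreshed

# toy usage: outcomes are ids; impact is 1 for odd ids; each re-run costs 1
out = recomp_loop([1, 2, 3, 4], "new-release",
                  lambda ev, o: o % 2, lambda o: 1.0, budget=2.0,
                  rerun=lambda o: f"refreshed-{o}")
print(out)  # ['refreshed-1', 'refreshed-3']
```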
Challenges

1. Observability: to what extent can we observe the process and its execution?
• Process structure
• Data flow provenance

2. Detecting and quantifying changes:
• In inputs, dependencies, and outputs: diff() functions

3. Control: how much control do we have over the system?
• Re-run: how often?
• Total vs partial execution
• Input density / resolution / incremental update
• E.g. non-monotonic learning / unlearning

[Figure: the ReComp decision support system takes change events, diff() functions, “business rules”, and a history of past knowledge assets, and produces an optimal re-computation prioritisation, impact and cost estimates, and a reproducibility assessment.]
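A diff() function over two versions of a reference dataset might, as a minimal sketch, report added, removed, and updated records keyed by id. The record shape (variant id → classification) is invented for illustration:

```python
def diff(old, new):
    """Compare two dataset versions, each a dict {record_id: record}."""
    added   = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "updated": updated}

# toy versions of a variant-classification table
v1 = {"rs1": "benign", "rs2": "pathogenic", "rs3": "VUS"}
v2 = {"rs1": "benign", "rs2": "benign", "rs4": "pathogenic"}
d = diff(v1, v2)
print(len(d["added"]), len(d["removed"]), len(d["updated"]))  # 1 1 1
```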
Estimators: formalisation and a possible approach

Problem: f() is computationally expensive.
Approach: learn an approximation f’() of f(): a surrogate (emulator).

Sensitivity analysis: given a model y = f(x) + ε, assess how changes in x (and local changes in dependencies) affect y, where ε is a stochastic term that accounts for the error in approximating f, and is typically assumed to be Gaussian.

Learning f’() requires a training set { (xi, yi) }. If f’() can be found, then we can hope to use it to approximate f on new inputs, which can then be used to carry out sensitivity analysis.
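The surrogate idea can be made concrete with a small sketch: fit a cheap quadratic f’ to a few evaluations of an “expensive” f by least squares, then query the surrogate instead of f. The target function, design points, and polynomial form are all assumptions for illustration:

```python
def fit_quadratic_surrogate(xs, ys):
    """Least-squares surrogate f'(x) = a + b*x + c*x^2 (normal equations)."""
    s = [sum(x ** p for x in xs) for p in range(5)]        # sums of x^0..x^4
    A = [[s[i + j] for j in range(3)] for i in range(3)]   # X^T X
    rhs = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]  # X^T y
    # solve the 3x3 system by Gaussian elimination with partial pivoting
    M = [row + [r] for row, r in zip(A, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    coef = [0.0] * 3
    for i in (2, 1, 0):
        coef[i] = (M[i][3] - sum(M[i][j] * coef[j]
                                 for j in range(i + 1, 3))) / M[i][i]
    a, b, c = coef
    return lambda x: a + b * x + c * x * x

# pretend f is expensive; here it is x^2 + x evaluated on a small design
expensive_f = lambda x: x * x + x
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
f_surr = fit_quadratic_surrogate(xs, [expensive_f(x) for x in xs])
print(round(f_surr(1.7), 6))  # ≈ expensive_f(1.7) = 4.59
```

In practice one would use a richer emulator (e.g. a Gaussian process), but the workflow is the same: train on a small design, then probe the surrogate.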
Baseline: blind recomputation

- 17 minutes per patient (single-core VM)
- Runtime is consistent across different phenotypes
- Changes to GeneMap/ClinVar have negligible impact on the execution time

Run time [mm:ss] by GeneMap version:

GeneMap version | 2016-03-08 | 2016-04-28 | 2016-06-07
μ ± σ           | 17:05 ± 22 | 17:09 ± 15 | 17:10 ± 17
Partial re-computation using input difference

Idea: run SVI, but replace the ClinVar query with a query on the ClinVar version diff:
Q(CV) → Q(diff(CV1, CV2))

This works for SVI, but is hard to generalise: it depends on the type of process.
The bigger gain: diff(CV1, CV2) is much smaller than CV2.
GeneMap versions (from → to) | ToVersion rec. count | Difference rec. count | Reduction
16-03-08 → 16-06-07 | 15910 | 1458 | 91%
16-03-08 → 16-04-28 | 15871 | 1386 | 91%
16-04-28 → 16-06-01 | 15897 | 78   | 99.5%
16-06-01 → 16-06-02 | 15897 | 2    | 99.99%
16-06-02 → 16-06-07 | 15910 | 33   | 99.8%

ClinVar versions (from → to) | ToVersion rec. count | Difference rec. count | Reduction
15-02 → 16-05 | 290815 | 38216 | 87%
15-02 → 16-02 | 285042 | 35550 | 88%
16-02 → 16-05 | 290815 | 3322  | 98.9%
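The Q(CV) → Q(diff(CV1, CV2)) rewrite can be sketched as follows: instead of re-scanning the full new ClinVar version, update the cached query answer using only the changed records. The record shape, diff structure, and query predicate are invented for illustration:

```python
def query(records, predicate):
    """Q(CV): select variant ids whose record satisfies the predicate."""
    return {vid for vid, rec in records.items() if predicate(rec)}

def query_on_diff(cached, d, predicate):
    """Q(diff(CV1, CV2)): update cached Q(CV1) using only changed records."""
    out = set(cached)
    out -= d["removed"].keys()
    for vid, (_, new_rec) in d["updated"].items():
        (out.add if predicate(new_rec) else out.discard)(vid)
    for vid, rec in d["added"].items():
        if predicate(rec):
            out.add(vid)
    return out

is_pathogenic = lambda rec: rec == "pathogenic"
cv1 = {"rs1": "benign", "rs2": "pathogenic", "rs3": "VUS"}
cv2 = {"rs1": "pathogenic", "rs2": "benign", "rs4": "pathogenic"}
d = {"added":   {"rs4": "pathogenic"},
     "removed": {"rs3": "VUS"},
     "updated": {"rs1": ("benign", "pathogenic"),
                 "rs2": ("pathogenic", "benign")}}
# same answer as re-querying cv2, but touching only the diff records
assert query_on_diff(query(cv1, is_pathogenic), d, is_pathogenic) \
       == query(cv2, is_pathogenic)
```

With diffs of a few thousand records against databases of ~290 k (the tables above), the saving in records touched mirrors the reported 87–99% reductions.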
Saving resources on stream processing

[Figure: baseline stream processing — the raw stream x1, x2, … is split into windows W1, W2, … of size k; each window is processed by P to produce outputs y1, y2, …. Conditional stream processing — for each new window Wi+1, a Comp/noComp decision either computes a fresh output or re-delivers the latest computed value yi-h (h < i).]
- If we could predict that yi+1 will be similar to yi, we could skip computing P(Wi+1), save resources, and deliver yi again instead.
- Can we make optimal Comp/noComp decisions? What is required?
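The Comp/noComp decision can be sketched as a threshold rule: recompute P(Wi) only when a cheap drift predictor says the cached output is stale; otherwise re-deliver the latest computed value. The window function, drift predictor, and threshold are all invented for illustration:

```python
def conditional_stream(windows, P, predict_drift, threshold):
    """Deliver P(W_i) when predicted drift is high, else the cached output."""
    outputs, computed, y = [], 0, None
    for w in windows:
        if y is None or predict_drift(w, y) > threshold:
            y = P(w)           # Comp: pay for the computation
            computed += 1
        outputs.append(y)      # noComp: re-deliver the latest computed value
    return outputs, computed

# toy stream: P is the window mean; the drift predictor is a cheap proxy
# that only inspects the first element of the new window
windows = [[1, 1, 1], [1, 1, 2], [9, 9, 9], [9, 9, 8]]
P = lambda w: sum(w) / len(w)
predictor = lambda w, y_last: abs(w[0] - y_last)
ys, n = conditional_stream(windows, P, predictor, threshold=0.5)
print(ys, n)  # [1.0, 1.0, 9.0, 9.0] 2 — half the computations skipped
```

Making such decisions optimal (rather than threshold-based) is exactly the open question on the slide.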
Routes drift – comparing ranked lists

P outputs a list of the top most frequent/profitable routes.
To compare lists we use the generalised Kendall’s tau (Fagin et al. [1]), which quantifies how much the top-k changes between one window and the next.
Input parameters determine stability / sensitivity:
- k: how many routes
- window size (e.g. 30 minutes)

[1] Fagin, R., R. Kumar, and D. Sivakumar. “Comparing Top k Lists.” SIAM Journal on Discrete Mathematics 17, no. 1 (2003): 134–60. doi:10.1137/S0895480102412856.
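A minimal implementation of the K^(p) distance for top-k lists, with per-pair penalties following the four cases of Fagin et al. [1] (p is the penalty for pairs whose relative order cannot be determined). This is a sketch of the metric, not the project's code:

```python
from itertools import combinations

def kendall_topk(l1, l2, p=0.5):
    """Generalised Kendall distance K^(p) between two top-k lists."""
    r1 = {x: i for i, x in enumerate(l1)}
    r2 = {x: i for i, x in enumerate(l2)}
    d = 0.0
    for a, b in combinations(set(l1) | set(l2), 2):
        in1 = (a in r1, b in r1)
        in2 = (a in r2, b in r2)
        if all(in1) and all(in2):
            # case 1: both items in both lists; penalise discordant order
            if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0:
                d += 1
        elif all(in1) and any(in2):
            # case 2: both in l1, one in l2; the one absent from l2 is
            # implicitly ranked below everything in l2
            present, absent = (a, b) if a in r2 else (b, a)
            if r1[present] > r1[absent]:
                d += 1
        elif all(in2) and any(in1):
            present, absent = (a, b) if a in r1 else (b, a)
            if r2[present] > r2[absent]:
                d += 1
        elif any(in1) and any(in2):
            # case 3: a only in one list, b only in the other: disagreement
            d += 1
        else:
            # case 4: both in one list, neither in the other: penalty p
            d += p
    return d

print(kendall_topk([1, 2, 3], [3, 2, 1]))  # 3.0 — every pair reversed
print(kendall_topk([1, 2, 3], [4, 5, 6]))  # 12.0 — disjoint top-3 lists
```

Normalising by the disjoint-list maximum gives the normalised drift value plotted on the next slide.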
Approach: ARIMA forecasting

[Figure: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast; drift function: top-10, window size = 1 h, date range = [20 Jan 00:00, 25 Jan 17:00); new-day boundaries marked.]
Drift prediction using time series forecasting:
• This is the derived diff() time series!
• Autoregressive integrated moving average (ARIMA)
• Widely used, well understood, and well supported
• Fast to compute
• Assumes normality of the underlying random variable
A poor prediction means we compute P too often or too rarely.
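Full ARIMA fitting is best left to a library (e.g. statsmodels in Python or R's forecast package); as a self-contained illustration of the idea, the sketch below fits a plain AR(1) model by least squares and produces a one-step-ahead forecast of a drift series. The synthetic data and coefficients are assumptions for illustration:

```python
import random

def fit_ar1(series):
    """Least-squares AR(1) fit: x_t ≈ c + phi * x_{t-1}."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    phi = cov / var
    c = my - phi * mx
    return c, phi

def forecast(series, c, phi):
    """One-step-ahead forecast of the next drift value."""
    return c + phi * series[-1]

# synthetic drift series with known dynamics: x_t = 0.1 + 0.8*x_{t-1} + noise
random.seed(42)
x = [0.5]
for _ in range(500):
    x.append(0.1 + 0.8 * x[-1] + random.gauss(0, 0.05))
c, phi = fit_ar1(x)
print(round(phi, 2))  # recovered coefficient, close to the true 0.8
print(round(forecast(x, c, phi), 3))
```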
The next steps – challenges
• Can we learn effective surrogate models and estimators of change impact?
• diff() functions and estimators seem very problem-specific
• To what extent can the ReComp framework be made generic and reusable, yet still useful?
• Metadata infrastructure: a database of past execution history
• Reproducibility: what really happens when I press the “ReComp” button?
Summary and challenges

[Figure: the ReComp meta-process. Forwards loop (react to changes in data used by processes): monitor data changes → quantify data changes → estimate impact of changes and cost of refresh → optimise/prioritise outcomes → re-compute selected outcomes. Backwards loop (restore the value of knowledge outcomes): new ground truth → quantify knowledge decay → estimate benefit and cost of refresh → re-compute. Supported by input/reference data versioning, data change events, knowledge outcomes, and provenance and cost metadata.]
ReComp:
a meta-process to observe and control underlying analytics processes
ReComp scenarios

Scenario: dataflow, experimental science
- Target impact area: genomics
- Why ReComp is relevant: rapid knowledge advances; rapid scaling-up of genetic testing at population level
- Proof-of-concept experiments: WES/SVI pipeline, workflow implementation (eScience Central)
- Expected optimisation: timeliness and accuracy of patient diagnosis, subject to budget constraints

Scenario: time series analysis
- Target impact areas: personal health monitoring; smart city analytics; IoT data streams
- Why ReComp is relevant: rapid data drift; cost of computation at the network edge (e.g. IoT)
- Proof-of-concept experiments: NYC taxi rides challenge (DEBS’15)
- Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

Scenario: data layer optimisation
- Target impact area: tuning of a large-scale data management stack
- Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
- Proof-of-concept experiments: graph DB re-partitioning
- Expected optimisation: system throughput vs cost of re-tuning

Scenario: model learning
- Target impact area: applications of predictive analytics
- Why ReComp is relevant: predictive models are very sensitive to data drift
- Proof-of-concept experiments: Twitter content analysis
- Expected optimisation: sustained model predictive power over time vs retraining cost

Scenario: simulation
- Target impact area: TBD
- Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
- Proof-of-concept experiments: flood modelling / CityCat Newcastle
- Expected optimisation: computational resources vs marginal benefit of a new simulation model
Observability / transparency

Structure (static view):
- White box: dataflow systems (eScience Central, Taverna, VisTrails, …); scripting (R, Matlab, Python, …)
- Black box: function semantics; packaged components; third-party services

Data dependencies (runtime view):
- White box: provenance recording of inputs, reference datasets, component versions, and outputs
- Black box: inputs and outputs only; no data dependencies; no details on individual components

Cost:
- White box: detailed resource monitoring; cloud £££
- Black box: wall-clock time; service pricing; setup time (e.g. model learning)
Project structure
• 3 years of funding from the EPSRC (£585,000 grant) under the Making Sense from Data call
• Feb. 2016 – Jan. 2019
• 2 RAs fully employed in Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-Investigators (8% each):
  • Prof. Watson, School of Computing Science, Newcastle University
  • Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
  • Dr. Phil James, Civil Engineering, Newcastle University

Builds upon the experience of the Cloud-e-Genome project (2013–2015):
- Aims: to demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud, and to facilitate the adoption of reliable genetic testing in clinical practice
- A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
- Funding: NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research “Azure for Research” grant
Speaker notes
The times they are a-changin’
Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of the compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
\noindent Program $P$ takes input $x$\\
depends on reference data resources $D = \{D_1 \ldots D_m\}$
\noindent Each execution $i: 1 \dots N$ operates:\\
- on a version of its input: $x_i^t$ \\
- on a state $d_j^t$ for each $D_j \in D$ \\
- with cost $c_i^t$
\begin{equation*}
\langle y_i^t, c_i^t \rangle = \exec(P, x_i^t, \{ d_1^t \dots d_m^t\})
\end{equation*}
\noindent \textbf{data version changes:}\\
- inputs $\update{x_i^{t'}}{x_i^t}$ \\
- dependencies: $\update{d_j^{t'}}{d_j^t}$: new release of $D_j$ at time $t'$.
%
\noindent \textbf{Diff functions:}\\
- $\diff{X}(x_i^t, x_i^{t'}) $ \\
- $\diff{Y}(y_i^t, y_i^{t'}) $ \\
- $\diff{D_j}(d_j^t,d_j^{t'}) $ e.g. added, removed, updated records
\noindent At time $t$:
\begin{equation*}
\langle y_i^{t}, c_i^{t} \rangle = \exec(P, x_i^{t}, d^{t})
\label{eq:refreshed-exec}
\end{equation*}
\noindent At time $t' > t$, change in dependency $\update{d_j^{t'}}{d_j^t}$:
\begin{equation*}
\langle y_i^{t'}, c_i^{t'} \rangle = \exec(P, x_i^{t}, d^{t'})
\end{equation*}
where $d^{t'} = \{ d_1^t \dots d_j^{t'} \dots d_m^t \}$
\noindent Impact of the change $\update{d_j^{t'}}{d_j^t}$:
\begin{equation*}
\mathit{imp}(\update{d_j^{t'}}{d_j^t}, y_i^t) = f_Y(\diff{Y}(y_i^t, y_i^{t'})) \in [0,1]
\label{eq:imp}
\end{equation*}
where function $f_Y()$ is type-specific (and domain-specific)
Returns updated mappings to genes that have changed between the two versions (including possibly new mappings):
$\diffOM(\OM^t, \OM^{t'}) = \{\langle \dt, genes(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \} $\\
where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.
\begin{align*}
\diffCV&(\CV^t, \CV^{t'}) = \\
&\{ \langle v, \varst(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
& \cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
\label{eq:diff-cv}
\end{align*}
where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.
\noindent - $O^t = \{ y_1^t, \dots y_N^t\}$: set of all outcomes that are current at time $t$\\
- consider a single change $\update{d_j^{t'}}{d_j^t}$ (for simplicity)
\noindent Select optimal $O_{rc}^t \subseteq O^t$ such that:
\begin{equation*}
\max_{O_{rc}^t \subseteq O^t} \sum_{y_i \in O_{rc}^t}\mathit{imp}(\update{d_j^{t'}}{d_j^t}, y_i^t) \text{,} \quad \sum_{y_i \in O_{rc}^t} c_i^{t'} \leq C
\end{equation*}
\noindent Using estimates of impact and cost:
\begin{equation*}
\{ \langle \imphat(\update{d_j^{t'}}{d_j^t}, y_i^t), \hat{c}_i^{t'} \rangle | y_i^t \in O^t\}
\label{eq:imp-est}
\end{equation*}
\begin{align*}
\max_{O_{rc}^t \subseteq O^t} \sum_{y_i \in O_{rc}^t}\imphat(\update{d_j^{t'}}{d_j^t}, y_i^t) \text{,} \quad \sum_{y_i \in O_{rc}^t} \hat{c}_i^{t'} \leq C
\end{align*}
- analyse(W) runs analytics on a windowed stream W1, W2, …
- At time ti it produces and delivers output Oi = analyse(Wi)
(Requires the ReComp preamble.)
\begin{equation*}
\hat{y}_i = \begin{cases}
y_i & \text{if $y_i = P(W_i)$ is computed,} \\
y_{i-k} & \text{otherwise}
\end{cases}
\end{equation*}
\text{ where $y_{i-k}$ is the latest computed value} \\
Denote this value as $\mathit{surr}(y_i)$.
$\diffo: O \times O \rightarrow [0,1] $
with: $\diffo(y_i, y_i) = 0$
Quality function
$q: O \times \mathbb{N} \rightarrow [0, q_{max}] $\\
$q(y_i, j)$ quantifies the currency of $y_i$ at time $j>i$. \\
\begin{equation*}
q(y_i, j) = \begin{cases}
q_{max} \text{~when $i=j$} \\
q_{max}- |\diffo(y_i, y_j)| \text{~otherwise}
\end{cases}
\end{equation*}
\begin{equation*}
\mathit{perf}(N) = \frac{\sum_{i:1}^{N} q(\mathit{surr}(y_i), i)}{N \cdot C}
\end{equation*}
If $P$ is recomputed at every step, the total cost is $C = N \cdot c$, and \\
$\sum_{i:1}^{N}q(\mathit{act}(y_i), i) = N \cdot q_{max}$\\
because $\mathit{act}(y_i) = y_i$ for all $i$, thus \\
$\mathit{perf}(N) = \frac{q_{max}}{N \cdot c}$
-----
If instead $P$ is never recomputed after $y_1$, the total cost is $C = c$, and \\
$q(\mathit{act}(y_i), i) = q(y_1, i) = q_{max} - \diffo(y_1, y_i)$\\
for each $i$, thus:\\
$\mathit{perf}(N) = \frac{1}{c} \Bigl[q_{max} - \frac{\sum_{i:1}^{N}\diffo(y_1, y_i)}{N} \Bigr ]$
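The two closed forms can be checked numerically. A sketch, assuming a concrete diffo (absolute difference of scalar outputs in [0,1]) and a slowly drifting output series — both assumptions for illustration:

```python
q_max, c, N = 1.0, 2.0, 100
ys = [min(1.0, 0.01 * i) for i in range(N)]   # drifting outputs in [0,1]
diffo = lambda a, b: abs(a - b)               # diffo: O x O -> [0,1]

# always recompute: every y_i is fresh, total cost C = N*c
perf_always = sum(q_max for _ in ys) / (N * (N * c))
assert abs(perf_always - q_max / (N * c)) < 1e-12

# never recompute after y_1: deliver y_1 forever, total cost C = c
q_never = [q_max - diffo(ys[0], y) for y in ys]
perf_never = sum(q_never) / (N * c)
closed = (q_max - sum(diffo(ys[0], y) for y in ys) / N) / c
assert abs(perf_never - closed) < 1e-12

print(round(perf_always, 4), round(perf_never, 4))  # 0.005 0.2525
```

As expected, never recomputing scores higher here because the drift is small relative to the per-step cost; faster drift reverses the comparison.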
We used the generalised Kendall’s tau as a well-recognised, generic method to compare top-k lists.