
Analytics of analytics pipelines: from optimising re-execution to general Data Provenance for Data Science



An invited talk (thanks!) given to the Office of National Statistics in March 2021



  1. Analytics of analytics pipelines: from optimising re-execution to general Data Provenance for Data Science. Paolo Missier, School of Computing, Newcastle University, UK, March 2021. paolo.missier@ncl.ac.uk | LinkedIn: paolomissier | Twitter: @PMissier
  2. Outline: (1) ReComp, a framework to enable the selective re-computation of expensive analytics workflows; (2) Data Provenance for Data Science.
  3. Context: over time, Big Data feeds "the Big Analytics Machine" (analytics / data science) to produce actionable knowledge, while the meta-knowledge this depends on (algorithms, tools, libraries, reference datasets) evolves through successive versions (V1, V2, V3, ...).
  4. What changes? Life sciences and health care: reference databases, algorithms and libraries. Simulation: large parameter spaces, input conditions. Machine learning: evolving ground-truth datasets, model re-training.
  5. Motivating example: genomics pipelines. Spark GATK tools on Azure: 45 minutes per GB, at 13 GB per exome, about 10 hours. Image credits: Broad Institute, https://software.broadinstitute.org/gatk/ and https://www.genomicsengland.co.uk/the-100000-genomes-project/
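     As a check on the quoted runtime:
     \[ 45~\text{min/GB} \times 13~\text{GB} = 585~\text{min} \approx 9.75~\text{h} \approx 10~\text{hours per exome} \]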
  6. Genomics: WES/WGS, variant calling → variant interpretation. SVI: Simple Variant Interpretation, a simple single-nucleotide human variant interpretation tool for clinical use, classifying variants as pathogenic, benign, or unknown/uncertain. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
  7. Changes that affect variant interpretation. What changes: improved sequencing and variant calling; ClinVar and OMIM evolve rapidly; new reference data sources. The charts show the evolution in the number of variants that affect patients (a) with a specific phenotype and (b) across all phenotypes.
  8. Blind reaction to change: a game of battleship. The sparsity issue: about 500 executions over 33 patients, with a total runtime of about 60 hours, detected only 14 relevant output changes, i.e. about 4.2 hours of computation per relevant change. Should we care about updates, given evolving knowledge about gene variations?
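     The per-change cost follows from these totals (the quoted 4.2 hours presumably reflects the exact runtime behind "about 60 hours"):
     \[ \frac{60~\text{hours}}{14~\text{relevant changes}} \approx 4.3~\text{hours per change} \]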
  9. Reacting to changes in inputs: outputs y = f(x, d) depend on inputs x and reference data d, and d changes. Two baseline strategies: 1. always refresh; 2. approximate.
  10. Strategy 3: a refresh-if-needed approach.
  11. When f(.) is unstable → heuristics. Impact: "any variant whose status moves from/to Red causes High impact on any patient who is affected by the variant." Scope (which cases are affected?): "a change in variant v can only have impact on a case X if v and X share the same phenotype." Observation: variants v within the output set y that are in scope for patient X remain in scope (monotonicity). Two kinds of change: 1. variant v changes status (unknown → benign, or unknown → deleterious): if in scope, compare status before/after, which is inexpensive; 2. a brand-new variant: recompute SVI on all inputs, which is expensive.
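     A minimal Python sketch of this scope-and-impact heuristic (the record fields, status code, and function names are illustrative assumptions, not the actual SVI code):

     RED = "red"  # status class standing for pathogenic / deleterious

     def in_scope(variant, case):
         # Scope rule: a variant can only affect a case that shares its phenotype.
         return variant["phenotype"] == case["phenotype"]

     def is_high_impact(old_status, new_status):
         # Impact rule: any status transition from or to Red is high impact.
         return old_status == RED or new_status == RED

     def affected_cases(change, cases):
         # Select past cases whose SVI output may be invalidated by one change,
         # without re-running the expensive pipeline.
         if change["old_status"] is None:
             # Brand-new variant: no cheap before/after test, so recompute
             # SVI on all inputs (the expensive path).
             return list(cases)
         return [c for c in cases
                 if in_scope(change, c)
                 and is_high_impact(change["old_status"], change["new_status"])]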
  12. Empirical evaluation: re-executions reduced from 495 to 71 (the ideal would be 14), but with no false negatives.
  13. ReComp (http://recomp.org.uk/). Outcome: a generic, customisable framework for selective re-computation. Scope: expensive analyses with frequent data changes, where not all changes are significant. Challenge: make re-computation efficient in response to changes. Assumptions: processes are observable and reproducible, and estimates are cheap. Insight: replace re-computation with change-impact estimation, using the history of past executions.
  14. Data provenance in ReComp. Hypothesis: collecting detailed provenance and logs from past executions helps optimise future executions: 1. identify the subset of past executions that are potentially affected by the changes; 2. identify and re-execute the minimal fragments of workflow that have been affected.
  15. Selective reproducibility: across a cohort of past executions (which subset of individuals?) and within a single re-execution (which process fragments?), for a change in ClinVar or a change in GeneMap: why, when, and to what extent.
  16. The ReComp meta-process. Record the execution history (logs / provenance) of the analytics process P in a History DB. On change events (changes to reference datasets or inputs), detect and quantify the changes with data diff functions diff(d, d'). For each past instance, estimate the impact of the changes, Impact(d → d', o), using impact estimation functions; scope the affected instances, select the relevant sub-processes, and partially re-execute, turning P(D) into P'(D') (optimisation).
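     The meta-process can be summarised as a loop; this sketch is illustrative, with diff, estimate_impact, and partial_reexec standing in for ReComp's customisable components:

     THRESHOLD = 0.0  # impact level above which a past instance is refreshed

     def recomp_loop(change_event, history_db, diff, estimate_impact, partial_reexec):
         # On a change d -> d', decide which past executions to refresh.
         delta = diff(change_event.old, change_event.new)        # quantify the change
         for execution in history_db.executions():               # each past instance
             impact = estimate_impact(delta, execution.outputs)  # cheap estimate
             if impact > THRESHOLD:
                 # Re-run only the affected sub-processes of this instance,
                 # guided by its recorded provenance.
                 partial_reexec(execution, delta)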
  17. How much do we know about the process? A spectrum from less to more knowledge. Black box (a monolithic process or legacy code, e.g. a complex simulator): only I/O provenance in the execution trace, all-or-nothing re-execution, coarse impact estimation. White box (workflows, R / Python code, e.g. genomics analytics): step-by-step provenance, fine-grained impact estimation, partial re-execution via restart trees (*). (*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018. London: Springer; 2018.
  18. Provenance of process executions. In PROV terms: an Activity (a process or workflow run, e.g. cooking) wasAssociatedWith an agent (the chef) under a Plan (the recipe), and the data product (the finished dish) wasGeneratedBy the activity; a plan plays a role in an association.
  19. Execution trace / provenance. [UML class diagram of the ProvONE model: Users, Executions, Programs / Workflows, Ports, and Channels connected by PROV relations such as Association, Usage, Generation, wasPartOf, wasDerivedFrom, wasInformedBy, and wasAssociatedWith.]
  20. History DB: workflow provenance. Each invocation of a workflow generates a provenance trace that links the plan (workflow WF with programs B1 and B2) to the plan execution (WFexec with B1exec and B2exec, related by partOf), records the usage and generation of data entities (including the reference data db), and associates each execution with its program.
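     For concreteness, a trace like this can be built with the Python prov package; the identifiers are illustrative, and ReComp's History DB actually uses ProvONE, which extends PROV with workflow relations such as wasPartOf:

     from prov.model import ProvDocument

     doc = ProvDocument()
     doc.add_namespace('ex', 'http://example.org/')

     wf_exec = doc.activity('ex:WFexec')        # the plan execution
     b1_exec = doc.activity('ex:B1exec')        # execution of sub-program B1
     engine = doc.agent('ex:workflow_engine')
     doc.wasAssociatedWith(wf_exec, engine)     # association

     db = doc.entity('ex:refdata_db')           # reference data entity
     inp = doc.entity('ex:input_data')
     out = doc.entity('ex:output_data')
     doc.used(b1_exec, inp)                     # usage
     doc.used(b1_exec, db)                      # usage of reference data
     doc.wasGeneratedBy(out, b1_exec)           # generation

     print(doc.get_provn())                     # PROV-N rendering of the trace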
  21. SVI implemented as a workflow: phenotype-to-genes mapping (using GeneMap), variant selection over the patient's variants, and variant classification (using ClinVar), producing the classified variants from a phenotype.
  22. SVI: partial re-execution. Overhead: caching intermediate data. Time savings:

      Change source | Partial re-exec (sec) | Complete re-exec (sec) | Time saving (%)
      GeneMap       | 325                   | 455                    | 28.5
      ClinVar       | 287                   | 455                    | 37

      Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
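     The savings column follows directly from the timings, up to rounding:
     \[ \frac{455 - 325}{455} \approx 28.6\%, \qquad \frac{455 - 287}{455} \approx 36.9\% \]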
  23. ReComp: summary. A generic framework: observe changes, maintain a provenance DB (history), and control re-execution. Customisation: diff functions and impact functions. Evaluation is on a case-by-case basis: cost savings and ease of customisation. Fine-grained provenance + control → maximum savings.
  24. Data Provenance for Data Science.
  25. Data → model → predictions. Data collection yields raw datasets; pre-processing turns instances into features; the model then predicts "you": a ranking, a score, or a class. Key decisions are made during data selection and processing: Where does the data come from? What's in the dataset? What transformations were applied? Complementing current ML approaches to model interpretability: 1. Can we explain these decisions? 2. Are these explanations useful?
  26. Explaining data preparation. Pre-processing spans integration, cleaning, outlier removal, normalisation, feature selection, class rebalancing, sampling, stratification, and more, written as scripts (Python / TensorFlow, Pandas, Spark) or workflows (Knime, ...). Data acquisition and wrangling: How were datasets acquired? How recently? For what purpose? Are they being reused or repurposed? What is their quality? Provenance → transparency.
  27. Typical operators used in data preparation.
  28. Recent early results from a small grassroots project [1]: formalisation of provenance patterns for pipeline operators; systematic collection of fine-grained provenance from (nearly) arbitrary pipelines; and a reality check: how much does it cost (provenance volume), and does it help (queries against the provenance database)? [1] Chapman, A., Missier, P., Simonelli, G., and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4):507-520, January 2021.
  29. Operators. Data reduction: feature selection, instance selection. Data augmentation: space transformation, instance generation, encoding (e.g. one-hot). Data transformation: data repair, binarisation, normalisation, discretisation, imputation. Example: vertical augmentation → adding columns.
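     A small pandas example of vertical augmentation via one-hot encoding (the dataset and column names are made up):

     import pandas as pd

     # Toy dataset; column names are made up for illustration.
     df = pd.DataFrame({"age": [34, 51], "smoker": ["yes", "no"]})

     # One-hot encoding is a vertical augmentation: the operator replaces the
     # 'smoker' feature with new columns, one per category.
     df2 = pd.get_dummies(df, columns=["smoker"])
     print(df2.columns.tolist())  # ['age', 'smoker_no', 'smoker_yes']

     # Fine-grained provenance for this step records, per row, which input
     # value each new cell (smoker_no, smoker_yes) was derived from.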
  30. Code instrumentation: initialise provenance capture, then create a provlet for each specific transformation. The code injection is now being automated!
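     A hypothetical sketch of what such instrumentation looks like; ProvenanceTracker and its capture method are illustrative stand-ins, not the API of the tool in [1]:

     import pandas as pd

     class ProvenanceTracker:
         """Illustrative stand-in for a provenance capture component."""
         def __init__(self):
             self.provlets = []

         def capture(self, op_name, df_in, df_out):
             # Record which columns the operator consumed and which it generated:
             # one provlet per transformation.
             self.provlets.append({
                 "operator": op_name,
                 "used": list(df_in.columns),
                 "generated": [c for c in df_out.columns if c not in df_in.columns],
                 "rows_out": len(df_out),
             })
             return df_out

     tracker = ProvenanceTracker()  # initialise provenance capture
     df = pd.DataFrame({"age": [34.0, None], "smoker": ["yes", "no"]})
     df = tracker.capture("imputation", df, df.fillna({"age": df["age"].mean()}))
     df = tracker.capture("one-hot", df, pd.get_dummies(df, columns=["smoker"]))
     print(tracker.provlets)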
  31. Provenance patterns.
  32. Provenance templates: template + binding rules = instantiated provenance fragment. For an operator op: {old values: F, I, V} → {new values: F', J, V'}.
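     An illustrative sketch of template instantiation, where binding rules map a template's placeholders to the concrete features and values touched by one operator execution (the template shape is an assumption for illustration):

     # A generic template: one entity derived from an old value by an operator.
     TEMPLATE = [
         ("entity", "{new}", {"derived_from": "{old}", "by": "{op}"}),
     ]

     def instantiate(template, bindings):
         # Binding rules fill the template's placeholders with concrete values.
         return [(kind, ident.format(**bindings),
                  {k: v.format(**bindings) for k, v in attrs.items()})
                 for kind, ident, attrs in template]

     fragment = instantiate(TEMPLATE, {"op": "one-hot", "old": "smoker", "new": "smoker_yes"})
     print(fragment)  # [('entity', 'smoker_yes', {'derived_from': 'smoker', 'by': 'one-hot'})]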
  33. This applies to all operators…
  34. Putting it all together.
  35. Evaluation: performance.
  36. Evaluation: provenance capture and query times.
  37. Scalability.
  38. Summary. Multiple hypotheses regarding Data Provenance for Data Science: 1. Is it practical to collect fine-grained provenance? (a) To what extent can it be done automatically? (b) How much does it cost? 2. Is it also useful? That is, what is the benefit to data analysts? Work in progress! Interest? Ideas?

Editor's Notes

  • We are going to use this smaller process as a testbed
    Changes in the reference databases have an impact on the classification
  • returns updates in mappings to genes that have changed between the two versions (including possibly new mappings):


    $\diffOM(\OM^t, \OM^{t'}) = \{\langle \dt, genes'(\dt) \rangle \mid genes(\dt) \neq genes'(\dt) \}$
    where $genes'(\dt)$ is the new mapping for $\dt$ in $\OM^{t'}$.

    \begin{align*}
    \diffCV(\CV^t, \CV^{t'}) = {} &\{ \langle v, \varst'(v) \rangle \mid \varst(v) \neq \varst'(v) \} \\
    &\cup \CV^{t'} \setminus \CV^t \cup \CV^t \setminus \CV^{t'}
    \end{align*}
    where $\varst'(v)$ is the new class associated with $v$ in $\CV^{t'}$.


  • Threats: Will any of the changes invalidate prior findings?
    Opportunities: Can the findings be improved over time?

    Can we do better in a generic way?
    We need to control re-computation on two dimensions

    Across a population
    Within a single process
  • \begin{align*}
    &f: X \rightarrow Y, \qquad \mathbf{y} = f(\mathbf{x}, \mathbf{d}) \\
    &\mathbf{x} = [x_1 \dots x_n], \qquad \mathbf{y} = [y_1 \dots y_m] \\
    &\text{change: } \mathbf{d} \rightsquigarrow \mathbf{d'}, \qquad \text{refresh needed when } \delta_Y(\mathbf{y}, \mathbf{y'}) > \Delta_Y
    \end{align*}

    1. Always refresh:
    \begin{align*}
    &\text{simply compute } \mathbf{y'} = f(\mathbf{x'}, \mathbf{d'}) \\
    &\text{inefficient if computing } f(.) \text{ is expensive, and} \\
    &\mathbf{y}, \mathbf{y'} \text{ turn out to be very similar to each other}
    \end{align*}

    2. Approximate:
    \begin{align*}
    &\text{find a new function } f'(.) \text{ that approximates } f(.) \\
    &\text{return } f'(\mathbf{x'}, \mathbf{d'})
    \end{align*}




  • 3. Refresh if needed:
    \begin{align*}
    &\text{define a distance metric } \delta_Y \text{ on } Y \\
    &\text{try to estimate } \delta_Y(\mathbf{y}, \mathbf{y'}) \text{ \emph{without explicitly computing} } \mathbf{y'} \\
    &\text{if } \delta_Y(\mathbf{y}, \mathbf{y'}) > \Delta_Y \text{ for a set threshold } \Delta_Y \text{, then compute } f(\mathbf{x'}, \mathbf{d'})
    \end{align*}

    This approach works well when:
    \begin{align*}
    &\text{1. distance metrics can be defined on both } X \text{ and } Y \text{: } \delta_X, \delta_Y \\
    &\text{2. } f(.) \text{ is \emph{stable}:} \quad \delta_X(\mathbf{x}, \mathbf{x'}) < \epsilon_X \Rightarrow \delta_Y(\mathbf{y}, \mathbf{y'}) < \epsilon_Y
    \end{align*}



  • \begin{align*}
    &\text{let } v \in \diff{Y}(Y^t, Y^{t'}): \\
    &\text{for any } X \text{: } \impact_{P}(C, X) = \texttt{High} \text{ if } v.\texttt{status}
    \begin{cases}
    * \rightarrow \texttt{red} \\
    \texttt{red} \rightarrow *
    \end{cases}
    \end{align*}

  • Success criteria:

    performance, though this is assessed on a case-by-case basis;
    ease of customization, the focus of this paper.

  • The framework is a meta-process…
    Changes can also occur to OS, libraries and other dependencies but these are out of scope
  • The black box case is illustrated here and is less interesting.
    The more interesting SVI case is in the next slide
  • Shows the essential ProvONE fragment used by ReComp
  • This shows the good case of a "grey box" workflow and box-level provenance

    SVI workflow with automated provenance recording
    Cohort of about 100 exomes (neurological disorders)
    Changes in ClinVar and OMIM GeneMap
  • How these two restart trees are discovered is explained in the two papers (IPAW, BDC)
  • How about the data used to train / build the model?
  • Relatively easy to keep track of data pre-processing → provenance
  • \newcommand{\f}{\textbf{a}}
    \text{features: } X = [\f_1 \ldots \f_k] \\
    \text{new features: } Y = [\f'_1 \ldots \f'_l]

    New values for each row are obtained by applying $f$ to the values in the $X$ features.
