TAPP’11 presentation
1. Incremental workflow improvement
through analysis of its data provenance
Paolo Missier
School of Computing,
Newcastle University, UK
In collaboration with the eScience Central group at Newcastle:
Prof. Paul Watson, Dr. Simon Woodman, Dr. Hugo Hiden, Dr. Jacek Cala
TAPP’11 workshop
Heraklion, Crete
June 20-21, 2011
3. Motivation
Despite the growing momentum around provenance as a premier form of
metadata, successful exploitation stories trail models and technology
advances.
• We are getting very good at recording the provenance of data:
  – multiple data models (OPM, Provenir, Janus, Karma, PML, ...)
  – provenance-aware system / service architectures ...
    • PASS, Karma
  – ... and workflows
    • Kepler, Taverna, Galaxy, VisTrails, ...
• But what are systems and applications really doing with it?
  – delivering value to users, e.g. in e-science and on the Web?
    • scientific reproducibility, quality, trust
  – optimizing system analysis and performance?
    • enabling partial re-runs of resource-intensive processes
4. Broad goal and specific research context
A systematic study of methods and applications of mining / learning
techniques applied to large corpora of provenance metadata
• Cloud-based workflows come with a clear cost model
  – a valuable testbed is readily available:
    • real e-science applications
    • large provenance graphs
  – dynamic optimization requires many runs
• Reference framework: adaptive, self-managing software systems [1,2]
  – systems that can dynamically adapt their behaviour in response to
    changing conditions in the inputs or in their environment
[Figure 1: a standard autonomic computing monitor-analyze-plan-execute
(MAPE) loop, annotated "Reinforcement Learning". The autonomic manager
continually monitors sensor readings, analyzes and plans management
decisions, and executes the decisions via effectors. The MAPE components
have full access to a central knowledge base containing information on the
likely effectiveness of different management decisions in achieving policy
objectives.]
[1] IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM, 2011.
[2] Huebscher, M. C., and McCann, J. A. A survey of autonomic computing -
    degrees, models, and applications. ACM Computing Surveys (CSUR) 40, 3
    (2008), 1-28.
5. Broad goal and specific research context
A systematic study of methods and applications of mining / learning
techniques applied to large corpora of provenance metadata
• An opportunistic starting point:
  optimization of resource-intensive, repetitive, provenance-aware
  e-science workflows
7. Research
Hypothesis: Provenance traces recorded for past runs of a workflow can be
used to make future runs more efficient
Approach: Add adaptive control to an existing workflow, with provenance
analysis at its core → new recommender task
Applicable for instance to a “Panel of Experts” (PoE) pattern:
– N experts are activated on the same inputs, and the best outputs are
  selected
– Provenance is used for incremental correlation of the inputs to the
  experts’ success rate
– The provenance of run i indicates which experts performed well on their
  input
– Adaptively pruning the process space: on run i+1, use the provenance of
  the output results computed by runs 1..i to select and prioritize the
  invocation of experts
[Figure: Panel of Experts pattern. An input queue feeds Expert 1 .. Expert n;
their results are merged into an output queue. Online provenance analysis
over a provenance case base drives a recommender that selects experts.]
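The adaptive PoE loop described above can be sketched in code. Everything below is an invented stand-in for illustration (the toy experts, the goodness test passed in by the caller, and the success-rate ranking), not the actual recommender implementation:

```python
# Hypothetical experts: each maps an input to a (name, output) pair.
def expert_a(x):
    return ("A", x * 2)

def expert_b(x):
    return ("B", x + 100)

EXPERTS = {"A": expert_a, "B": expert_b}

# Success history per expert, learned from the provenance of past runs.
history = {name: {"good": 0, "bad": 0} for name in EXPERTS}

def prioritize():
    """Order experts by historical success rate (the dynamic ranking)."""
    def rate(name):
        h = history[name]
        n = h["good"] + h["bad"]
        return h["good"] / n if n else 0.5  # unseen experts rank neutrally
    return sorted(EXPERTS, key=rate, reverse=True)

def run(x, is_good):
    """One PoE run: invoke experts in priority order, record outcomes."""
    results = []
    for name in prioritize():
        out = EXPERTS[name](x)
        good = is_good(out)  # quality judgement recorded in provenance
        history[name]["good" if good else "bad"] += 1
        results.append((name, out, good))
    return results
```

In a real deployment the `history` table would be rebuilt from the provenance case base rather than kept in memory, and a pruning variant would invoke only the top-ranked experts instead of all of them.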
8. Case study: DiscoveryBus workflow
• QSAR: Quantitative Structure-Activity Relationships
  – at the forefront of chemical engineering research
• OpenQsar project (http://www.openqsar.com):
  – establish correlations between the structure of molecular compounds
    and some of their associated activities (toxicity, solubility, etc.)
• DiscoveryBus: a workflow implementation of OpenQSAR
  – eScience Central cloud-based workflow system
  – each dataset DS(i) is a family of structurally homogeneous molecules
  – Feature Selection extracts a few relevant feature sets FS(i)1 .. FS(i)k
    from DS(i)
  – each learning scheme MB1 .. MBh generates a different predictive model
    for molecular activity
• Repetitive and resource-intensive: workflow execution is repeated over
  about 10K different input datasets
[Figure: the workflow fans out from DS(i) through Feature Selection to
FS(i)1 .. FS(i)k, then through Model Discovery (MB1 .. MBh) to models
M(i)11 .. M(i)kh.]
9. Role of provenance in DiscoveryBus
• Connection to Panel of Experts:
  – experts: the model builders MBi
  – experts’ outcome: the quality of the generated model
    (predictive power, stability)
• Optimization goal:
  – prioritize invocation of the MBi based on their past performance on
    inputs similar to FS(i)j
• Provenance correlates the quality of an output model M(i)jh to the
  intermediate feature set FS(i)j:

  M(i)jh --wasGeneratedBy--> MBh --used--> FS(i)j --wasDerivedFrom--> DS(i)
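The provenance path above can be queried to harvest exactly the evidence the recommender needs: (feature set, model builder, model quality) triples. A minimal sketch over an in-memory edge list, where the edge names follow OPM and all node IDs and quality labels are invented for illustration:

```python
# OPM-style provenance edges as (subject, relation, object) triples.
# The concrete IDs are illustrative, not taken from DiscoveryBus traces.
edges = [
    ("M1", "wasGeneratedBy", "MB_a"),
    ("MB_a", "used", "FS_1"),
    ("FS_1", "wasDerivedFrom", "DS_1"),
    ("M2", "wasGeneratedBy", "MB_b"),
    ("MB_b", "used", "FS_1"),
]
quality = {"M1": "good", "M2": "bad"}  # assessed predictive power

def evidence(edges, quality):
    """Follow M --wasGeneratedBy--> MB --used--> FS and attach the
    model's quality, yielding (FS, MB, quality) training evidence."""
    gen = {s: o for s, r, o in edges if r == "wasGeneratedBy"}
    used = {s: o for s, r, o in edges if r == "used"}
    out = []
    for model, builder in gen.items():
        fs = used.get(builder)
        if fs is not None:
            out.append((fs, builder, quality[model]))
    return out
```

In practice each builder invocation is a distinct activity node in the trace, so the simple dictionaries above suffice; a production version would run the equivalent path query against the provenance store.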
10. The recommender
• One quality matrix QM_FS is associated with each feature set FS
• QM_FS[MBh] encodes the success history of model builder MBh in the
  workflow every time FS is used as input:

      QM_FS[MBh] = ⟨G, B⟩

  – G (resp. B): the number of times MBh has been observed to produce a
    good (resp. bad) model when applied to input FS
• QM_FS is updated every time FS appears in the provenance graph
• The builders’ historical success rate induces a dynamic partial order
  on the MBi
• For each run, the recommender:
  – intercepts FS in the flow
  – returns the partial order from QM_FS
[Figure: Panel of Experts with recommender. Online provenance analysis
updates a provenance case base, from which the recommender selects experts.]
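The quality-matrix bookkeeping on this slide can be sketched directly; the data structures and the success-rate ordering G/(G+B) are a plausible reading of the slide, not the published implementation:

```python
from collections import defaultdict

# QM[FS][MB] = [G, B]: counts of good/bad models built by MB on input FS.
QM = defaultdict(lambda: defaultdict(lambda: [0, 0]))

def update(fs, mb, good):
    """Update the quality matrix from one provenance observation."""
    QM[fs][mb][0 if good else 1] += 1

def recommend(fs):
    """Rank the builders seen with this FS by success rate G/(G+B);
    abstain (return None) when the matrix row is empty."""
    row = QM.get(fs)
    if not row:
        return None
    def rate(mb):
        g, b = row[mb]
        return g / (g + b)
    return sorted(row, key=rate, reverse=True)
```

Note that `recommend` returns a total order over the builders that have been observed with this FS; builders never seen with it are simply absent, which is one way to realize the partial order mentioned above.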
12. Limitations and ongoing work
• The approach suffers when the FS space is sparse
  – most FS are seen only once
• The recommender abstains when QM_FS is not sufficiently populated
• This is the main hurdle to successful optimization
[Figure: chart illustrating the sparsity of the FS space (labels not
recoverable).]
• Strategy: increase space density by clustering the FS
  – needs a distance metric over the set of FS
  – hierarchical clustering should provide a way to experiment with
    accuracy/efficiency trade-offs
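The clustering strategy above could be prototyped as follows; the choice of Jaccard distance over feature sets and the naive single-linkage agglomeration are assumptions for the sketch, not the metric or algorithm the project settled on:

```python
# Feature sets as frozensets of feature names; Jaccard distance is one
# candidate metric over the FS space.
def jaccard(a, b):
    return 1.0 - len(a & b) / len(a | b)

def single_linkage(items, threshold):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single linkage) until the closest pair exceeds threshold.
    Cutting the hierarchy at different thresholds is what enables the
    accuracy/efficiency trade-off."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        def dist(i, j):
            return min(jaccard(a, b)
                       for a in clusters[i] for b in clusters[j])
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(*ij))
        if dist(i, j) > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters
```

Each resulting cluster would then share one quality matrix, so evidence gathered for one FS benefits all FS in its cluster, which directly increases the density of the case base.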
13. Short summary
• What happens when you try to apply mining/learning techniques to large
  corpora of provenance metadata?
  – any interesting applications / use cases?
  – which techniques apply?
  – are there significant research challenges, or an orchard of
    low-hanging fruit?
    • privacy in provenance mining
    • which provenance models lend themselves well to mining
    • ...