Full paper: http://boole.diiga.univpm.it/paper/sac2012.pdf
In many experimental domains, especially e-Science, workflow management systems are gaining increasing attention to design and execute in-silico experiments involving data analysis tools. As a by-product, a repository of workflows is generated, that formally describes experimental protocols and the way different tools are combined inside experiments.
In this paper we describe the use of the SUBDUE graph clustering algorithm to discover sub-workflows from a repository. Since sub-workflows represent significant usage patterns of tools, the discovered knowledge can be exploited by scientists to learn by-example about design practices, or
to retrieve and reuse workflows. Such a knowledge, ultimately, leverages the potential of scientific workflow repositories to become a knowledge-asset. A set of experiments
is conducted on the
myExperiment repository to assess the
effectiveness of the approach.
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Mining Usage Patterns from a Repository of Scientific Workflow
1. Mining Usage Patterns from a Repository
of Scientific Workflows
Claudia Diamantini, Domenico Potena, Emanuele Storti
´
DII, Universita Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21
2. Introduction
Process mining techniques allow for extracting knowledge from event
logs, i.e. traces of running processes:
audit trails of a workflow management system
transaction logs of an enterprise resource planning system
Main applications:
process discovery: what is really happening?
conformance check: are we doing what was agreed upon?
process extension: how can we redesign the process?
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21
3. Introduction
Process mining techniques allow for extracting knowledge from event
logs, i.e. traces of running processes:
audit trails of a workflow management system
transaction logs of an enterprise resource planning system
Main applications:
process discovery: what is really happening?
conformance check: are we doing what was agreed upon?
process extension: how can we redesign the process?
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21
4. Introduction
Process mining techniques allow for extracting knowledge from event
logs, i.e. traces of running processes:
audit trails of a workflow management system
transaction logs of an enterprise resource planning system
Main applications:
process discovery: what is really happening?
conformance check: are we doing what was agreed upon?
process extension: how can we redesign the process?
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21
5. Motivation
Little work about how to extract knowledge from a set of models (i.e.:
process schemas)
understanding of common patterns of usage (best/worst
practices)
support in process design
process integration (from multiple sources)
process optimization/re-engineering
indexing of processes and their retrieval
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 3 / 21
6. Methodology
Processes schemas have an inherent graph structure: many
graph-mining tools available
Approach
1 representation of original processes into graphs
2 usage of graph-mining techniques to extract SUBs
Which representation form?
Which specific technique?
Graph clustering techniques (sub-processes are clusters) with a
hierarchical approach (various levels of abstractions, more/less
specificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
7. Methodology
Processes schemas have an inherent graph structure: many
graph-mining tools available
Approach
1 representation of original processes into graphs
2 usage of graph-mining techniques to extract SUBs
Which representation form?
Which specific technique?
Graph clustering techniques (sub-processes are clusters) with a
hierarchical approach (various levels of abstractions, more/less
specificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
8. Methodology
Processes schemas have an inherent graph structure: many
graph-mining tools available
Approach
1 representation of original processes into graphs
2 usage of graph-mining techniques to extract SUBs
Which representation form?
Which specific technique?
Graph clustering techniques (sub-processes are clusters) with a
hierarchical approach (various levels of abstractions, more/less
specificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
9. Methodology
Processes schemas have an inherent graph structure: many
graph-mining tools available
Approach
1 representation of original processes into graphs
2 usage of graph-mining techniques to extract SUBs
Which representation form?
Which specific technique?
Graph clustering techniques (sub-processes are clusters) with a
hierarchical approach (various levels of abstractions, more/less
specificity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21
10. SUBDUE
SUBDUE [Joyner, 2000]
graph-based hierarchical clustering algorithm
suited for discrete-valued and structured data
searches for substructures (i.e., subgraphs) that best compress
the input graph, according to MDL
Comments:
to compress G with a SUB: to replace every occurrence of SUB in
G with a single new node
SUB best compresses G: num. of bits needed to represent G after
the compression is minimum
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21
11. SUBDUE
SUBDUE [Joyner, 2000]
graph-based hierarchical clustering algorithm
suited for discrete-valued and structured data
searches for substructures (i.e., subgraphs) that best compress
the input graph, according to MDL
Comments:
to compress G with a SUB: to replace every occurrence of SUB
in G with a single new node
SUB best compresses G: num. of bits needed to represent G after
the compression is minimum
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21
12. SUBDUE
SUBDUE [Joyner, 2000]
graph-based hierarchical clustering algorithm
suited for discrete-valued and structured data
searches for substructures (i.e., subgraphs) that best compress
the input graph, according to MDL
Comments:
to compress G with a SUB: to replace every occurrence of SUB in
G with a single new node
SUB best compresses G: num. of bits needed to represent G
after the compression is minimum
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21
13. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
14. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
15. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
16. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
17. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
18. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
19. SUBDUE (cont’d)
1 searches for the best SUB (i.e., that minimizes the Description
Length of the graph)
2 compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined
SUBs. Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21
20. SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
interClusterDistance
quality = intraClusterDistance
Desirable features of a good clustering:
few and big clusters (large coverage, good generality)
minimal/no overlap (better defined concepts)
New indexes
completeness: % of original nodes/edges still present in the final
lattice
representativeness (of a SUB): % of input graphs holding the
SUB at least once
significance: qualitative measure of the meaningfulness of a
cluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
21. SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
interClusterDistance
quality = intraClusterDistance
Desirable features of a good clustering:
few and big clusters (large coverage, good generality)
minimal/no overlap (better defined concepts)
New indexes
completeness: % of original nodes/edges still present in the final
lattice
representativeness (of a SUB): % of input graphs holding the
SUB at least once
significance: qualitative measure of the meaningfulness of a
cluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
22. SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
interClusterDistance
quality = intraClusterDistance
Desirable features of a good clustering:
few and big clusters (large coverage, good generality)
minimal/no overlap (better defined concepts)
New indexes
completeness: % of original nodes/edges still present in the final
lattice
representativeness (of a SUB): % of input graphs holding the
SUB at least once
significance: qualitative measure of the meaningfulness of a
cluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
23. SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchic domains
interClusterDistance
quality = intraClusterDistance
Desirable features of a good clustering:
few and big clusters (large coverage, good generality)
minimal/no overlap (better defined concepts)
New indexes
completeness: % of original nodes/edges still present in the final
lattice
representativeness (of a SUB): % of input graphs holding the
SUB at least once
significance: qualitative measure of the meaningfulness of a
cluster w.r.t. the domain and application
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21
24. Representation
Approach
1 representation of original processes into graphs
2 usage of graph-mining techniques to extract SUBs
Common operators:
sequence
split-AND (parallel split), split-XOR (exclusive choice)
join-AND (synchronization), join-XOR (simple merge)
How to map the process to the graph? Several choices
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 8 / 21
25. Representation models: A rule
lowest level of compactness
every operator is represented by a node
called operator, linked to another node
specifying its type (AND/OR/SEQ) and to
its operands
join/split are distinguishable by the
number of in/out-going arcs
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 9 / 21
26. Representation models: B rule
medium level of compactness
the closest to the original representation
join/split are replaced by different nodes,
one for each kind of operator
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 10 / 21
27. Representation models: C rule
highest level of compactness
implicit representation of operators
multiple alternative graphs in case of
SPLIT-XOR or JOIN-XOR
arcs are labeled to preserve information
about domain/range nodes
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 11 / 21
28. Evaluation
Comments
the three representations hold the same information, with no loss
easy extension with more attributes (actor name/role, info about
time/resources, ...):
attribute name → edge
value → a node
Need of experimentations to assess which one performs best w.r.t.
the quality of clustering.
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21
29. Evaluation
Comments
the three representations hold the same information, with no loss
easy extension with more attributes (actor name/role, info about
time/resources, ...):
attribute name → edge
value → a node
Need of experimentations to assess which one performs best w.r.t.
the quality of clustering.
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21
30. Evaluation
Comments
the three representations hold the same information, with no loss
easy extension with more attributes (actor name/role, info about
time/resources, ...):
attribute name → edge
value → a node
Need of experimentations to assess which one performs best w.r.t.
the quality of clustering.
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21
31. my
Experiment
Dataset:
258 processes (XScufl language for Taverna framework)
e-Science WF for distributed computation over scientific data
applications: local tools, Web Services, scripts
Setting:
WF extraction/parsing
preprocessing
translation into graphs (A,B,C
rules)
SUBDUE execution
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 13 / 21
33. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
34. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
35. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
36. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
37. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
38. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
39. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
40. Experimental results
A rule B rule C rule
Size (nodes) 5618 2871 2414
SUBs (num) 671 611 580
top SUBs 2.24% 52.37% 52.93%
Completeness 95.41% 90.90% 90.27%
Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%
Significance of top level clusters – + ++
Execution time (hh:mm) 03:38 00:06 00:03
Considerations:
A model: low significance, high execution time
No definitively best choice between B and C:
B model: higher representativeness → indexing
C model: higher significance → usage pattern discovery
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21
41. Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse my Experiment to find every workflow with X:
too many results
given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1 find all usage patterns for X (i.e., top SUBs containing X)
2 analyse their representativeness
3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
42. Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse my Experiment to find every workflow with X:
too many results
given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1 find all usage patterns for X (i.e., top SUBs containing X)
2 analyse their representativeness
3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
43. Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse my Experiment to find every workflow with X:
too many results
given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1 find all usage patterns for X (i.e., top SUBs containing X)
2 analyse their representativeness
3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
44. Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse my Experiment to find every workflow with X:
too many results
given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1 find all usage patterns for X (i.e., top SUBs containing X)
2 analyse their representativeness
3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
45. Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse my Experiment to find every workflow with X:
too many results
given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1 find all usage patterns for X (i.e., top SUBs containing X)
2 analyse their representativeness
3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
46. Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse my Experiment to find every workflow with X:
too many results
given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1 find all usage patterns for X (i.e., top SUBs containing X)
2 analyse their representativeness
3 choose one of them or browse more in depth the lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21
47. Use case (cont’d)
Applications:
reference during process design (learn-by-example)
process retrieval: find every my Experiment WF containing such a
pattern
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 17 / 21
48. Use case (cont’d)
Applications:
reference during process design (learn-by-example)
process retrieval: find every my Experiment WF containing such a
pattern
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 17 / 21
49. Discussion
A graph-based clustering technique (SUBDUE) to recognize the most
frequent/common subprocesses among a set of workflows/process
schemas.
Comments
several representation alternatives for application of SUBDUE
evaluations on specialized e-Science domain
useful to find typical patterns and schemas of usage for tools/tasks
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 18 / 21
50. Conclusion
Applications:
recognition of reference processes (common/best/worst
practices)
organize a process repository (by indexing processes through
substructures)
enterprise integration: find similarities (differences, overlapping,
complementariness) in BP in different companies
Future work:
extending experimentations, especially with (real) Business
Processes
management of heterogeneity (syntactic/semantic, level of
granularity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 19 / 21
51. Mining Usage Patterns from a Repository
of Scientific Workflows
Claudia Diamantini, Domenico Potena, Emanuele Storti
´
DII, Universita Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 20 / 21
52. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
53. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
54. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
55. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
56. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
57. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
58. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
59. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
60. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21
61. SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
Init: set of SUBs consisting of all uniquely labeled vertices
extend each SUB in all possible ways by a single edge and a vertex
order SUBs according to MDL principle
keep only the top n SUBs
End: exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Subsequent SUBs may be defined in terms of previously defined SUBs
Iterations of this basic step results in a lattice
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21