Process Mining for ERP Systems
1. Erik Nooijen,
Boudewijn v. Dongen, Dirk Fahland
Process Mining for ERP Systems
2. Process Discovery
event log → process discovery algorithm → process model
example log:
c1: A B C D E
c2: A C B D E
c3: A F D E
…
assumptions
• case = sequence of events of this case
• cases are isolated: event A in c1 happens only in c1 (and not in c2)
• cases of the same process
• one unique case id
• each event associated to exactly one case id
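As an illustration of what a discovery algorithm can read off such a log, the following minimal sketch counts the directly-follows relation for the three example cases. It is not the specific algorithm used in the talk; the function name `directly_follows` is a hypothetical helper, and the directly-follows relation is only one common starting point for discovery.

```python
from collections import Counter

# Event log from the slide: one trace (sequence of activities) per case id.
log = {
    "c1": ["A", "B", "C", "D", "E"],
    "c2": ["A", "C", "B", "D", "E"],
    "c3": ["A", "F", "D", "E"],
}


def directly_follows(log):
    """Count how often activity x is directly followed by y within one case."""
    counts = Counter()
    for trace in log.values():
        # Cases are isolated, so pairs are only formed inside a single trace.
        for x, y in zip(trace, trace[1:]):
            counts[(x, y)] += 1
    return counts


dfg = directly_follows(log)
for (x, y), n in sorted(dfg.items()):
    print(f"{x} -> {y}: {n}")
```

The sketch relies on exactly the assumptions listed above: every event belongs to one case, and a pair of events is only compared within that case.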
PAGE 1
3. Typical Process in an ERP System
Manufacturer (Build to Order)
[Figure: Alice orders product X (Material A, Material B); Bob orders product Y (Material B, Material C). The manufacturer orders materials from two suppliers: Material B (for both products) from ACME Inc., Material A and Material C from Mega Corp.]
PAGE 2
4. n-to-m relations
database → process discovery algorithm → process model
column roles marked in the figure: id attributes, time-stamp attributes, data attributes, relations
ProductOrder
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18
Customer
cust.  address  …
Alice  …        …
Bob    …        …
OrderedMaterial
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16
MaterialOrder
moID  suppl.  …  completed    sent         received
mo3   ACME    …  30-08 13:15  30-08 14:15  01-09 9:05
mo4   MEGA    …  30-08 13:17  30-08 16:12  01-09 10:13
PAGE 3
5. Process Discovery for ERP Systems
database → process discovery algorithm → process model
reality: data in a relational DB
• events stored as time-stamped attributes in tables
• multiple primary keys → multiple notions of case
• tables are related → one event related to multiple cases
[Figure: database schema]
Customer: cust, …
ProductOrder: poID, cust, created, processed, built, shipped
OrderedMat.: poID, moID, type, added
MaterialOrder: moID, supplier, completed, sent, received
relations: Customer 1 to 0..* ProductOrder; ProductOrder 1 to 1..* OrderedMat.; MaterialOrder 1 to 1..* OrderedMat.
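To make the "multiple notions of case" point concrete, here is a small sketch that groups the OrderedMaterial rows of the running example once by poID and once by moID. The helper `group_by` is a hypothetical name introduced only for this illustration.

```python
from collections import defaultdict

# Rows of the OrderedMaterial table from the example: (poID, moID, type, added).
ordered_material = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]


def group_by(rows, key_index):
    """Group rows by the value found in column key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


# Two different notions of case over the same rows.
by_product_order = group_by(ordered_material, 0)
by_material_order = group_by(ordered_material, 1)

# The "added" event of the first row belongs to case po1 and to case mo3.
first = ordered_material[0]
print(first in by_product_order["po1"], first in by_material_order["mo3"])
```

Each row carries one time-stamped "added" event, yet it shows up in a product-order case and in a material-order case, which is what breaks the one-event-one-case assumption of classical discovery.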
PAGE 4
7. Outline
[Figure: database → decompose by primary keys → log f. PO, log f. MO → discovery → model f. PO, model f. MO → related by primary foreign-key relations → process model]
PAGE 6
8. Find Artifact Schemas
PAGE 7
9. Step 0: discover database schema
documented schema vs. actual schema
identify
• column types (esp. time-stamped columns)
• primary keys
• foreign keys
various (non-trivial) techniques available
key discovery is NP-complete in the size of the table(s)
result: [Figure: discovered database schema]
PAGE 8
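The key-discovery part of Step 0 can be sketched as a brute-force search for minimal column sets with unique values. This is only an illustration, not one of the techniques the talk refers to: the function `candidate_keys` and its `max_size` bound are assumptions made here, and the search space grows quickly with the number of columns.

```python
from itertools import combinations


def candidate_keys(columns, rows, max_size=2):
    """Return minimal column sets (up to max_size) whose values are unique."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(columns)), size):
            # Skip supersets of keys already found: they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(rows):
                keys.append(combo)
    return [tuple(columns[i] for i in k) for k in keys]


# OrderedMaterial from the running example, without the time-stamp column.
columns = ["poID", "moID", "type"]
rows = [
    ("po1", "mo3", "B"),
    ("po1", "mo4", "A"),
    ("po2", "mo3", "B"),
    ("po2", "mo4", "C"),
]

found = candidate_keys(columns, rows)
print(found)
```

On this tiny sample both (poID, moID) and (poID, type) are unique, so data alone cannot tell the intended key from an accidental one; this is one reason why domain knowledge or a semi-automatic approach helps.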
10. Step 1: decompose schema into processes
= schema summarization
find:
1. sets of corresponding tables
2. links between those
example: clusters ProductOrder and MaterialOrder
PAGE 9
11. Automatic Schema Summarization
= group similar tables through clustering
define a distance between any 2 tables
• by relations
• by information content
tables that are close to each other → same cluster
# of clusters: user input
PAGE 10
12. Automatic Schema Summarization
1. structural distance between tables
fanout ~ avg. # of child records related to the same parent record
example: parent table A with records 1, 2; three child tables with columns (A, B)
child table rows              fanout
(1,X) (2,Y)                   1
(1,X) (1,Y)                   1 = (2+0)/2
(1,X) (1,Y) (2,Z) (2,U)       2
PAGE 11
13. Automatic Schema Summarization
1. structural distance between tables
fanout ~ avg. # of child records related to the same parent record
matched fraction ~ 1 / (fraction of records in parent with matching child record)
example: parent table A with records 1, 2; three child tables with columns (A, B)
child table rows              fanout        m.fr
(1,X) (2,Y)                   1             1
(1,X) (1,Y)                   1 = (2+0)/2   2 = 1/(1/2)
(1,X) (1,Y) (2,Z) (2,U)       2             1
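The two measures can be checked against the slide's example with a short sketch. The function names `fanout` and `matched_fraction` are chosen here for illustration; how the talk combines the two numbers into a structural distance is not shown on the slide and is not attempted.

```python
from collections import Counter


def fanout(parent_ids, child_rows):
    """Average number of child records per parent record (parents without children count as 0)."""
    children_per_parent = Counter(a for a, _ in child_rows)
    return sum(children_per_parent[p] for p in parent_ids) / len(parent_ids)


def matched_fraction(parent_ids, child_rows):
    """1 / (fraction of parent records that have at least one matching child record)."""
    matched = {a for a, _ in child_rows if a in parent_ids}
    if not matched:
        return float("inf")  # no parent has a child; treated as infinitely far here
    return len(parent_ids) / len(matched)


parents = {1, 2}
t1 = [(1, "X"), (2, "Y")]
t2 = [(1, "X"), (1, "Y")]
t3 = [(1, "X"), (1, "Y"), (2, "Z"), (2, "U")]

for name, table in (("t1", t1), ("t2", t2), ("t3", t3)):
    print(name, fanout(parents, table), matched_fraction(parents, table))
```

For the second child table, parent 1 has two children and parent 2 has none, which gives fanout (2+0)/2 = 1 and matched fraction 1/(1/2) = 2, as on the slide.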
PAGE 12
14. Grouping by Clustering
1. structural distance
2. information distance
   importance of each table = entropy (is maximal if all records are different)
   distance: 2 tables with high entropies → large distance
3. weighted distance by structure + information
4. k-means clustering: k clusters based on weighted distance
   most important table of cluster = table with least distance to all
   → key attribute of the cluster
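A minimal sketch of two ingredients named above: table entropy and the choice of the most important table in a cluster. It assumes Shannon entropy over whole-record frequencies, which may differ from the exact definition used in the talk, and the pairwise distances are made-up numbers for illustration only.

```python
import math
from collections import Counter


def table_entropy(rows):
    """Shannon entropy (in bits) of the distribution of records in a table."""
    counts = Counter(rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def most_important(cluster, dist):
    """Table with the least total distance to all other tables of the cluster."""
    return min(
        cluster,
        key=lambda t: sum(dist[frozenset((t, u))] for u in cluster if u != t),
    )


# Entropy is maximal (log2 of the record count) when all records differ.
all_different = [("po1", "mo3"), ("po1", "mo4"), ("po2", "mo3"), ("po2", "mo4")]
all_equal = [("x", "y")] * 4
print(table_entropy(all_different), table_entropy(all_equal))

# Hypothetical weighted distances between the tables of one cluster.
cluster = ["ProductOrder", "Customer", "OrderedMaterial"]
dist = {
    frozenset(("ProductOrder", "Customer")): 0.2,
    frozenset(("ProductOrder", "OrderedMaterial")): 0.3,
    frozenset(("Customer", "OrderedMaterial")): 0.9,
}
print(most_important(cluster, dist))
```

With these invented distances ProductOrder has the smallest total distance, so its primary key would serve as the key attribute of the cluster.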
PAGE 13
15. Artifact Schema → Artifact Log
PAGE 14
16. Log Extraction
cluster = set of related tables + primary key of most important table → case id
log f. PO
ProductOrder
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18
OrderedMaterial
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16
cases:
po1:
po2:
PAGE 15
17. Log Extraction
cluster = set of related tables + primary key of most important table → case id
time-stamped attribute → event
log f. PO (tables ProductOrder and OrderedMaterial as on slide 16)
po1: (created, poID=po1, time=30-08 9:22, …)
PAGE 16
18. Log Extraction
cluster = set of related tables + primary key of most important table → case id
time-stamped attribute → event
related attributes → event attributes
log f. PO (tables ProductOrder and OrderedMaterial as on slide 16)
po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)
PAGE 17
19. Log Extraction
cluster = set of related tables + primary key of most important table → case id
time-stamped attribute → event
related attributes → event attributes
log f. PO (tables ProductOrder and OrderedMaterial as on slide 16)
po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)
     (processed, poID=po1, time=30-08 13:12, …)
PAGE 18
20. Log Extraction
cluster = set of related tables + primary key of most important table → case id
time-stamped attribute → event
related attributes → event attributes
log f. PO (tables ProductOrder and OrderedMaterial as on slide 16)
po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)
     (processed, poID=po1, time=30-08 13:12, …)
     (added, poID=po1, time=30-08 13:13, moID=mo3, …) refers to artifact “MaterialOrder”
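The extraction rules above (primary key → case id, time-stamped attribute → event, related attributes → event attributes) can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation: `extract_log`, `TIMESTAMP_COLUMNS` and `ASSUMED_YEAR` are names introduced here, and the year is a placeholder because the slides show only day and month.

```python
from datetime import datetime

ASSUMED_YEAR = 2000  # placeholder: the slides give day-month and time only
TIMESTAMP_COLUMNS = {"created", "processed", "built", "shipped", "added"}


def parse(ts):
    """Parse a 'DD-MM H:MM' time stamp from the example tables."""
    return datetime.strptime(f"{ASSUMED_YEAR}-{ts}", "%Y-%d-%m %H:%M")


product_order = [
    {"poID": "po1", "cust.": "Alice", "created": "30-08 9:22",
     "processed": "30-08 13:12", "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust.": "Bob", "created": "30-08 10:15",
     "processed": "30-08 13:14", "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]


def extract_log(tables, case_id):
    """Turn every time-stamped attribute into an event of the case named by case_id."""
    log = {}
    for rows in tables:
        for row in rows:
            # Non-time-stamp attributes of the row become event attributes.
            attributes = {k: v for k, v in row.items() if k not in TIMESTAMP_COLUMNS}
            for column, value in row.items():
                if column in TIMESTAMP_COLUMNS:
                    event = {"activity": column, "time": parse(value), **attributes}
                    log.setdefault(row[case_id], []).append(event)
    for events in log.values():
        events.sort(key=lambda e: e["time"])
    return log


po_log = extract_log([product_order, ordered_material], case_id="poID")
for case, events in po_log.items():
    print(case, [e["activity"] for e in events])
```

The "added" events keep their moID attribute, which is the reference to the MaterialOrder artifact mentioned on the slide.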
PAGE 19
21. Outline
[Figure: database → decompose by primary keys → log f. order, log f. quote → discovery → model f. order, model f. quote → compose by primary foreign-key relations → process model]
PAGE 20
22. Resulting Model(s)
Product Order: create → processed → added → built → shipped
Material Order: added → completed → sent → received
the two lifecycles are linked (1..* to 1..*) through the shared event “added”, e.g. (added, poID=po1, …, moID=mo3)
PAGE 21
26. Open issues
performance
• key discovery: NP-complete in R (# of records)
• foreign key discovery: NP-complete in R²
• problem is in the “hard part” of NP
• sampling of data, domain knowledge, semi-automatic
requires good database structure
• proper relations, proper keys
• otherwise wrong clusters are formed
• events don’t get right attributes
• semi-automatic approach
events shared by multiple cases… working on it…
PAGE 25