1. Introduction to Machine Learning
Lecture 14
Advanced Topics in Association Rules Mining
Albert Orriols i Puig
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 13
Ideas come from market basket analysis (MBA)
Let’s go shopping!
Customer1: Milk, eggs, sugar, bread
Customer2: Milk, eggs, cereal, bread
Customer3: Eggs, sugar
What do my customers buy? Which products are bought together?
Aim: Find associations and correlations between the different
items that customers place in their shopping basket
Slide 2
Artificial Intelligence Machine Learning
3. Recap of Lecture 13
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3
Itemset
{B, C, E}

3rd scan: L3
Itemset     sup
{B, C, E}   2
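The level-wise procedure traced above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `apriori`, `support_counts` and `MIN_SUP` are hypothetical, and the minimum support count of 2 is an assumption inferred from which itemsets the slide drops ({D} and {A, B} with support 1) and keeps.

```python
from itertools import combinations

# Transaction database TDB from the slide (Tid -> items).
TDB = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}
MIN_SUP = 2  # assumption: threshold implied by the itemsets the slide prunes


def support_counts(candidates, transactions):
    """One database scan: count the transactions that contain each candidate."""
    return {c: sum(c <= t for t in transactions) for c in candidates}


def apriori(transactions, min_sup):
    """Return [L1, L2, ...], each a dict mapping frequent itemsets to support."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset([i]) for i in items]
    levels = []
    k = 1
    while candidates:
        counts = support_counts(candidates, transactions)
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return levels


levels = apriori(list(TDB.values()), MIN_SUP)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {"".join(sorted(s)): n for s, n in level.items()})
```

With these inputs the sketch yields three levels that match L1, L2 and L3 on the slide, ending with {B, C, E} at support 2. Each pass of the `while` loop corresponds to one scan of the database.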
4. Recap of Lecture 13
Challenges
Apriori scans the database multiple times
Most often, there is a high number of candidates
Support counting for candidates can be time-consuming
Several methods try to improve these points by
Reducing the number of scans of the database
Shrinking the number of candidates
Counting the support of candidates more efficiently
5. Today’s Agenda
Starting a journey through some advanced
topics in ARM
Mining frequent patterns without candidate
generation
Multiple Level AR
Sequential Pattern Mining
Quantitative association rules
Mining class association rules
Beyond support & confidence
Applications
6. Revisiting Candidate Generation
Remember Apriori?
Use the frequent (k-1)-itemsets to generate the candidate k-itemsets
Count itemset support by scanning the database
Bottleneck in the process: candidate generation
Suppose 100 items
First level of the tree: 100 nodes
Second level of the tree: $\binom{100}{2} = 4950$ nodes
In general, number of k-itemsets: $\binom{100}{k}$
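To get a feel for how fast the candidate space grows, the binomial coefficients can be evaluated directly. The snippet below is only an illustration of the counting argument for 100 items; the variable names are hypothetical.

```python
from math import comb

N_ITEMS = 100  # number of distinct items, as supposed on the slide

# Number of possible k-itemsets for a few values of k.
for k in (1, 2, 3, 5, 10):
    print(f"k={k:2d}: {comb(N_ITEMS, k):,} possible {k}-itemsets")

# Summing over every k counts all non-empty itemsets, i.e. 2**100 - 1.
total = sum(comb(N_ITEMS, k) for k in range(1, N_ITEMS + 1))
print(f"all non-empty itemsets: {total:,}")
```

Apriori never enumerates all of these, since it only generates candidates from frequent itemsets, but the count shows why candidate generation becomes the bottleneck when many items are frequent.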
7. Can We Avoid Generation?
Build an auxiliary structure that gathers statistics about the
itemsets in order to avoid candidate generation
Use an FP-tree
Avoid multiple scans of the data
Divide-and-conquer methodology
Avoid candidate generation
Outline of the process:
Generate an FP-Tree
Mine the FP-tree
8. Building the FP-Tree
TID Items Sorted FIS
1 {F,A,C,D,G,I,M,P} {F,C,A,M,P}
2 {A,B,C,F,L,M,O} {F,C,A,B,M}
3 {B,F,H,J,O} {F,B}
4 {B,C,K,S,P} {C,B,P}
5 {A,F,C,E,L,P,M,N} {F,C,A,M,P}
Scan the DB for the first time and identify the frequent items (1-itemsets). They
are: <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
The last column lists the frequent items of each transaction, sorted by decreasing frequency
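The first scan and the sorting step can be sketched as follows. This is an illustrative sketch: the minimum support of 3 is an assumption inferred from the frequent items listed above (the rarest kept item has count 3), and `SLIDE_ORDER` is a hypothetical helper that breaks frequency ties in the same order the slide uses.

```python
from collections import Counter

# Transactions from the slide (TID -> items, one letter per item).
DB = {
    1: "FACDGIMP",
    2: "ABCFLMO",
    3: "BFHJO",
    4: "BCKSP",
    5: "AFCELPMN",
}
MIN_SUP = 3            # assumption: implied by the frequent items on the slide
SLIDE_ORDER = "FCABMP"  # tie-break order taken from the slide (F/C and A/B/M/P tie)

# First scan: count every item and keep the frequent ones.
counts = Counter(item for items in DB.values() for item in items)
frequent = {i: n for i, n in counts.items() if n >= MIN_SUP}

# Rank frequent items by decreasing count, ties resolved as on the slide.
ordered = sorted(frequent, key=lambda i: (-frequent[i], SLIDE_ORDER.index(i)))
rank = {item: pos for pos, item in enumerate(ordered)}

# Drop infrequent items from each transaction and sort the rest by rank.
sorted_fis = {
    tid: sorted((i for i in items if i in rank), key=rank.get)
    for tid, items in DB.items()
}
for tid, fis in sorted_fis.items():
    print(tid, fis)
```

The tie-break matters: any fixed order among equally frequent items works, as long as the same order is used for every transaction.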
9. Building the FP-Tree
After reading TID1:
TID Items Sorted FIS
1 {F,A,C,D,G,I,M,P} {F,C,A,M,P}
2 {A,B,C,F,L,M,O} {F,C,A,B,M}
3 {B,F,H,J,O} {F,B}
4 {B,C,K,S,P} {C,B,P}
5 {A,F,C,E,L,P,M,N} {F,C,A,M,P}
root
  F:1
    C:1
      A:1
        M:1
          P:1
Scan the DB again to build the tree
10. Building the FP-Tree
After reading TID2:
TID Items Sorted FIS
1 {F,A,C,D,G,I,M,P} {F,C,A,M,P}
2 {A,B,C,F,L,M,O} {F,C,A,B,M}
3 {B,F,H,J,O} {F,B}
4 {B,C,K,S,P} {C,B,P}
5 {A,F,C,E,L,P,M,N} {F,C,A,M,P}
root
  F:2
    C:2
      A:2
        M:1
          P:1
        B:1
          M:1
14. Building the FP-Tree
TID Items Sorted FIS
1 {F,A,C,D,G,I,M,P} {F,C,A,M,P}
2 {A,B,C,F,L,M,O} {F,C,A,B,M}
3 {B,F,H,J,O} {F,B}
4 {B,C,K,S,P} {C,B,P}
5 {A,F,C,E,L,P,M,N} {F,C,A,M,P}
root
  F:4
    C:3
      A:3
        M:2
          P:2
        B:1
          M:1
    B:1
  C:1
    B:1
      P:1
Item index: F, C, A, B, M, P (each entry links to the tree nodes that hold that item)
Build an index to quickly access the nodes and traverse the tree
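The second scan, which inserts each sorted transaction into the tree and records the nodes of every item in an index, can be sketched like this. It is a minimal sketch under stated assumptions: the class `Node` and the functions `build_fp_tree` and `show` are hypothetical names, and the index is stored as a plain dictionary of node lists rather than as linked node pointers.

```python
class Node:
    """One FP-tree node: an item, its count, its parent and its children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}  # item -> child Node


def build_fp_tree(sorted_transactions):
    """Insert each sorted transaction as a path that shares common prefixes."""
    root = Node(None, None)
    header = {}  # item -> list of nodes holding that item (the item index)
    for items in sorted_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def show(node, depth=0):
    """Print the tree with two spaces of indentation per level."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


# Sorted frequent items of the five transactions, taken from the table.
SORTED_FIS = ["FCAMP", "FCABM", "FB", "CBP", "FCAMP"]
root, header = build_fp_tree(SORTED_FIS)
show(root)
```

Summing the counts of the nodes listed under one item in `header` gives that item's support, for example 2 + 1 = 3 for M.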
15. Mining the FP-Tree
Properties to mine the FP-tree
Node-link prop.: All possible itemsets in which the frequent item
a is included can be found by following a’s node-links
root
  F:4
    C:3
      A:3
        M:2
          P:2
        B:1
          M:1
    B:1
  C:1
    B:1
      P:1
Item index: F, C, A, B, M, P
Item P has support of 3
Two paths in the FP-tree for node P:
1. {F,C,A,M,P}
2. {C,B,P}
16. Mining the FP-Tree
Prefix path prop.: To calculate the frequent patterns for a node
a in path P, only the prefix subpath of node a in P
needs to be accumulated, and the frequency count of every
node in the prefix path should carry the same count as node a
root
  F:4
    C:3
      A:3
        M:2
          P:2
        B:1
          M:1
    B:1
  C:1
    B:1
      P:1
Item index: F, C, A, B, M, P
Node M is involved in the path:
(F:4,C:3,A:3,M:2,P:2)
Take the prefix of the path until M:
(F:4,C:3,A:3)
Adjust counts to 2:
(F:2,C:2,A:2)
So, F, C, and A co-occur with M
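The prefix path property can be turned into a small routine that collects, for a given item, every prefix path together with the count of the item's node. This is a sketch under stated assumptions: `Node`, `build_fp_tree` and `conditional_pattern_base` are hypothetical names, and the tree is rebuilt here from the sorted transactions so that the example is self-contained.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_fp_tree(transactions):
    """Build the FP-tree and an index from each item to its nodes."""
    root, header = Node(None, None), {}
    for items in transactions:
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header


def conditional_pattern_base(item, header):
    """Follow the item's node-links. For each node, climb to the root and
    give the prefix path the count of that node, not of its ancestors."""
    base = []
    for node in header.get(item, []):
        prefix, parent = [], node.parent
        while parent.item is not None:  # stop at the root
            prefix.append(parent.item)
            parent = parent.parent
        if prefix:
            base.append(("".join(reversed(prefix)), node.count))
    return base


_, header = build_fp_tree(["FCAMP", "FCABM", "FB", "CBP", "FCAMP"])
print(conditional_pattern_base("M", header))  # [('FCA', 2), ('FCAB', 1)]
print(conditional_pattern_base("P", header))  # [('FCAM', 2), ('CB', 1)]
```

For M, the first entry is the adjusted prefix (F:2,C:2,A:2) from the slide; the second entry comes from the other M node, which sits below B with count 1.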
17. Mining the FP-Tree
Fragment growth: Let α be an itemset in DB, B be α’s
conditional pattern base, and β be an itemset in B. Then, the
support of α ∪ β in DB is equivalent to the support of β in B.
For M, we had
(F:2,C:2,A:2)
(F:1,C:1,A:1,B:1)
Merging both prefix paths gives the conditional tree of M:
root
  F:3
    C:3
      A:3
        B:1
Therefore,
{(F,C,A,M):3}, {(F,C,M):3}, …
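The fragment growth property can be checked numerically on M's conditional pattern base. The sketch below is illustrative only: the function names are hypothetical and the minimum support of 3 is an assumption consistent with the frequent items of the example. It sums the counts of both prefix paths, so {F,C,A,M} gets support 2 + 1 = 3, which matches a direct count over transactions 1, 2 and 5.

```python
from itertools import combinations

# Sorted frequent items per transaction, and M's conditional pattern base.
DB = ["FCAMP", "FCABM", "FB", "CBP", "FCAMP"]
M_BASE = [("FCA", 2), ("FCAB", 1)]
MIN_SUP = 3  # assumption: same threshold as for the frequent items


def support_in_base(beta, base):
    """Support of itemset beta inside a conditional pattern base."""
    return sum(count for path, count in base if set(beta) <= set(path))


def support_in_db(itemset, db):
    """Support of an itemset counted directly on the transactions."""
    return sum(set(itemset) <= set(t) for t in db)


items = sorted({i for path, _ in M_BASE for i in path})
patterns = {}
for r in range(1, len(items) + 1):
    for beta in combinations(items, r):
        sup = support_in_base(beta, M_BASE)
        # Fragment growth: support of beta in B equals support of beta + M in DB.
        assert sup == support_in_db(beta + ("M",), DB)
        if sup >= MIN_SUP:
            patterns["".join(beta) + "M"] = sup

print(patterns)
```

Under these assumptions the frequent patterns that contain M are AM, CM, FM, ACM, AFM, CFM and ACFM, all with support 3; every pattern that includes B is dropped because B occurs only once in M's conditional pattern base.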
18. Is FP-growth Faster than Apriori?
As the support threshold goes down, the number of itemsets
increases dramatically. FP-growth does not need to generate
candidates and test them.
19. Is FP-growth Faster than Apriori?
Both FP-growth and Apriori scale linearly with the number of
transactions, but FP-growth is more efficient.
20. Next Class
Advanced topics in association rule mining