A new algorithm has been developed to overcome these difficulties. In this algorithm the number of edges in the generated graph is smaller than in the adjacency lattice, and the algorithm is still capable of finding all the essential rules.
This paper is further divided into sections as follows: Section 2 describes the work done by Charu C. Aggarwal (2001). Section 3 describes the newly proposed algorithm. Section 4 illustrates the existing and the proposed algorithms. In the last part, the two algorithms are compared and their complexities are derived.

II EXISTING ALGORITHM FOR ONLINE RULE GENERATION

The aim of Association Rule Mining (Agrawal et al., 1994) is to detect relationships or patterns between specific values of categorical variables in large data sets. Agrawal suggests a graph-theoretic approach. The main idea of association rule mining in the existing algorithm is to partition the attribute values into transaction patterns. Basically, this technique enables analysts and researchers to uncover hidden patterns in large data sets. Here the pre-processed data is stored in such a way that online rule generation can be done with a complexity proportional to the size of the output. In the existing algorithm, the concept of an adjacency lattice of itemsets has been introduced. This adjacency lattice is crucial to performing effective online data mining. The adjacency lattice can be stored either in main memory or on secondary memory. The idea of the adjacency lattice is to prestore a number of large itemsets at the lowest level of support possible given the available memory. These itemsets are stored in a special format (called the adjacency lattice) which reduces the disk I/O required to perform the query. In fact, if enough main memory is available for the entire adjacency lattice, then no I/O may need to be performed at all.

A. Adjacency lattice

An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. Specifically, an itemset X is said to be a parent of the itemset Y if Y can be obtained from X by adding a single item to the set X. Clearly, an itemset may have more than one parent and more than one child. In fact, the number of parents of an itemset X is exactly equal to the cardinality of the set X. This observation follows from the fact that for each element i_r in an itemset X, X − i_r is a parent of X. If a directed path exists from the vertex corresponding to Z to the vertex corresponding to X in the adjacency lattice, then Z ⊆ X. In such a case, X is said to be a descendant of Z and Z is said to be an ancestor of X.

B. The Existing Algorithm

There are three steps in the existing algorithm explained by Aggarwal et al. (2001).

STEP 1: Generation of the adjacency lattice:
The adjacency lattice is created from the frequent itemsets generated by any standard algorithm under some minimum support. This support value is called the primary threshold value. The itemsets so obtained are referred to as prestored itemsets, and can be stored in main memory or secondary memory. This is beneficial in the sense that we need not scan the dataset again and again for the different values of minimum support and confidence given by the user.
The adjacency lattice L is a directed acyclic graph. An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. The adjacency lattice L is constructed as follows: construct a graph with a vertex v(I) for each primary itemset I. Each vertex v(I) carries a label corresponding to the value of its support, denoted by S(I). For any pair of vertices corresponding to itemsets X and Y, a directed edge exists from v(X) to v(Y) if and only if X is a parent of Y. Note that it is not possible to perform online mining of association rules at levels less than the primary threshold.

STEP 2: Online generation of itemsets:
Once the adjacency lattice has been stored in RAM, the user can retrieve specific large itemsets on demand. Suppose the user wants to find all large itemsets which contain a set of items I and satisfy a level of minimum support s; then the following search must be solved in the adjacency lattice: for the given itemset I, find all itemsets J such that v(J) is reachable from v(I) by a directed path in the lattice L and S(J) ≥ s.

STEP 3: Rule generation:
Rules are generated from these prestored itemsets for a user-defined minimum support and minimum confidence.

III PROPOSED ALGORITHM

The algorithm by Charu et al. (2001) was discussed in the previous section; the proposed algorithm is discussed in detail in the current section. A graph-theoretic approach has been used in the proposed algorithm. The graph generated is a directed graph with weights associated with the edges, and the number of edges is smaller than in the algorithm suggested by Charu et al.

A. Algorithm

The algorithm has two steps, explained below. The first step, described in Section 3(A), explains how the graph is constructed; the second step, described in Section 3(B), explains rule generation.
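Before turning to the construction details, the existing algorithm's behaviour (STEPs 1-3 above) can be sketched in code. This is a minimal illustration under our own naming — itemsets as Python frozensets, a `support` dictionary holding the prestored itemsets of the worked example in Section IV — not the paper's implementation:

```python
from itertools import combinations

# Prestored frequent itemsets with their supports (primary threshold 0.4),
# taken from the worked example in Section IV (Tables 4.1-4.3).
support = {frozenset(k): v for k, v in {
    ("A",): 0.8, ("B",): 0.8, ("C",): 0.6, ("D",): 0.8, ("F",): 0.4,
    ("A", "B"): 0.6, ("A", "C"): 0.4, ("A", "D"): 0.6, ("B", "C"): 0.4,
    ("B", "D"): 0.6, ("B", "F"): 0.4, ("C", "D"): 0.6, ("D", "F"): 0.4,
    ("A", "B", "D"): 0.4, ("A", "C", "D"): 0.4,
    ("B", "C", "D"): 0.4, ("B", "D", "F"): 0.4,
}.items()}

# STEP 1: adjacency lattice -- a directed edge v(X) -> v(Y) exists
# iff X is a parent of Y, i.e. Y is X plus exactly one extra item.
children = {x: [y for y in support if len(y) == len(x) + 1 and x < y]
            for x in support}

def reachable(start):
    """All itemsets whose vertices are reachable from v(start) by a directed path."""
    seen, stack = set(), [start]
    while stack:
        for y in children.get(stack.pop(), []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def online_itemsets(items, s):
    """STEP 2: all large itemsets containing `items` with support >= s."""
    i = frozenset(items)
    found = {j for j in reachable(i) if support[j] >= s}
    if support.get(i, 0) >= s:
        found.add(i)
    return found

def online_rules(items, s, minconf):
    """STEP 3: rules X => (J - X) with confidence supp(J)/supp(X) >= minconf."""
    rules = []
    for j in online_itemsets(items, s):
        for r in range(1, len(j)):
            for x in map(frozenset, combinations(j, r)):
                if x in support and support[j] / support[x] >= minconf:
                    rules.append((x, j - x))
    return rules
```

For example, `online_itemsets(("A",), 0.4)` returns the six itemsets A, AB, AC, AD, ABD and ACD, exactly the itemsets whose vertices are reachable from v(A) in the lattice.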
Construction of adjacency lattice

The large itemsets obtained by applying some traditional algorithm for finding frequent itemsets (such as Apriori) are stored in one file, and the corresponding support values are stored in another file. Using these two files we can store the itemsets and their corresponding supports in a structure, say S. Now create an array of structures s(i, j) having two fields, itemsets and support. This array of structures is used to store the large itemsets of different lengths in different dimensions: in the itemsets field of s(i, j) we store the 1-itemsets in s(1, j), the 2-itemsets in s(2, j), the 3-itemsets in s(3, j), and so on. We have written a function for this purpose named Initialize(). The pseudo code for Initialize() is given in the following:

Algorithm Initialize(S)
Begin
    for each large itemset ∈ S do
        item1 = S(i).itemset;
        item2 = S(i+1).itemset;
        M1 = length(item1);
        M2 = length(item2);
        s(j, k).itemsets = item1;
        s(j, k).support = S(i).support;
        increment k;
        if (M1 − M2 != 0)    // lengths of consecutive itemsets differ
            put the following itemsets in the next row of s;
    return s;
End;

Now, to calculate the weight of the edge between an itemset X and an itemset Y, where (X − Y) is a 1-itemset, calculate the value support(X)/support(Y); if this value is greater than or equal to the minimum confidence, then there is an edge between the itemset X and the itemset Y, and this edge has weight = support(X)/support(Y). A function is now required to generate the adjacency matrix using the structures S and s. This function takes one large itemset from s(i, j) and compares it with all the itemsets in s(i+1, j). If any subset of this itemset is present in s(i+1, j), then it must be determined whether there is a link between them and, if so, what the weight of the link is.
Let an itemset X from structure s(i, j) be taken and searched for in S. When the index of itemset X in the structure S, say index1, is obtained, we can easily get the support of this itemset X. Now search for all subsets of this itemset in s(i+1, j). We need the support of each itemset Y which is present in s(i+1, j) and is also a subset of the itemset X from s(i, j). The index of the itemset Y, say index2, is obtained by searching for it in the structure S. Now the weight = S(index1).support/S(index2).support is calculated; if it is greater than or equal to the minimum confidence, then in the adjacency matrix, say a, a[index1, index2] is assigned a value equal to the weight. The pseudo code for gen_adj_lattice() is given in the following:

Algorithm gen_adj_lattice(S, s)
Begin
    for each row of s do
        item1 = s(i, j).itemsets;
        index1 = find_index(item1, S);
        // finding all subsets of item1 in s(i+1, j)
        for each itemset in s(i+1) do
            item2 = s(i+1, k).itemsets;
            if (item1 is a superset of item2)
                index2 = find_index(item2, S);
                confidence = S(index1).support/S(index2).support;
                if (confidence >= minconf)
                    adj_lat(index1, index2) = confidence;
    return adj_lat;
End;

In the above gen_adj_lattice() function there is a sub-function, find_index(), which searches for an element in the structure S and returns the index of that itemset in the structure. Using this index we can get the support of the corresponding large itemset. Let an itemset X be searched for in S. First find the length of the itemset X, then start traversing the structure S; only when the length of the current itemset equals the length of the itemset being searched for are the two itemsets compared. If all the items of both itemsets match, then return the index. The pseudo code for find_index() is given in the following:

Algorithm find_index(item, S)
Begin
    N1 = length(item);
    for each itemset in S do
        item2 = S(r).itemsets;
        N2 = length(item2);
        if (N1 == N2)    // lengths of the itemsets are equal
            if (each item matches)
                index = r;
                return index;
End;

The graph generated is a directed graph in which the largest itemsets are at the first level and the 1-large itemsets are at the lowest level. The edges are directed from the (n−1)th level to the nth level, and the weight of an edge equals the support of the itemset at the (n−1)th level divided by the support of the itemset at the nth level.

B. Generation of Rules

Each node in the directed graph is chosen in turn for rule generation. Call that node the starting node and perform a depth-first search in the directed graph. Rules are generated from the visited node and the starting node if and only if all the conditions required to generate an essential rule are satisfied.
Conditions:
1. The product of the confidences along the path between the starting node and the visited node must be greater than or equal to the minimum confidence.
2. To reduce simple redundancy: we generate the set of all children of the visited node, and this set of child nodes is compared with the nodes that have already been used by the same starting node for rule generation. If any one of the child nodes is found there, no rule can be generated from this visited node, since such a rule would be redundant.
The pseudo code for find_allChild() is given in the following:
Algorithm find_allChild(adj_lat, i)
Begin
    C1 = C = NULL;
    C1 = C = child(adj_lat, i);
    while C1 != NULL do
        for each c ∈ C1 do
            C1 = child(adj_lat, c);
            C = C ∪ C1;
    return C;
End;

We have a structure, say G, which stores the nodes that have already been used for generating rules. They are stored in such a way that we can obtain the required nodes simply by accessing the corresponding index. The pseudo code for this is given in the following:

Algorithm node_gen_rule(nodeset: S, G)
Begin
    generatedSet = NULL;
    for each node S(i) ∈ S do
        generatedSet = generatedSet ∪ G(S(i));
    return generatedSet;
End;

To reduce strict redundancy:
A) We generate the set of all parents of the starting node, and for all these parent nodes we find all the nodes which have been used for rule generation by them. This set of nodes is then compared with the visited node. If the visited node is found there, no rule can be generated from it, because such a rule would be strictly redundant. The pseudo code for find_allParents() is given in the following.
B) We generate the set of all children of the visited node and the set of all parents of the starting node, and for all these parent nodes we find all the nodes which have been used for rule generation by them. This set of nodes is then compared with the set of all children. If any child of the visited node is found there, no rule can be generated from this visited node, because such a rule would be strictly redundant.

Algorithm find_allParents(adj_lat, i)
Begin
    P1 = P = NULL;
    P1 = P = parents(adj_lat, i);
    while P1 != NULL do
        for each p ∈ P1 do
            P1 = parents(adj_lat, p);
            P = P ∪ P1;
    return P;
End;

Algorithm GenerateRule(starting node: X, visited node: Y, min conf: c, G)
Begin
    RuleSet = NULL;
    c1 = weighted product of the path (X, Y);
    if (c1 >= c)
        if (~compare(find_allChild(adj_lat, Y), node_gen_rule(X, G)))
            if (~compare(node_gen_rule(find_allParents(adj_lat, X), G), Y))
                if (~compare(find_allChild(adj_lat, Y), node_gen_rule(find_allParents(adj_lat, X), G)))
                    RuleSet = RuleSet ∪ (Y => (X − Y));
    return RuleSet;
End;

IV. ILLUSTRATION OF EXISTING AND PROPOSED ALGORITHMS

We now illustrate both algorithms with an example. The market basket data set considered has five transactions and five items. Let the minimum support be 0.4 and the minimum confidence be 0.67. The large itemsets obtained, having support values greater than or equal to 0.4, are shown along with their support values in Tables 4.1 to 4.3.

Table 4.1: 1-large itemsets
ITEMS       SUPPORT
A = Bread   0.8
B = Milk    0.8
C = Beer    0.6
D = Diaper  0.8
F = Coke    0.4

Table 4.2: 2-large itemsets
AB  0.6
AC  0.4
AD  0.6
BC  0.4
BD  0.6
BF  0.4
CD  0.6
DF  0.4
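The edge weights of the proposed graph can be recomputed directly from the supports above. A small sketch (our own code, with itemsets written as strings), where the weight of an edge X → Y is support(Y)/support(X), i.e. the confidence of the rule X => (Y − X):

```python
# Supports of the 1-large and 2-large itemsets (Tables 4.1 and 4.2).
support = {
    "A": 0.8, "B": 0.8, "C": 0.6, "D": 0.8, "F": 0.4,
    "AB": 0.6, "AC": 0.4, "AD": 0.6, "BC": 0.4,
    "BD": 0.6, "BF": 0.4, "CD": 0.6, "DF": 0.4,
}

def edge_weight(x, y):
    """Weight of the edge x -> y: confidence of the rule x => (y - x)."""
    return round(support[y] / support[x], 2)

# All edges from a k-itemset to a (k+1)-itemset that extends it by one item.
weights = {(x, y): edge_weight(x, y)
           for x in support for y in support
           if len(y) == len(x) + 1 and set(x) < set(y)}
```

This reproduces the 1-itemset to 2-itemset block of Table 4.4 (for example the weight of A – AB is 0.6/0.8 = 0.75 and that of C – AC is 0.4/0.6 ≈ 0.67); because it enumerates every parent-child pair, it also yields weights such as F – BF = 1.0 that the table omits.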
Table 4.3: 3-large itemsets
ABD  0.4
ACD  0.4
BCD  0.4
BDF  0.4

A. Rule Generation from the proposed algorithm

The weights of the edges from frequent 1-itemsets to frequent 2-itemsets and from frequent 2-itemsets to frequent 3-itemsets are shown in Table 4.4. The weights are calculated in the following manner: let X be a k-itemset and Y be a (k+1)-itemset; then the weight of the edge from X to Y is equal to the confidence of the rule X => (Y − X).

Table 4.4: Weights of the edges
Edges      Weights
A – AB     0.75
A – AC     0.5
A – AD     0.75
B – AB     0.75
B – BC     0.5
B – BD     0.75
B – BF     0.5
C – AC     0.67
C – BC     0.67
C – CD     1.0
D – DF     0.5
D – AD     0.75
D – BD     0.75
D – CD     0.75
AB – ABD   0.67
AC – ACD   0.67
AD – ABD   0.67
AD – ACD   0.67
BC – BCD   1.0
BD – BCD   0.67
BF – BDF   1.0
CD – ACD   0.67
CD – BCD   0.67

The lattice generated for the above example is shown in Figure 4.1.

Figure 4.1: Lattice Structure

The resultant graph is shown in Figure 4.2. We can see that there are more edges in the lattice generated for the same example; these extra edges are shown dotted.

Figure 4.2: Graph generated for the rule generation

Figure 4.3: Generating the rules for the large itemset ABD

Applying depth-first search starting from the node ABD, the node A is the first visited node, but the weighted product (0.67 × 0.75) of the path from A to ABD is less than the minimum confidence, so node A does not participate in rule generation. Node B is the second visited node, but it too does not participate in rule generation, for the same reason. The next visited node is AB, and the weighted product of the path from AB to ABD is 0.67, which is equal to the minimum confidence. The child nodes of AB do not generate any rule, and AB has not been used by any of the parent nodes of ABD. Thus all three conditions are satisfied for rule generation, so we generate the rule
from AB: AB => D. The next visited node is D, but the weighted product of the path from D to ABD is less than the minimum confidence, hence no rule is generated there and we move on to the next visited node, AD, which satisfies all three conditions, giving the rule AD => B. The next visited node is BD, and this node also satisfies all three conditions; thus we have the rule BD => A.
Similarly, generating the rules for the large itemsets ACD, BCD, BDF, AB, AD, BD, AC, BC, CD, BF and DF, we obtain the rules shown in Table 4.5 below.

Table 4.5: The rules generated
1   AB => D
2   AD => B
3   BD => A
4   C => AD
5   AD => C
6   C => BD
7   BD => C
8   F => BD
9   BD => F
10  A => B
11  B => A
12  A => D
13  D => A
14  B => D
15  D => B
16  D => C

B. Rules Generated from the Existing algorithm

Generating the rules for the large itemset ABD: choose all the ancestors of ABD which have support less than or equal to the value support(ABD)/c = 0.4/0.67 ≈ 0.6. AB, AD and BD are selected, so we have the following lattice. We can easily see that AB, AD and BD are the maximal ancestors in the directed graph shown in the figure. Hence we have three rules:
AB => D, AD => B, BD => A

Figure 4.4: Directed Graph in the Adjacency Lattice

A total of 16 rules is generated by both algorithms. It was found that no essential rules are missing in the proposed algorithm and that there is no redundancy in the rules generated.

C. Comparison of Algorithms

The complexity of a graph search algorithm is proportional to the size of its output.
Theorem: The number of edges in the adjacency lattice is equal to the sum of the number of parents of each primary itemset.
Let N(I, s) be the number of primary itemsets in R(I, s). The size of the output in the existing algorithm is N(I, s) · h(I, s), so its complexity is proportional to N(I, s) · h(I, s). In the proposed algorithm some edges are left that are not visited from their parents; let the corresponding nodes be denoted by L(I, s). The size of the output in this case is N(I, s) · h(I, s) − L(I, s), so the complexity of the proposed algorithm is proportional to N(I, s) · h(I, s) − L(I, s).

CONCLUSION AND FUTURE WORK

In this paper, data mining and one of its important techniques, association rule mining, have been discussed. The issues related to association rule mining were described, and online mining of association rules was introduced to resolve these issues. Online association mining helps to remove redundant rules and gives the user a compact representation of the rules. A new algorithm has been proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm, while it still generates all the essential rules, with no rule missing.
Future work will implement both the existing and the proposed algorithms and test them on large datasets such as the Zoo, Mushroom and synthetic datasets.

REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A., "Mining association rules between sets of items in large databases." SIGMOD 1993, pp. 207-214.
[2] Charu C. Aggarwal and Philip S. Yu, "A New Approach to Online Generation of Association Rules." IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 4, pp. 327-340, 2001.
[3] Dao-I Lin and Zvi M. Kedem, "Pincer search: An efficient algorithm to find the maximal frequent itemset." IEEE Transactions on Knowledge and Data Engineering, no. 3, pp. 333-344, May/June 2002.
[4] Bing Liu, W. Hsu, and Y. Ma, "Mining association rules with multiple minimum supports." In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337-341, N.Y., 1999. ACM Press.
[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases." Proc. ACM SIGMOD Conf. on Management of Data, Washington, DC, May 1993.
[6] Ramakrishnan Srikant, Quoc Vu and Rakesh Agrawal, "Mining association rules with item constraints." In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, August 1997.
[7] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules." In Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), 1994.
[8] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Academic Press, 2001.