This document proposes using subjective measures like profitability and loyalty, in addition to objective measures like support and confidence, for sequential pattern mining to identify potentially profitable customer groups.
It discusses limitations of existing sequential pattern mining approaches that focus only on frequency. The proposed approach incorporates emerging patterns, profit constraints, total monetary value, compactness, and recency to identify sequential patterns and customer segments that are truly useful for businesses.
Simulation studies on real-world datasets show the proposed approach using multiple measures outperforms existing methods in terms of runtime, memory usage, and ability to find meaningful patterns for targeting potential customers. Evaluation of various interestingness measures is also discussed.
included in the pattern have low profits, the pattern has not occurred recently, or the pattern has not been active over a given time span. To address these issues, we incorporate subjective parameters such as loyalty and profitability alongside objective parameters such as support and confidence in SPM.
II. LITERATURE STUDY
Massive amounts of business data are generated and stored in databases every day. Mining association rules [23] from transactional data was a popular and important knowledge discovery technique [20] in the 1990s. Association rules (ARs) mined from retail data can provide valuable information on customer buying behaviour. Some applications, however, need to capture time-related phenomena in timestamped sequential data. Sequential pattern mining (SPM) was first introduced for this purpose by Agrawal and Srikant [2]. Consider a dataset consisting of "data-sequences", which are lists of items purchased by individual customers over time. The goal of SPM is to find all the frequent subsequences in the dataset [2].
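The notion of a frequent subsequence can be made concrete with a minimal sketch. The toy database, the item names and the helper functions `is_subsequence` and `support` below are illustrative assumptions and are not taken from the paper; the sketch only shows how the support of one candidate pattern could be counted.

```python
from typing import List, Set

Sequence = List[Set[str]]  # one customer's itemsets, in time order


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """Return True if every itemset of `pattern` is contained, in order,
    in a distinct itemset of `sequence` (greedy leftmost matching)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a later transaction
    return True


def support(pattern: Sequence, database: List[Sequence]) -> float:
    """Fraction of data-sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)


# Hypothetical toy database: three customers, itemsets ordered by time.
database = [
    [{"bread"}, {"milk", "eggs"}, {"butter"}],
    [{"bread"}, {"butter"}],
    [{"milk"}, {"bread"}],
]
pattern = [{"bread"}, {"butter"}]
print(support(pattern, database))  # two of the three customers match
```

A pattern is reported as frequent when this fraction reaches the user-chosen minimum support.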
A. Sequential Pattern Mining (SPM) Technique
Existing SPM techniques are divided into two categories: (1) Apriori-based SPM and (2) FP-growth-based SPM [1][13].
(1) Apriori based SPM
The GSP algorithm is an extension of the Apriori model. It works on a breadth-first principle and uses a "generating-pruning" method [25]. SPADE needs only three database scans to extract the sequential patterns. Its main idea is to cluster the frequent sequences by their common prefixes and to keep the record of candidate sequences loaded in main memory [27]. SPAM uses a vertical bitmap representation of the database, held in main memory, for both candidate representation and support counting [3].
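The vertical bitmap idea can be illustrated with a small sketch. This is not the SPAM implementation from [3]; it is a simplified illustration that assumes one Python integer per (item, sequence) pair, with bit i set when the item occurs in the i-th transaction. The function names `build_bitmaps` and `s_step` and the toy database are hypothetical.

```python
from typing import Dict, List, Set


def build_bitmaps(db: List[List[Set[str]]]) -> Dict[str, List[int]]:
    """Vertical layout: for each item, one integer bitmap per data-sequence."""
    bitmaps: Dict[str, List[int]] = {}
    for sid, seq in enumerate(db):
        for pos, itemset in enumerate(seq):
            for item in itemset:
                bitmaps.setdefault(item, [0] * len(db))[sid] |= 1 << pos
    return bitmaps


def s_step(prefix_bits: List[int], item_bits: List[int]) -> List[int]:
    """Sequence-extension: keep occurrences of the item that lie strictly
    after the first occurrence of the prefix in the same data-sequence."""
    out = []
    for p, i in zip(prefix_bits, item_bits):
        if p == 0:
            out.append(0)
            continue
        first = p & -p                # lowest set bit of the prefix bitmap
        after = ~((first << 1) - 1)   # mask of all strictly later positions
        out.append(i & after)
    return out


db = [
    [{"a"}, {"b"}, {"c"}],
    [{"a", "b"}],
    [{"b"}, {"a"}, {"b"}],
]
bitmaps = build_bitmaps(db)
a_then_b = s_step(bitmaps["a"], bitmaps["b"])
support_count = sum(1 for bits in a_then_b if bits)
print(a_then_b, support_count)
```

Support counting reduces to checking which per-sequence bitmaps are non-zero, which is why the representation suits in-memory mining.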
(2) FP-growth based SPM
The FreeSpan algorithm uses a pattern-projection method for mining sequential patterns. It is the original approach that mines sequential patterns by recursively projecting the data sequences into smaller databases [12]. This work has been continued with PrefixSpan [22], whose projected databases contain the suffixes of the data-sequences from the original database, grouped by prefix.
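Prefix projection can be sketched in a few lines. The version below is a simplified illustration, assuming each transaction holds a single item and that support is given as an absolute count; the function name `prefixspan` and the toy data are hypothetical and do not reproduce the implementation evaluated in this paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def prefixspan(db: Sequence[Sequence[str]], min_count: int) -> Dict[Tuple[str, ...], int]:
    """Return {pattern: support count} for sequences of single items."""
    results: Dict[Tuple[str, ...], int] = {}

    def mine(prefix: Tuple[str, ...], projected: List[Tuple[str, ...]]) -> None:
        # Count each item at most once per projected suffix.
        counts: Dict[str, int] = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Projected database: what follows the first occurrence of `item`.
            next_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_projected)

    mine((), [tuple(s) for s in db])
    return results


patterns = prefixspan([["a", "b", "c"], ["a", "c"], ["b", "c"]], min_count=2)
print(patterns)
```

Each recursive call only scans suffixes that already contain the current prefix, so no candidate sequences are generated and tested against the full database.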
B. Limitation with existing approach and variation in RFM Model
The methods above rely purely on frequency, so rare but high-valued items, which are indeed important, are neglected. The concept of recency, frequency, and monetary (RFM) value was introduced by Bult and Wansbeek [6] and has proven very effective [4] when applied to marketing databases. An efficient Apriori-based algorithm has been proposed for finding RFM-based sequential patterns from customers' data-sequences [26]. Several researchers have considered RFM variables in developing prediction and classification models. A Bayesian network approach has been proposed that uses RFM variables to predict a customer's response to direct marketing [10]. Data-mining models for predicting customer loyalty [8] and customer lifetime value [11] have been developed. Many data-mining applications have been developed to discover useful customer and market information from data, such as product recommendation [17], e-retailing [9], and customer profiling [14][19]. To the best of our knowledge, however, this paper is the first to apply subjective measures in SPM. As discussed in the Introduction, identifying groups of emerging customers could be very important for a retailer, which motivates this research.
C. Motivation of research
The main aim of any business is to achieve maximum profit, so every businessman is interested in identifying the customers who help to accomplish this basic requirement. This fundamental reality motivates us to develop an approach that fully, or at least partially, helps a businessman to discover potential buying patterns and groups of potential customers to move the business forward.
III. NOVEL PROPOSAL
The problem of conventional SPM approaches in a business environment is investigated in Section 2. As discussed in Section 1, the patterns we want to discover in a business environment are not only recent, frequent and high-valued. We are looking for buying patterns that are really useful in business and that fulfil the fundamental aim of a businessman discussed in Section 2. According to this fundamental aim, a pattern should be a profit achiever.
A. Limitation of existing model and features of our proposal
Most existing SPM models, such as GSP, SPAM, SPADE, FreeSpan and PrefixSpan, work purely on frequency and therefore suffer from the rare item problem. The proposed approach works with other objective measures and constraints along with frequency.
The rare item problem is resolved by the RFM model [26]. However, extracting the items with the highest monetary value is not sufficient, because monetary value only gives total sales. How much one gains from total sales, which is known as profit, is more significant. Considering profit is always more meaningful than considering sales or monetary value, and it is introduced for the first time in our approach.
Almost all existing methods concentrate on the current scenario, but some patterns that fall slightly short of the support threshold have the potential to become strong in the future. A minor change in the support value brings such patterns to attention as potential buying patterns for tomorrow. Such Emerging Patterns (EPs) are a focus of our proposed approach.
In most existing work, customer segmentation is based on RFM parameters. To our knowledge, nobody has identified the potential customer group, which can be easily identified by the proposed approach.
To our knowledge, extraction of sequential patterns works on objective measures like support and confidence. Little work has been done in SPM with other measures such as lift, correlation, conviction and leverage. As per our survey, SPM with subjective measures is an almost untouched area that needs to be explored. The proposed approach focuses on subjective measures such as profitability, loyalty and simplicity.
In most research, customers who are recent, frequent and of high monetary value are considered loyal customers. But customer loyalty does not necessarily depend on high RFM, because in customer relationship management (CRM) a long-term customer is more important than a customer who has only recently started to purchase from the shop. Along with recency, active trade during a certain time span, which is known as compactness, is equally important. Some customers buy only household items and go elsewhere for high-budget items like electronics, jewellery, clothes, grocery etc. So it is also necessary to keep track of the customer's buying basket, which should be filled with diverse items. Our proposed approach
B. Formal definition
Base Algorithm: FP-growth-based PrefixSpan is chosen as the base algorithm for modification, because the theoretical study (Section 2) and the simulation study (Section 4) reveal that FP-growth-based PrefixSpan outperforms the Apriori-based GSP [25], SPAM [3] and SPADE [27] algorithms [15][22][21].
Representation of Data Sequence: A data-sequence A is represented as $\langle (A_1\, a_1(qty_1), t_1, m\_sold_1, m\_pur_1), (A_2\, a_2(qty_2), t_2, m\_sold_2, m\_pur_2), \ldots, (A_n\, a_n(qty_n), t_n, m\_sold_n, m\_pur_n) \rangle$, where $(A_j\, a_j(qty_j), t_j, m\_sold_j, m\_pur_j)$ means that item $a_j$ of type $A_j$ is purchased at time $t_j$ in quantity $qty_j$, with purchase price $m\_pur_j$ and sold price $m\_sold_j$, for $1 \le j \le n$, and $t_{j-1} \le t_j$ for $2 \le j \le n$. In the data-sequence, items that occur at the same time are ordered alphabetically.
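The data-sequence representation maps naturally onto a small record type. The sketch below is only one possible encoding: the class name `Purchase`, its field names and the sample values are assumptions made for illustration. It also shows the ordering rule (by time, then alphabetically by item).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Purchase:
    item_type: str   # A_j
    item: str        # a_j
    qty: int         # qty_j
    time: int        # t_j
    m_sold: float    # m_sold_j
    m_pur: float     # m_pur_j


def make_data_sequence(purchases: List[Purchase]) -> List[Purchase]:
    """Order by timestamp; items sharing a timestamp are ordered alphabetically."""
    return sorted(purchases, key=lambda p: (p.time, p.item))


# Hypothetical purchases of one customer, given out of order.
purchases = [
    Purchase("dairy", "milk", 2, 5, 3.0, 2.0),
    Purchase("bakery", "bread", 1, 5, 2.5, 1.5),
    Purchase("electronics", "radio", 1, 1, 40.0, 30.0),
]
sequence = make_data_sequence(purchases)
print([p.item for p in sequence])  # ['radio', 'bread', 'milk']
```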
Profitable Pattern: It is important for any business to understand which patterns are profitable in terms of money. Profit is derived indirectly from the monetary constraint with some changes. Profit depends on two parameters: purchase price and sold price.
The profit constraint requires that the total profit of the items in a sequence be at least a defined threshold value. The profit constraint is formally represented as follows:
$$C_{Profit} \equiv Profit(SS) \ge T_{Profit} \qquad (1)$$
A sequence $SS = \langle (q_1(qty_1), t_1, M\_Sold_1, M\_Pur_1), (q_2(qty_2), t_2, M\_Sold_2, M\_Pur_2), \ldots, (q_m(qty_m), t_m, M\_Sold_m, M\_Pur_m) \rangle$ satisfies the constraint with respect to a data-sequence $S$ only if (1) $SS$ is a subsequence of $S$, $SS \sqsubseteq S$, and (2) the items in $SS$ satisfy
$$\sum_{i=1}^{m} (M\_Sold_i - M\_Pur_i) \ge T_{Profit} \qquad (2)$$
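As a rough illustration of the profit constraint, the sketch below sums the margin (sold price minus purchase price) over the items of a pattern and compares it with a threshold. Whether quantities should weight the margins is not stated in the text, so they are left out here; the tuple layout, the function names and the sample prices are assumptions.

```python
from typing import List, Tuple

Entry = Tuple[str, float, float]  # (item, m_sold, m_pur), hypothetical layout


def total_profit(pattern: List[Entry]) -> float:
    """Sum of (sold price - purchase price) over all items in the pattern."""
    return sum(m_sold - m_pur for _item, m_sold, m_pur in pattern)


def satisfies_profit(pattern: List[Entry], t_profit: float) -> bool:
    """Profit constraint: total profit must reach the threshold T_Profit."""
    return total_profit(pattern) >= t_profit


pattern = [("tv", 500.0, 420.0), ("cable", 15.0, 5.0)]
print(total_profit(pattern))            # 90.0
print(satisfies_profit(pattern, 50.0))  # True
```

A pattern with high sales but thin margins can pass a monetary threshold and still fail this check, which is the distinction the profit constraint is meant to capture.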
Loyalty: Along with recency, frequency and monetary (RFM) value, a pattern should satisfy the compactness criterion. A constraint C for sequential pattern mining is a boolean function C(α) on the set of all sequences. The problem of constraint-based sequential pattern mining is to find the complete set of sequential patterns satisfying a given constraint C. The constraints are designed as follows [16]:
Total Monetary: The Total Monetary (TM) constraint requires that the total monetary value of the items in a sequence be at least a defined threshold value. The Total Monetary constraint is formally represented as follows:
$$C_{TM} \equiv TM(SS) \ge T_{TM} \qquad (3)$$
A sequence $SS = \langle (q_1(qty_1), t_1, M\_Sold_1, M\_Pur_1), (q_2(qty_2), t_2, M\_Sold_2, M\_Pur_2), \ldots, (q_m(qty_m), t_m, M\_Sold_m, M\_Pur_m) \rangle$ satisfies the constraint with respect to a data-sequence $S$ only if (1) $SS$ is a subsequence of $S$, $SS \sqsubseteq S$, and (2) the items in $SS$ satisfy
$$\sum_{i=1}^{m} M\_Sold_i \ge T_{TM} \qquad (4)$$
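Compactness: This is derived from the duration constraint. The time-stamp difference between the first and the last transactions in a sequential pattern must be longer or shorter than a given period. Formally, a duration constraint is of the form
$$C_{Comp} \equiv Dur(\alpha)\ \theta\ \Delta t \qquad (5)$$
where $\theta \in \{\le, \ge\}$ and $\Delta t$ is a given integer. A sequence $\alpha$ satisfies the constraint if and only if
$$|\{\beta \in SDB \mid \exists\, 1 \le i_1 < \cdots < i_{len(\alpha)} \le len(\beta) \text{ s.t. } \alpha[1] \sqsubseteq \beta[i_1], \ldots, \alpha[len(\alpha)] \sqsubseteq \beta[i_{len(\alpha)}], \text{ and } (\beta[i_{len(\alpha)}].time - \beta[i_1].time)\ \theta\ \Delta t\}| \ge min\_sup$$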
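The duration-based support count can be sketched by enumerating the embeddings of a pattern in each data-sequence and checking whether any of them has a time span that satisfies the comparison. The sketch assumes single-item transactions and a non-empty pattern; the names `spans` and `duration_support` and the sample database are hypothetical, and the exhaustive enumeration is meant for clarity, not efficiency.

```python
import operator
from typing import Callable, Iterator, List, Optional, Tuple

Event = Tuple[str, int]  # (item, timestamp); each sequence is sorted by time


def spans(pattern: List[str], seq: List[Event]) -> Iterator[int]:
    """Yield (last time - first time) for every embedding of `pattern` in `seq`."""
    def extend(p_idx: int, start: int, first_time: Optional[int]) -> Iterator[int]:
        for i in range(start, len(seq)):
            item, time = seq[i]
            if item != pattern[p_idx]:
                continue
            first = time if p_idx == 0 else first_time
            if p_idx == len(pattern) - 1:
                yield time - first
            else:
                yield from extend(p_idx + 1, i + 1, first)
    yield from extend(0, 0, None)


def duration_support(pattern: List[str], sdb: List[List[Event]],
                     theta: Callable[[int, int], bool], delta_t: int) -> int:
    """Number of data-sequences with at least one embedding whose span theta delta_t."""
    return sum(any(theta(s, delta_t) for s in spans(pattern, seq)) for seq in sdb)


sdb = [
    [("a", 1), ("b", 3), ("b", 10)],
    [("a", 2), ("b", 20)],
    [("b", 1), ("a", 5)],
]
print(duration_support(["a", "b"], sdb, operator.le, 5))  # span of at most 5
print(duration_support(["a", "b"], sdb, operator.ge, 5))  # span of at least 5
```

The pattern satisfies the constraint when the returned count reaches the minimum support.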
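Recency: Sequential patterns in the sequence database must have the property that the last timestamp of the sequence is later or earlier than a given recency count.
Formally, a recency constraint is of the form
$$C_{recency} \equiv Rec(\alpha)\ \theta\ \Delta t \qquad (6)$$
where $\theta \in \{\le, \ge\}$ and $\Delta t$ is a given integer. A sequence $\alpha$ satisfies the constraint if and only if
$$|\{\beta \in SDB \mid \exists\, 1 \le i_1 < \cdots < i_{len(\alpha)} \le len(\beta) \text{ s.t. } \alpha[1] \sqsubseteq \beta[i_1], \ldots, \alpha[len(\alpha)] \sqsubseteq \beta[i_{len(\alpha)}], \text{ and } \beta[i_{len(\alpha)}].time\ \theta\ \Delta t\}| \ge min\_sup$$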
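One way to read the recency constraint is that the pattern must have an occurrence ending no earlier than a given timestamp. The sketch below follows that reading, which is an assumption, and uses single-item transactions; `latest_end_time` and `is_recent` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, int]  # (item, timestamp); the sequence is sorted by time


def latest_end_time(pattern: List[str], seq: List[Event]) -> Optional[int]:
    """Latest timestamp at which an embedding of `pattern` in `seq` can end."""
    pos = 0
    # Match all but the last item as early as possible, leaving the most room.
    for item in pattern[:-1]:
        while pos < len(seq) and seq[pos][0] != item:
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    ends = [time for item, time in seq[pos:] if item == pattern[-1]]
    return max(ends) if ends else None


def is_recent(pattern: List[str], seq: List[Event], min_last_time: int) -> bool:
    """Recency check: some occurrence of the pattern ends at or after min_last_time."""
    end = latest_end_time(pattern, seq)
    return end is not None and end >= min_last_time


seq = [("a", 1), ("b", 3), ("b", 10)]
print(latest_end_time(["a", "b"], seq))  # 10
print(is_recent(["a", "b"], seq, 8))     # True
```

Counting the data-sequences for which `is_recent` holds and comparing the count with the minimum support gives the constrained support.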
C. Evaluation of association rules
Limitation of existing SPM algorithms w.r.t. objective measures: SPM algorithms use support and confidence thresholds as objective parameters, which leads to a huge number of rules that may not really be interesting to the user.
Generated rules are valid only if they satisfy some evaluation measures, so an evaluation process is needed to assess the interestingness of each rule. In our approach, we propose to evaluate the interestingness and relevance of mined rules with the following measures [5][24][18], defined for itemsets A and B and a rule X: A → B (refer to Table I):
TABLE I: COMPREHENSIVE STUDY OF MEASURES
| Measure | Mathematical Formula | Working Understanding |
|---|---|---|
| Lift | $lift(X) = \dfrac{P(A \cap B)}{P(A)\,P(B)}$ | Compares the probability of having B when A occurs with the baseline probability of B. High value: stronger associations. Low value: weak associations. |
| Loevinger | $Loevinger(X) = 1 - \dfrac{P(A \cap \neg B)}{P(A)\,P(\neg B)}$ | It normalizes the centred confidence of a rule according to the probability of not satisfying its consequent part B. High value: stronger associations. Low value: weak associations. |
| Conviction | $conviction(X) = \dfrac{1 - supp(B)}{1 - conf(A \rightarrow B)}$ | It is interpreted as the ratio of the expected frequency that A occurs without B (incorrect prediction), if A and B were independent, to the observed frequency of incorrect predictions. It attempts to measure the degree of implication of a rule. |
| Leverage | $leverage(A \rightarrow B) = P(A \cap B) - P(A)\,P(B)$ | It measures the difference between the observed co-occurrence of the antecedent and consequent of the rule and the value expected under independence. It shows how many more units (items A and B together) are sold than expected from independent sales. |
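The four measures in Table I can be computed directly from transaction counts. The sketch below uses the standard definitions of lift, Loevinger, conviction and leverage; the function name `rule_measures`, its parameters and the sample counts are hypothetical, and it assumes that A occurs at least once and that B does not occur in every transaction.

```python
from typing import Dict


def rule_measures(n: int, n_a: int, n_b: int, n_ab: int) -> Dict[str, float]:
    """Measures for a rule A -> B from counts: all transactions (n), those
    containing A (n_a), those containing B (n_b), and those containing both (n_ab)."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    confidence = n_ab / n_a
    p_a_not_b = p_a - p_ab  # A occurs without B
    return {
        "confidence": confidence,
        "lift": p_ab / (p_a * p_b),
        "loevinger": 1 - p_a_not_b / (p_a * (1 - p_b)),
        # Conviction is unbounded when the rule is never violated.
        "conviction": float("inf") if n_ab == n_a else (1 - p_b) / (1 - confidence),
        "leverage": p_ab - p_a * p_b,
    }


m = rule_measures(n=100, n_a=40, n_b=50, n_ab=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the rule has confidence 0.75, lift 1.5, Loevinger 0.5, conviction 2.0 and leverage 0.1, so all four measures agree that A and B are positively associated.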
IV. SIMULATION STUDY
We performed a simulation study on secondary real-world datasets. The SPM algorithms were implemented in Java and tested on an Intel Core Duo processor with 2 GB of main memory under the Windows XP operating system.
A. Simulation study of existing SPM techniques
We performed a simulation study to compare the performance of the GSP, SPADE, SPAM and PrefixSpan algorithms. The comparison is based on runtime, number of frequent sequential patterns, and memory utilization at various support thresholds (10% to 60%). The experiments were run on the Java-based SPMF framework (Sequential Pattern Mining Framework, http://www.philippe-fournier-viger.com/spmf), designed by Philippe Fournier-Viger, on the real-world mushroom dataset.
Comparing the sequential pattern mining algorithms, the following points can be observed from the above simulation:
GSP and SPADE take approximately 49% and 24% more execution time than PrefixSpan, respectively, while SPAM takes 18% less execution time to generate sequential patterns (refer to Fig. 1).
Almost the same frequent sequences are generated at support thresholds of 50% and above. SPAM and PrefixSpan generate the same sequences in all cases, while GSP and SPADE generate 10% and 11% more sequences, respectively (refer to Fig. 2).
GSP and SPADE occupy comparatively less memory than PrefixSpan, and SPAM occupies 11% less memory than PrefixSpan (refer to Fig. 3).
Fig. 1: Memory vs. support
Fig. 2: Number of patterns vs. support
Fig. 3: Execution time vs. support
B. Rules generation for various Measures
The performance of the various measures is based on results obtained using WEKA on the real-world contact lenses dataset. We ranked the rules with respect to each measure and also observed the values of the other measures for each rule (refer to Fig. 4, 5 and 6).
The performance of the various measures was also evaluated using WEKA on the real-life supermarket dataset with the FP-Growth method (refer to Table II and Fig. 7).
The following observations can be made from the above experiments:
Leverage is highly associated with the confidence value. The top five rules in the lift list have the highest confidence value (1). Conviction values are also in almost decreasing order.
The rule 'tear-prod-rate=reduced ==> contact-lenses=none', with a conviction value of 4.5, is at the top of both the leverage-ordered and the conviction-ordered tables. It also has a confidence value of 1, which is the highest in the confidence list.
Fig. 4: Top 10 rules with respect to conviction
Fig. 5: Top 10 rules with respect to lift
Fig. 6: Top 10 rules with respect to leverage
The confidence measure generates only about 0.25% of the number of rules generated by lift or conviction. Lift and conviction give a vast range of rules, so the decision maker can observe all possible associations, which is useful in some applications. The confidence measure gives a narrow range of rules and emphasises strong rules.
TABLE II: NUMBER OF RULES GENERATED FOR VARIOUS MEASURES
| Measure | Generated Rules |
|---|---|
| Lift | 181292 |
| Confidence | 455 |
| Conviction | 181291 |
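The 0.25% figure quoted for the confidence measure can be checked against the counts in Table II. The ratio notation below is ours and only restates the table values.

```latex
\frac{N_{\text{confidence}}}{N_{\text{lift}}} = \frac{455}{181292} \approx 0.0025 = 0.25\%
```

The same ratio taken against the conviction count (181291) is indistinguishable at this precision.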
C. Simulation study of Emerging Patterns for various threshold values
Some patterns are not strong currently because their support is slightly below the threshold, but they have the potential to become strong. By relaxing the support threshold, the patterns that lie on the boundary can be selected. Here we experimented by lowering the threshold by 1% and 2% (refer to Fig. 8, Fig. 9 and Fig. 10).
Fig. 7: Number of rules generated for various measures (refer to Table II)
Fig. 8: Execution time for various support values
Fig. 9: Frequent sequences for various support values
Fig. 10: Memory occupied vs. support
Lowering the boundary threshold by 1% and 2% from a support threshold of 30% finds 13% and 27% more patterns, which are potential but not yet discovered. In the same way, 23% and 12% more potential patterns are found for 20% support. Discovering more patterns takes 13.5% and 17% more execution time for 1% and 2% lower boundary values, respectively. Memory consumption is almost the same, differing by 0.1% to 1%.
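The boundary selection used in this experiment can be sketched as a filter over already-computed supports: a pattern is treated as emerging when its support falls just below the minimum support but within the relaxed margin. The function name `boundary_patterns`, the pattern labels and the support values (in percent) are hypothetical.

```python
from typing import Dict


def boundary_patterns(supports: Dict[str, int], min_sup: int, relax: int) -> Dict[str, int]:
    """Patterns whose support (in percent) lies in [min_sup - relax, min_sup)."""
    return {p: s for p, s in supports.items() if min_sup - relax <= s < min_sup}


# Hypothetical supports in percent for four sequential patterns.
supports = {"a>b": 31, "a>c": 29, "b>c": 28, "c>d": 20}
print(boundary_patterns(supports, min_sup=30, relax=1))  # {'a>c': 29}
print(boundary_patterns(supports, min_sup=30, relax=2))  # {'a>c': 29, 'b>c': 28}
```

Widening the margin from 1% to 2% admits more candidates, which is consistent with the larger pattern counts and longer runtimes reported above.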
D. Simulation for time span window
Here we examine how a specific duration span yields user-specified interesting patterns. We ran the experiment with the conventional Apriori method and the time-window-based Apriori method.
The conventional Apriori-based method generates more sequential rules than the time-window-based Apriori algorithm (refer to Fig. 11). Fewer sequential patterns are generated when the time window size is reduced from 4 to 2 (refer to Fig. 12). Conventional PrefixSpan takes more execution time than the time-window-based method (refer to Fig. 13).
Fig. 11: Support vs. sequential patterns
Fig. 12: Support vs. execution time (ms)
Fig. 13: Support vs. memory (MB)
V. CONCLUSION
Comparatively little work has been done in the area of emerging customers. Most researchers have focused either on frequency alone or on Recency, Frequency and Monetary (RFM) value as evaluation parameters for SPM and customer evolution, which are not sufficient. Here we have evaluated more vital parameters that are essential for the classification of customers. In our approach, new-generation customers are identified based on subjective measures such as profitability and loyalty within SPM. The technique recognizes next-generation customers with the help of PrefixSpan-based Emerging Patterns (EPs) in sequential mining.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. 1994 Int'l Conf. Very Large Data Bases (VLDB '94), pp. 487-499, Sept. 1994.
[2] R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proc. 11th Int'l Conf. Data Engineering, Taipei, Taiwan, March 1995.
[3] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, "Sequential Pattern Mining Using a Bitmap Representation", Proc. 8th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2002.
[4] R. C. Blattberg, B.-D. Kim, and S. A. Neslin, Database Marketing: Analyzing and Managing Customers, Chapter 12, pp. 323-337, Springer, New York, USA, 2008, ISBN: 978-0387725789.
[5] T. Brijs, K. Vanhoof, and G. Wets, "Defining Interestingness for Association Rules", International Journal of Information Theories and Applications, 10(4), pp. 370-376, 2003.
[6] J. R. Bult and T. J. Wansbeek, "Optimal Selection for Direct Mail", Marketing Science, 14(4), pp. 378-394, 1995.
[7] C. K. Bhensdadia and Y. P. Kosta, "An Efficient Algorithm for Mining Frequent Sequential Patterns and Emerging Patterns with Various Constraints", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 1, Issue 6, January 2012.
[8] C. H. Cheng and Y. S. Chen, "Classifying the Segmentation of Customer Value via RFM Model and RS Theory", Expert Systems with Applications, 36(3), pp. 4176-4184, 2009.
[9] Y. L. Chen, K. Tang, R. J. Shen, and Y. H. Hu, "Market Basket Analysis in a Multiple-Store Environment", Decision Support Systems, 40(2), pp. 339-354, 2005.
[10] G. Cui, M. L. Wong, and H. K. Lui, "Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming", Management Science, 52(4), pp. 597-612, 2006.
[11] O. Etzion, A. Fisher, and S. Wasserkrug, "e-CLV: A Modeling Approach for Customer Lifetime Evaluation in e-Commerce Domains, with an Application and Case Study for Online Auction", Information Systems Frontiers, 7(4-5), pp. 421-434, 2005.
[12] J. Han, G. Dong, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", Proc. 2000 Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 355-359, 2000.
[13] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. 2000 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, May 2000.
[14] H. L. Hu and Y. L. Chen, "Mining Typical Patterns from Databases", Information Sciences, 178(19), pp. 3683-3696, 2008.
[15] I. Khan and A. Jain, "Comprehensive Survey on Sequential Pattern Mining", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 4, June 2012.
[16] J. Pei, J. Han, and W. Wang, "Constraint-Based Sequential Pattern Mining: The Pattern-Growth Methods", Journal of Intelligent Information Systems, 28(2), pp. 133-160, 2007.
[17] R. D. Lawrence, G. S. Almasi, V. Kotlyar, M. S. Viveros, and S. S. Duri, "Personalization of Supermarket Product Recommendations", Data Mining and Knowledge Discovery, 5(1-2), pp. 11-32, 2001.
[18] L. Geng and H. J. Hamilton, "Interestingness Measures for Data Mining: A Survey", ACM Computing Surveys, 38(3), Article 9, September 2006.
[19] I. Mahdavi, N. Cho, B. Shirazi, and N. Sahebjamnia, "Designing Evolving User Profile in e-CRM with Dynamic Clustering of Web Documents", Data and Knowledge Engineering, 65(2), pp. 355-372, 2008.
[20] M.-S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 866-883, December 1996.
[21] N. Desai and A. Ganatra, "Sequential Pattern Mining Methods: A Snap Shot", IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 10, Issue 4, pp. 12-20, Mar.-Apr. 2013.
[22] J. Pei, J. Han, B. Mortazavi-Asl, and H. Pinto, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", ICDE '01, 2001.
[23] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. SIGMOD '93, pp. 207-216, May 1993.
[24] S. Ramaswamy, S. Mahajan, and A. Silberschatz, "On the Discovery of Interesting Patterns in Association Rules", Proc. 24th Int'l Conf. Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 368-379, 1998.
[25] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", Proc. 5th Int'l Conf. Extending Database Technology, vol. 1057, pp. 3-17, 1996.
[26] Y.-L. Chen, M.-H. Kuo, S.-Y. Wu, and K. Tang, "Discovering Recency, Frequency, and Monetary (RFM) Sequential Patterns from Customers' Purchasing Data", Electronic Commerce Research and Applications, 8, pp. 241-251, 2009.
[27] M. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences", Machine Learning, vol. 42, pp. 31-60, 2001.