1. INDIAN INSTITUTE OF TECHNOLOGY ROORKEE
A Study of Scalable Pattern Mining Algorithms
on Large Scale Interval Data
Under Supervision Of:
Dr. Dhaval Patel
CSE Department
Presented By:
Prakhar Dhama
15535029
2. 2
Outline
• What is Pattern Mining?
• Need for Scalable Pattern Mining
• Interval-based Events
• Serial Frequent Itemset Mining
– Apriori, Eclat and FP-growth
• Parallel Itemset Mining
– FP-growth based PFP
– Ultrametric tree based FiDoop
• Pattern Mining on Interval Data
– Interval Sequences
– Temporal Relations
– Hierarchical Representation
• Conclusion and Research Gap
3. 3
What is Pattern Mining?
• A pattern can be a set of items, ordered subsequences,
subgraphs, etc.
• Different kinds of pattern mining are:
– Frequent itemset mining: finding sets of items that frequently appear
together in a transactional database, such as milk and bread.
– Sequential pattern mining: finding frequently occurring
subsequences in a sequence database, such as a customer buying
pattern of a digital camera followed by a memory card.
– Structured pattern mining: finding frequent substructures, such as
graphs, trees, or lattices, in a spatial database.
– Temporal pattern mining: finding relations among events in a
temporal database, such as the time for which an iron is on and the
time for which its steel base is hot.
4. 4
Need for Scalable Pattern Mining
• The NSA Utah Data Center stores data on the order of exabytes.
In 2014, the NSA processed 29 petabytes of data in a single day.
• With the huge increase in data size, pattern mining on a single
machine is infeasible.
• Solution: modify existing pattern mining algorithms and
design scalable versions that can run on distributed
systems.
• Parallel programming models include MapReduce, Bulk
Synchronous Parallel, etc.
• Popular big data tools for implementing parallel
algorithms include Apache Spark, Hadoop, and NoSQL databases
such as Cassandra and MongoDB.
5. 5
Interval-based Events Data
• Real-world events are often not instantaneous; they persist
for some duration and are called interval events.
• Data with time-related attributes is stored in a temporal
database.
• The relations among interval events are intrinsically
complex, so point-based algorithms are not applicable.
• Applications
– A household power meter that logs appliance electricity usage
can be used to identify the times at which each appliance is turned on or off.
– In diabetic patients, the presence of hyperglycemia has been
observed to overlap with the absence of glycosuria.
– Domains such as medicine, multimedia, meteorology, and finance,
where event durations can play an important role.
6. 6
Large Scale Interval Data
• Querying vs. mining: the purpose of mining is to discover
knowledge, while database querying simply retrieves data.
• The only work that deals with large-scale interval data
addresses querying and quantitative analysis[1].
• All current efforts on mining temporal relationships rely
on sequential (single-machine) algorithms; the problem of scalable
mining on large-scale interval data has not yet been addressed.
• Solution: design a novel strategy to mine temporal patterns
on large-scale interval data by combining
– Existing parallel mining algorithms for point-based events.
– Sequential pattern mining algorithms on interval data.
7. 7
Serial Frequent Itemset Mining Methods
• Mining frequent itemsets is the first step; it is followed by
a second step that generates inter-transaction association rules.
• Apriori. It uses a breadth-first strategy to count the support of
itemsets and a candidate generation function that
exploits the downward closure property of support.
• Eclat. Equivalence Class Transformation is a depth-first
algorithm. It converts the transactional database to its
vertical format, i.e. a transaction list for each item, and then
uses set intersection to compute support.
• FP-growth. It does not generate candidates; instead it
uses a prefix-tree structure, the FP-tree. It makes two passes over
the data set and recursively traverses the FP-tree for each
item in an itemset.
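The contrast between Apriori's level-wise search and Eclat's vertical format can be sketched in a few lines. This is a minimal illustration rather than a tuned implementation; the function names `apriori` and `vertical_format` and the toy database `db` are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Breadth-first frequent itemset mining; returns {frozenset: support}."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (downward closure).
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting: one pass over the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

def vertical_format(transactions):
    """Eclat-style vertical layout: item -> set of transaction ids."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

# Hypothetical toy database.
db = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}, {"milk", "eggs"}]
freq = apriori(db, min_support=2)
tids = vertical_format(db)
# Support of {milk, bread} via tid-list intersection, as Eclat would compute it.
milk_bread_support = len(tids["milk"] & tids["bread"])
```

Apriori rescans the database once per level, whereas the vertical format lets the same support be read off a single set intersection.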
8. 8
Parallel Itemset Mining
• Apriori-like parallel FIM algorithms include FDM, DDM,
FPM, and the MapReduce-based DPC[2].
• Apriori-like solutions suffer from high I/O,
communication, and synchronization overhead, which makes
these parallel algorithms hard to scale up.
• The most recent Eclat-like parallel algorithms include Dist-Eclat
and BigFIM[3].
• FP-growth-like parallel FIM algorithms include the shared-memory-based
cache-conscious FP-growth and the most
popular one, the MapReduce-based PFP[4].
• Ultrametric-tree-based FIUT[5] and FiDoop[6].
• Others include the recent lexicographical-tree-based Sequence
Growth[7].
9. 9
PFP algorithm
• A popular MapReduce-based parallel version of FP-growth.
• Includes three MapReduce phases:
Sharding and Parallel Counting → Group-dependent Shard FP-growth → Aggregation
• Phase 1. Sharding divides the database into
consecutive parts and stores them on different
machines. Parallel Counting runs a MapReduce task
that counts the support of the items. Each mapper
works on a single shard.
• Phase 2. The frequent items are divided into groups.
With the group id as key, the mapper outputs the list
of transaction ids. The reducer then builds an FP-tree
for each group.
• Phase 3. For each item, the corresponding frequent
patterns are listed, and the required number of
most-supported patterns is reported.
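The first two phases can be simulated on a single machine to make the data flow concrete. The sketch below is a plain-Python approximation, not Hadoop code. It assumes the mapper emits group-dependent transactions (the prefix of the frequency-ordered transaction ending at a group's last item), and it omits the reducer's FP-tree construction and the final aggregation. All function names, the round-robin grouping, and the toy database are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def shard(db, n_shards):
    """Sharding: split the database into consecutive parts."""
    size = -(-len(db) // n_shards)  # ceiling division
    return [db[i:i + size] for i in range(0, len(db), size)]

def count_mapper(shard_part):
    """Parallel Counting, map side: emit (item, 1) for every item occurrence."""
    for transaction in shard_part:
        for item in transaction:
            yield item, 1

def count_reducer(pairs):
    """Parallel Counting, reduce side: sum the counts per item."""
    totals = Counter()
    for item, one in pairs:
        totals[item] += one
    return totals

def group_mapper(transaction, rank, group_of):
    """Phase 2, map side: emit (group id, group-dependent transaction)."""
    # Keep frequent items only, ordered by descending global support.
    items = sorted((i for i in transaction if i in rank), key=rank.get)
    seen = set()
    for j in range(len(items) - 1, -1, -1):
        gid = group_of[items[j]]
        if gid not in seen:
            seen.add(gid)
            yield gid, items[: j + 1]

# Hypothetical toy database and parameters.
db = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d", "e"], ["b", "c"]]
min_support, n_groups = 2, 2

# Phase 1: each mapper works on a single shard; here they run one after another.
pairs = [p for part in shard(db, 2) for p in count_mapper(part)]
support = count_reducer(pairs)
frequent = sorted((i for i in support if support[i] >= min_support),
                  key=lambda i: (-support[i], i))
rank = {item: r for r, item in enumerate(frequent)}
group_of = {item: r % n_groups for item, r in rank.items()}  # assumed grouping rule

# Phase 2: shuffle the mapper output by group id (this is the reducer input).
groups = defaultdict(list)
for transaction in db:
    for gid, dependent in group_mapper(transaction, rank, group_of):
        groups[gid].append(dependent)
```

Each entry of `groups` is what one reducer would receive; a real reducer would build a local FP-tree from that list and mine it independently of the other groups.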
10. 10
FiDoop Algorithm
• One of the recent parallel FIM algorithms; it outperforms Apriori-
like solutions as well as the FP-growth-based PFP.
• Based on the ultrametric tree, extending FIUT.
• A k-FIU-tree is built by placing every frequent itemset of length k
on a single path from the root to the last item of the itemset.
Hence, all the leaves are at the same height k.
• Example. Itemsets with their counts:
abc 1
abd 2
acde 3
The 3-FIU-tree holds the two 3-itemsets:
root
  a
    b
      c:1  d:2
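The tree in the example can be reproduced with nested dictionaries. This is a minimal sketch of the insertion step only; the function name `build_fiu_tree` and the dictionary layout are assumptions, and the later decomposition of longer itemsets is left out.

```python
def build_fiu_tree(itemsets_with_counts, k):
    """Build a k-FIU-tree as nested dicts; leaves hold the itemset count."""
    root = {}
    for itemset, count in itemsets_with_counts:
        if len(itemset) != k:
            continue  # only k-itemsets are inserted into the k-FIU-tree
        node = root
        for item in itemset[:-1]:
            node = node.setdefault(item, {})
        # The last item is a leaf; identical itemsets accumulate their counts.
        node[itemset[-1]] = node.get(itemset[-1], 0) + count
    return root

# The three itemsets of the example, each a string of single-letter items.
itemsets = [("abc", 1), ("abd", 2), ("acde", 3)]
tree3 = build_fiu_tree(itemsets, k=3)
tree4 = build_fiu_tree(itemsets, k=4)
```

Because every inserted itemset has exactly k items, every leaf sits at depth k, which is the property stated above.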
11. 11
FiDoop Design
• Uses three MapReduce phases, like PFP.
• First MapReduce job: discovers all frequent items, i.e.
frequent one-itemsets.
• Second MapReduce job: scans the database and generates k-
itemsets by removing the infrequent items from each transaction.
• Third MapReduce job: constructs the decomposed h-FIU-trees,
2≤h≤k-1, and mines all frequent h-itemsets.
• Key-value types flowing through the three MapReduce jobs:
– Input transaction: <LongWritable offset, Text record>
– Global one-itemset: <Text item, LongWritable count>
– Pruned transaction of k-itemset: <ArrayWritable k-item, LongWritable 1>
– <IntWritable id, MapWritable<ArrayWritable k-item, LongWritable SUM>>
– Frequent h-itemset from h-FIU-tree
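The first two jobs can be imitated in plain Python to show what the key-value pairs carry. This is a single-machine sketch, not Hadoop code. It assumes that each record is a whitespace-separated list of items and that the integer id in the grouped output is the itemset length k. The function names and the toy records are hypothetical.

```python
from collections import Counter

def job1_frequent_items(db, min_support):
    """First job: global one-itemset counts, filtered by minimum support."""
    counts = Counter(item for record in db for item in set(record.split()))
    return {item: c for item, c in counts.items() if c >= min_support}

def job2_pruned_itemsets(db, frequent_items):
    """Second job: prune infrequent items, then sum identical k-itemsets."""
    sums = Counter()
    for record in db:
        kept = tuple(sorted(i for i in set(record.split()) if i in frequent_items))
        if kept:
            sums[kept] += 1  # corresponds to emitting <k-itemset, 1>
    # Group by itemset length k (assumed meaning of the integer id).
    by_length = {}
    for itemset, total in sums.items():
        by_length.setdefault(len(itemset), {})[itemset] = total
    return by_length

# Hypothetical input records.
db = ["a b c", "a b d", "a b c", "c e"]
frequent_items = job1_frequent_items(db, min_support=2)
pruned = job2_pruned_itemsets(db, frequent_items)
```

The per-length maps in `pruned` are the input from which a third job would build and mine the FIU-trees.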
12. 12
Pattern Mining on Interval Data
• Various algorithms have been proposed to discover temporal
patterns on interval data.
• Apriori-like. H-DFS[8]: transforms event sequences into id-
lists and merges the id-lists iteratively. IEMiner[9]: reduces the
search space and removes non-promising candidates.
• Pattern-growth. TPrefixSpan[10]: generates all possible
candidates, then scans the projected database recursively to
discover temporal patterns. TPMiner[11]: based on database
projection and includes several pruning
techniques to reduce the search space.
13. 13
Interval Sequences
• A temporal database handles time-stamped data. It stores all
the interval sequences.
• An interval sequence is a collection of intervals, each
having a start time and an end time.
• Example
Db contains 4 interval sequences.
Let minimum support = 3.
The temporal pattern (C=D) is frequent
with support 4.
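An interval sequence maps naturally onto a list of labelled intervals, and the support of a pattern is the number of sequences that contain it. In the sketch below the database contents are hypothetical, chosen only so that the pattern C=D reaches support 4 as in the example. The names `Interval`, `contains_equal` and `support` are likewise assumptions.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    label: str
    start: int
    end: int

def contains_equal(sequence, x, y):
    """True if the sequence has an x-interval and a y-interval with equal endpoints."""
    return any(
        a.label == x and b.label == y and a.start == b.start and a.end == b.end
        for a in sequence
        for b in sequence
    )

def support(db, x, y):
    """Number of interval sequences in db that contain the pattern x = y."""
    return sum(1 for sequence in db if contains_equal(sequence, x, y))

# Hypothetical database of 4 interval sequences.
db = [
    [Interval("A", 1, 4), Interval("C", 5, 9), Interval("D", 5, 9)],
    [Interval("C", 2, 6), Interval("D", 2, 6), Interval("B", 7, 10)],
    [Interval("C", 1, 3), Interval("D", 1, 3)],
    [Interval("A", 0, 2), Interval("C", 4, 8), Interval("D", 4, 8), Interval("B", 8, 11)],
]
min_support = 3
is_frequent = support(db, "C", "D") >= min_support
```

A sequence is counted at most once no matter how many times the pattern occurs inside it.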
14. 14
Temporal Relations
• Most pattern mining on interval data is based on the 13
relations among temporal events proposed by Allen.
• The relations between two interval events X and Y are shown
below.
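The 13 relations follow from comparing the endpoints of the two intervals. The function below is a small sketch that assumes each interval is a (start, end) pair with start < end; the relation names are the usual English ones and the function name `allen_relation` is an assumption.

```python
def allen_relation(x, y):
    """Return the Allen relation of interval x to interval y."""
    xs, xe = x
    ys, ye = y
    if xe < ys:
        return "before"
    if xe == ys:
        return "meets"
    if ye < xs:
        return "after"
    if ye == xs:
        return "met-by"
    if xs == ys and xe == ye:
        return "equals"
    if xs == ys:
        return "starts" if xe < ye else "started-by"
    if xe == ye:
        return "finishes" if xs > ys else "finished-by"
    if ys < xs and xe < ye:
        return "during"
    if xs < ys and ye < xe:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if xs < ys else "overlapped-by"

# One example pair per relation.
examples = [
    ((1, 3), (4, 6)), ((1, 4), (4, 6)), ((4, 6), (1, 3)), ((4, 6), (1, 4)),
    ((1, 5), (1, 5)), ((1, 3), (1, 5)), ((1, 5), (1, 3)), ((2, 5), (1, 5)),
    ((1, 5), (2, 5)), ((2, 3), (1, 5)), ((1, 5), (2, 3)), ((1, 5), (3, 8)),
    ((3, 8), (1, 5)),
]
relations = [allen_relation(x, y) for x, y in examples]
```

Six of the relations are inverses of six others, and "equals" is its own inverse, which gives the total of 13.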
15. 15
Hierarchical Representation
• The representation should be lossless, i.e. the arrangement of the events
can be recovered unambiguously from it; otherwise spurious frequent
patterns may be discovered.
• Lossless Hierarchical Representation
Various interpretations of the temporal pattern (P o Q) o R (interval diagrams not reproduced):
a. Overlap count w.r.t. R = 1, meet count w.r.t. R = 0
b. Overlap count w.r.t. R = 2, meet count w.r.t. R = 0
c. Overlap count w.r.t. R = 1, meet count w.r.t. R = 1
• IEMiner uses 5 variables to distinguish the above interpretations: contain count, finish
count, meet count, overlap count, and start count, in that order.
a. (P o[0,0,0,1,0] Q) o[0,0,0,1,0] R
b. (P o[0,0,0,1,0] Q) o[0,0,0,2,0] R
c. (P o[0,0,0,1,0] Q) o[0,0,1,1,0] R
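The count vectors can be derived by comparing each event of the left-hand composite with the event on the right. The sketch below does this for three hypothetical sets of interval times chosen to reproduce cases a, b and c. The endpoint conditions used for each relation are an assumption about how the counts are defined, and `relation_counts` is an illustrative name, not part of IEMiner.

```python
def relation_counts(left_events, right):
    """Return [contain, finish, meet, overlap, start] counts against `right`."""
    counts = [0, 0, 0, 0, 0]
    rs, re = right
    for s, e in left_events:
        if s < rs and e > re:
            counts[0] += 1  # left event contains the right event
        elif s < rs and e == re:
            counts[1] += 1  # both end together (finish)
        elif e == rs:
            counts[2] += 1  # left event meets the right event
        elif s < rs < e < re:
            counts[3] += 1  # left event overlaps the right event
        elif s == rs and e < re:
            counts[4] += 1  # both start together (start)
        # A left event that ends before `right` starts adds to no counter.
    return counts

# Hypothetical (start, end) times for P, Q and R in the three arrangements.
cases = {
    "a": ((0, 5), (3, 8), (6, 12)),  # P before R, Q overlaps R
    "b": ((0, 7), (3, 9), (6, 12)),  # P overlaps R, Q overlaps R
    "c": ((0, 6), (3, 8), (6, 12)),  # P meets R, Q overlaps R
}
vectors = {
    name: (relation_counts([p], q), relation_counts([p, q], r))
    for name, (p, q, r) in cases.items()
}
```

In all three cases P overlaps Q, so the first vector is identical; only the second vector, computed against R, tells the arrangements apart.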
16. 16
Conclusion
• The classic mining algorithms have been modified to run in a
distributed manner on a cluster. Although much work is still
under way on pattern mining in interval data, to the best
of my knowledge one issue has not been addressed anywhere.
• All current pattern mining algorithms on interval-based
events are sequential in nature. They cannot scale to large
data sets that do not fit in the memory of a single machine. The
various parallel techniques for mining frequent patterns in
instantaneous events, together with the current sequential techniques on
interval data, can help in addressing this issue.
17. 17
References
[1] Ruan, Guangchen, et al. 2014. Parallel and quantitative sequential pattern
mining for large-scale interval-based temporal data. IEEE International
Conference on Big Data.
[2] Lin, Hsueh, et al. 2012. Apriori-based frequent itemset mining algorithms on
MapReduce. In Proceedings of the 6th International Conference on
Ubiquitous Information Management and Communication.
[3] Moens, Aksehirli, et al. 2013. Frequent itemset mining for big data. IEEE
International Conference on Big Data.
[4] Li, Haoyuan, et al. 2008. Pfp: parallel fp-growth for query recommendation.
Proceedings of the ACM conference on Recommender systems.
[5] Tsay, Yuh-Jiuan, et al. 2009. FIUT: A new method for mining frequent
itemsets. Information Sciences.
[6] Xun, Yaling, et al. 2015. FiDoop: Parallel Mining of Frequent Itemsets Using
MapReduce. IEEE Transactions on Systems, Man, and Cybernetics.
[7] Liang, Yen-Hui, et al. 2015. Sequence-Growth: A Scalable and Effective
Frequent Itemset Mining Algorithm for Big Data Based on MapReduce
Framework. IEEE International Conference on Big Data.
[8] Papapetrou, Panagiotis, et al. 2005. Discovering frequent arrangements of
temporal intervals. Proceedings of Fifth IEEE International Conference on
Data Mining.
[9] Patel, et al. 2008. Mining relationships among interval events for
classification. Proceedings of the ACM SIGMOD international conference
on Management of data.
[10] Wu, Chen, et al. 2007. Mining nonambiguous temporal patterns for
interval-based events. IEEE Transactions on Knowledge and Data
Engineering.
[11] Chen, Yi-Cheng, et al. 2015. Mining Temporal Patterns in Time Interval-
based Data. IEEE Transactions on Knowledge and Data Engineering.