5. 基本概念 : 频繁模式和关联规
则
Itemset X={x1, …, xk}
Transaction-id Items bought 找出满足最先小支持度和置信度
10 A, B, C
的所有规则 X Y
20 A, C 支持度 , s, 事务包含 X∪Y 的
30 A, D 概率
40 B, E, F support(X Y ) =P(X∪Y )
顾客购 顾客购买 置信度 , c, 包含 X 事务同时
买二者 尿布 也包含 Y 的条件概率 . (百
分比)
confidence(X Y )=P(Y|X)
min_support = 50%,
设 频繁模式 , 满足最小支持度的
min_conf = 50%:
模式。
顾客购买
A C (50%, 66.7%)
啤酒
C A (50%, 100%)
03/27/12 Data Mining 5
6. 挖掘关联规则—一个例子
Transaction-id Items bought
最小支持度 50%
最小置信度 50%
10 A, B, C
20 A, C Frequent pattern Support
30 A, D {A} 75%
40 B, E, F {B} 50%
{C} 50%
{A, C} 50%
规则 A ⇒ C:
支持度 = support({A}∪{C}) = 50%
置信度 = support({A}∪{C})/support({A}) = 667%
03/27/12 Data Mining 6
10. Apriori 算法
算法伪代码 :
Ck: 长度为 k 的候选项集
Lk : 长度为 k 的频繁项集
L1 = { 频繁项 };
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = 由 Lk 产生的候选 ;
for each 数据库中的事务 t do
增加包含在 t 中的所有候选 Ck+1 的计数
Lk+1 = Ck+1 中满足 min_support 的候选
end
return ∪ k Lk;
03/27/12 Data Mining 10
12. 如何产生候选 ?
假定 Lk-1 中的项集已排序
步骤 1: Lk-1 自连接
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 = q.itemk-1
Step 2: 剪枝
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
03/27/12 Data Mining 12
16. DIC(Dynamic itemset counting ): 减少扫描次数
ABCD 一旦确定 A 和 D 是频繁的 , 立
即开始 AD 的计数
ABC ABD ACD BCD 一旦确定 BCD 的两个长度为 2 的
子集是频繁的 , 立即开始 BCD 的
计数
AB AC BC AD BD CD
事务
1-itemsets
A B C D 2-itemsets
Apriori …
{}
1-itemsets
Itemset lattice
2-items
S. Brin R. Motwani, J. Ullman, and
S. Tsur. Dynamic itemset counting 3-items
DIC
and implication rules for market
basket data. In SIGMOD’97
03/27/12 Data Mining 16
17. 划分 : 只扫描数据库两次
项集在 DB 中是频繁的 , 它必须至少在 DB 的一个划分
中是频繁的
扫描 1: 划分数据库 , 并找出局部频繁模式
扫描 2: 求出全局频繁模式
A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association in large databases. In
VLDB’95
03/27/12 Data Mining 17
22. 划分模式和数据库
可以按照 f-list 将频繁模式划分成子集
F-list=f-c-a-b-m-p
包含 p 的模式
包含 m 但不包含 p 的模式
…
包含 c 但不包含 a, b, m, p
模式 f
03/27/12 Data Mining 22
23. 从 p- 条件数据库找出含 p 的模式
从 FP- 树的频繁项头表开始
沿着频繁项 p 的链搜索 FP- 树
收集项 p 的所有 变换的前缀路径 形成 p 的模式基
{}
头表
条件 模式基
Item frequency head f:4 c:1 item cond. pattern base
f 4 c f:3
c 4 c:3 b:1 b:1
a 3 a fc:3
b 3 a:3 p:1 b fca:1, f:1, c:1
m 3
p 3 m fca:2, fcab:1
m:2 b:1
p fcam:2, cb:1
p:2 m:1
03/27/12 Data Mining 23
24. 从条件模式基到条件 FP- 树
对于每个条件模式基
累计条件模式基中每个项的计数
构造模式基中频繁项的 FP- 树
m- 条件 模式基 :
{} fca:2, fcab:1
头表
Item frequency head f:4 c:1 涉及 m 的所有频
f 4 {} 繁模式
c 4 c:3 b:1 b:1
a 3 m,
f:3 fm, cm, am,
b 3 a:3 p:1
m 3 c:3 fcm, fam, cam,
p 3 m:2 b:1
fcam
p:2 m:1 a:3
m- 条件 FP- 树
03/27/12 Data Mining 24
66. 文献 : 频繁模式挖掘方法
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection
algorithm for generation of frequent itemsets. Journal of Parallel and
Distributed Computing, 2000.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
between sets of items in large databases. SIGMOD'93, 207-216,
Washington, D.C.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94 487-499, Santiago, Chile.
J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate
generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May
2000.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for
discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
03/27/12 Data Mining 66
67. 文献 : 频繁模式挖掘方法
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for
mining association rules in large databases. VLDB'95, 432-443, Zurich,
Switzerland.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques
for mining causal structures. VLDB'98, 594-605, New York, NY.
R. Srikant and R. Agrawal. Mining generalized association rules.
VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
R. Srikant and R. Agrawal. Mining quantitative association rules in
large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
H. Toivonen. Sampling large databases for association rules. VLDB'96,
134-145, Bombay, India, Sept. 1996.
M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms
for fast discovery of association rules. KDD’97. August 1997.
03/27/12 Data Mining 67
68. 文献 : 频繁模式挖掘 ( 性能改进
)
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket analysis. SIGMOD'97,
Tucson, Arizona, May 1997.
D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of
discovered association rules in large databases: An incremental updating
technique. ICDE'96, New Orleans, LA.
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining
using two-dimensional optimized association rules: Scheme, algorithms,
and visualization. SIGMOD'96, Montreal, Canada.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for
association rules. SIGMOD'97, Tucson, Arizona.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD'95, San Jose, CA, May 1995.
03/27/12 Data Mining 68
69. 文献 : 频繁模式挖掘 ( 性能改进
)
G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong
rules. In G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery
in Databases,. AAAI/MIT Press, 1991.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD'95, San Jose, CA.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule
mining with relational database systems: Alternatives and implications.
SIGMOD'98, Seattle, WA.
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama.
Computing optimized rectilinear regions for association rules. KDD'97,
Newport Beach, CA, Aug. 1997.
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm
for discovery of association rules. Data Mining and Knowledge
Discovery, 1:343-374, 1997.
03/27/12 Data Mining 69
70. 文献 : 频繁模式挖掘 ( 外延 )
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket:
Generalizing association rules to correlations. SIGMOD'97, 265-276,
Tucson, Arizona.
J. Han and Y. Fu. Discovery of multiple-level association rules from
large databases. VLDB'95, 420-431, Zurich, Switzerland.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I.
Verkamo. Finding interesting rules from large sets of discovered
association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New
York, NY.
03/27/12 Data Mining 70
71. 文献 : 频繁模式挖掘 ( 外延 )
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97,
220-231, Birmingham, England.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association
rules. VLDB'96, 122-133, Bombay, India.
R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97,
452-461, Tucson, Arizona.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative
associations in a large database of customer transactions. ICDE'98, 494-502,
Orlando, FL, Feb. 1998.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov.
Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12,
Seattle, Washington.
J. Pei, A.K.H. Tung, J. Han. Fault-Tolerant Frequent Pattern Mining:
Problems and Challenges. SIGMOD DMKD’01, Santa Barbara, CA.
03/27/12 Data Mining 71
72. 文献 : 挖掘最大模式和闭项集
R. J. Bayardo. Efficiently mining long patterns from databases.
SIGMOD'98, 85-93, Seattle, Washington.
J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", Proc. 2000 ACM-SIGMOD Int.
Workshop on Data Mining and Knowledge Discovery (DMKD'00),
Dallas, TX, May 2000.
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99, 398-416, Jerusalem,
Israel, Jan. 1999.
M. Zaki. Generating Non-Redundant Association Rules. KDD'00.
Boston, MA. Aug. 2000
M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule
Mining, SIAM’02
03/27/12 Data Mining 72
73. 文献 : 基于约束的频繁模式挖
掘
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of
constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb.
2000.
Y. Fu and J. Han. Meta-rule-guided mining of association rules in
relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based,
Multidimensional Data Mining", COMPUTER (special issues on Data
Mining), 32(8): 46-50, 1999.
L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of
Constrained Frequent Set Queries with 2-Variable Constraints",
SIGMOD’99
03/27/12 Data Mining 73
74. 文献 : 基于约束的频繁模式挖
掘
R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. “Exploratory mining
and pruning optimizations of constrained association rules.”
SIGMOD’98
J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets
with Convertible Constraints", Proc. 2001 Int. Conf. on Data
Engineering (ICDE'01), April 2001.
J. Pei and J. Han "Can We Push More Constraints into Frequent
Pattern Mining?", Proc. 2000 Int. Conf. on Knowledge Discovery and
Data Mining (KDD'00), Boston, MA, August 2000.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97, 67-73, Newport Beach, California.
03/27/12 Data Mining 74
75. 文献 : 序列模式挖掘方法
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14,
Taipei, Taiwan.
R. Srikant and R. Agrawal. Mining sequential patterns:
Generalizations and performance improvements. EDBT’96.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu,
"FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining",
Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'00), Boston, MA, August 2000.
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent
episodes in event sequences. Data Mining and Knowledge Discovery,
1:259-289, 1997.
03/27/12 Data Mining 75
76. 文献 : 序列模式挖掘方法
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu,
"PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-
Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering
(ICDE'01), Heidelberg, Germany, April 2001.
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules.
ICDE'98, 412-421, Orlando, FL.
S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of
interesting patterns in association rules. VLDB'98, 368-379, New York,
NY.
M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98.
Novermber 1998.
M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999: 223-234,
Edinburgh, Scotland.
03/27/12 Data Mining 76
77. 文献 : 空间 , 多媒体 , 文本和
Web 数据库频繁模式挖掘
K. Koperski, J. Han, and G. B. Marchisio, "Mining Spatial and Image
Data through Progressive Refinement Methods", Revue internationale
de gomatique (European Journal of GIS and Spatial Analysis),
9(4):425-440, 1999.
A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Breaking the Barrier of
Transactions: Mining Inter-Transaction Association Rules", Proc. 1999
Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), San
Diego, CA, Aug. 1999, pp. 297-301.
J. Han, G. Dong and Y. Yin, "Efficient Mining of Partial Periodic
Patterns in Time Series Database", Proc. 1999 Int. Conf. on Data
Engineering (ICDE'99), Sydney, Australia, March 1999, pp. 106-115.
03/27/12 Data Mining 77
78. 文献 : 空间 , 多媒体 , 文本和
Web 数据库频繁模式挖掘
H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association
Analysis:Mining Multi-Dimensional Inter-Transaction Association
Rules", ACM Transactions on Information Systems (TOIS’00), 18(4):
423-454, 2000.
O. R. Zaiane, M. Xin, J. Han, "Discovering Web Access Patterns and
Trends by Applying OLAP and Data Mining Technology on Web
Logs," Proc. Advances in Digital Librar ies Conf. (ADL'98), Santa
Barbara, CA, April 1998, pp. 19-29
O. R. Zaiane, J. Han, and H. Zhu, "Mining Recurrent Items in
Multimedia with Progressive Resolution Refinement", Proc. 2000 Int.
Conf. on Data Engineering (ICDE'00), San Diego, CA, Feb. 2000, pp.
461-470.
03/27/12 Data Mining 78
79. 文献 : 用于分类和数据方计算的频繁模式挖掘
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D.
Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New
York, NY, Aug. 1998.
J. Han, J. Pei, G. Dong, and K. Wang, “Computing Iceberg Data Cubes
with Complex Measures”, Proc. ACM-SIGMOD’2001, Santa Barbara,
CA, May 2001.
M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of
multi-dimensional association rules using data cubes. KDD'97, 207-210,
Newport Beach, California.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD’99
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades:
Generalizing association rules. Technical Report, Aug. 2000
03/27/12 Data Mining 79