Data mining, data warehousing and knowledge discovery
1. Data Mining, Data Warehousing and Knowledge Discovery
Basic Algorithms and Concepts
Srinath Srinivasa
IIIT Bangalore
sri@iiitb.ac.in
2. Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
– Tabular data mining
– Association, Classification and Clustering
– Sequence data mining
– Streaming data mining
• Data Warehousing concepts
3. Why Data Mining?
From a managerial perspective:
• Analyzing trends
• Wealth generation
• Security
• Strategic decision making
4. Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
• No query…
• …but an “Interestingness criteria”
5. Data Mining
Data + Interestingness criteria = Hidden patterns
6. Data Mining
Data + Interestingness criteria = Hidden patterns
(a mining task is characterised by the type of patterns sought)
7. Data Mining
Data + Interestingness criteria = Hidden patterns
(and also by the type of data and the type of interestingness criteria)
9. Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal
data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
10. Data Mining vs Statistical Inference
Statistics:
Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
11. Data Mining vs Statistical Inference
Data mining:
Data → Mining Algorithm (based on interestingness) → Pattern discovery (model, rule, hypothesis)
12. Data Mining Concepts
Associations and Item-sets:
An association is a rule of the form: if X then Y. It is denoted as X → Y.
Example:
If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X, then X and Y are called an “interesting item-set”.
Example:
People buying school uniforms in June also buy school bags
(people buying school bags in June also buy school uniforms)
13. Data Mining Concepts
Support and Confidence:
The support for a rule X → Y is the fraction of all transactions in which both X and Y occur.
The confidence of a rule X → Y is the fraction of the transactions containing X that also contain Y.
14. Data Mining Concepts
Support and Confidence (Example):

Transactions:
Bag, Uniform, Crayons
Books, Bag, Uniform
Bag, Uniform, Pencil
Bag, Pencil, Books
Uniform, Crayons, Bag
Bag, Pencil, Books
Crayons, Uniform, Bag
Books, Crayons, Bag
Uniform, Crayons, Pencil
Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
15. Mining for Frequent Item-sets
The Apriori Algorithm:
Given minimum required support s as interestingness criterion:
1. Search for all individual elements (1-element item-sets) that have a minimum support of s
2. Repeat
   1. From the i-element item-sets found in the previous step, search for all (i+1)-element item-sets that have a minimum support of s
   2. This becomes the set of all frequent (i+1)-element item-sets that are interesting
3. Until item-set size reaches maximum
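As a sketch, the level-wise search above can be written in Python over the transactions of slide 14; the function and variable names are illustrative, not part of the original algorithm description.

```python
# A minimal sketch of the Apriori level-wise search described above.
# Transactions are the school-supplies baskets from the running example.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(minsup):
    """Return {item-set: support} for all item-sets with support >= minsup."""
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items
    level = [s for s in (frozenset([i]) for i in items) if support(s) >= minsup]
    frequent = {s: support(s) for s in level}
    k = 1
    while level:
        # Candidate (k+1)-item-sets: unions of frequent k-item-sets
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = [c for c in candidates if support(c) >= minsup]
        frequent.update((c, support(c)) for c in level)
        k += 1
    return frequent

frequent = apriori(0.3)
print(frequent[frozenset({"Bag", "Uniform", "Crayons"})])  # 0.3
```

With minsup = 0.3 this reproduces the item-sets listed on the next two slides; the candidate-by-union step is a simple (over-generating) variant of the join used in the original paper.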
16. Mining for Frequent Item-sets
The Apriori Algorithm (Example):
Let minimum support = 0.3, over the transactions of slide 14.
Interesting 1-element item-sets:
{Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets:
{Bag,Uniform}, {Bag,Crayons}, {Bag,Pencil}, {Bag,Books}, {Uniform,Crayons}, {Uniform,Pencil}, {Pencil,Books}
17. Mining for Frequent Item-sets
The Apriori Algorithm (Example):
Let minimum support = 0.3
Interesting 3-element item-sets:
{Bag,Uniform,Crayons}
18. Mining for Association Rules
Association rules are of the form A → B, which are directional…
Association rule mining requires two thresholds: minsup and minconf
19. Mining for Association Rules
Mining association rules using Apriori
General Procedure:
1. Use Apriori to generate frequent itemsets of different sizes
2. At each iteration, divide each frequent itemset X into two parts, LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X)/support(LHS)
4. Discard all rules whose confidence is less than minconf.
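Steps 2 to 4 above can be sketched as follows, reusing the transactions of slide 14; the function names are illustrative.

```python
from itertools import combinations

# Sketch of rule generation from one frequent item-set: split X into
# LHS -> RHS and keep rules with confidence >= minconf.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_from(x, minconf):
    """All rules LHS -> RHS obtained by splitting frequent item-set x."""
    rules = []
    for r in range(1, len(x)):
        for lhs in map(frozenset, combinations(sorted(x), r)):
            conf = support(x) / support(lhs)  # confidence = sup(X)/sup(LHS)
            if conf >= minconf:
                rules.append((set(lhs), set(x) - lhs, conf))
    return rules

rules = rules_from(frozenset({"Bag", "Uniform", "Crayons"}), minconf=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```

On this data, only the two rules with confidence 0.75 survive a minconf of 0.7.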
20. Mining for Association Rules
Mining association rules using Apriori
Example:
The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3.
This can be divided into the following rules:
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}
21. Mining for Association Rules
Mining association rules using Apriori
Confidence for these rules is as follows:
{Bag} → {Uniform, Crayons}: 0.375
{Bag, Uniform} → {Crayons}: 0.6
{Bag, Crayons} → {Uniform}: 0.75
{Uniform} → {Bag, Crayons}: 0.428
{Uniform, Crayons} → {Bag}: 0.75
{Crayons} → {Bag, Uniform}: 0.6
If minconf is 0.7, then we have discovered the following rules…
22. Mining for Association Rules
Mining association rules using Apriori
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
23. Generalized Association Rules
Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases.

Bill No. | Date | Item
15563 | 23.10.2003 | Books
15563 | 23.10.2003 | Crayons
15564 | 23.10.2003 | Uniform
15564 | 23.10.2003 | Crayons

24. Generalized Association Rules
A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields.
25. Generalized Association Rules
A GROUP BY over Bill No. would show frequent buying patterns across different customers.
A GROUP BY over Date would show frequent buying patterns across different days.
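The grouping step can be sketched in plain Python over the rows of the table above; the column layout and helper name are illustrative.

```python
from collections import defaultdict

# Sketch: derive transactions (item-sets) from individual purchase rows
# by grouping over a chosen column (0 = Bill No., 1 = Date).
rows = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def transactions_by(rows, key):
    """Group purchased items into item-sets, keyed by the given column."""
    groups = defaultdict(set)
    for bill_no, date, item in rows:
        groups[(bill_no, date)[key]].add(item)
    return dict(groups)

print(transactions_by(rows, 0))  # per bill: buying patterns across customers
print(transactions_by(rows, 1))  # per date: buying patterns across days
```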
26. Classification and Clustering
Given a set of data elements:
Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
Clustering groups data elements into different groups based on the similarity between elements within a single group.
27. Classification Techniques
Decision Tree Identification
Classification problem: Weather → Play(Yes, No)

Outlook | Temp | Play?
Sunny | 30 | Yes
Overcast | 15 | No
Sunny | 16 | Yes
Cloudy | 27 | Yes
Overcast | 25 | Yes
Overcast | 17 | No
Cloudy | 17 | No
Cloudy | 35 | Yes
28. Classification Techniques
Hunt’s method for decision tree identification:
Given N element types and m decision classes:
1. for i ← 1 to N do
   1. Add element i to the (i-1)-element item-sets from the previous iteration
   2. Identify the set of decision classes for each item-set
   3. If an item-set has only one decision class, then that item-set is done; remove it from subsequent iterations
2. done
29. Classification Techniques
Decision Tree Identification Example

Outlook | Temp | Play?
Sunny | Warm | Yes
Overcast | Chilly | No
Sunny | Chilly | Yes
Cloudy | Pleasant | Yes
Overcast | Pleasant | Yes
Overcast | Chilly | No
Cloudy | Chilly | No
Cloudy | Warm | Yes

Splitting on Outlook:
Sunny → Yes
Cloudy → Yes/No
Overcast → Yes/No
31. Classification Techniques
Decision Tree Identification Example
Refining the Cloudy branch on Temp:
Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes
32. Classification Techniques
Decision Tree Identification Example
Refining the Overcast branch on Temp:
Overcast, Chilly → No
Overcast, Pleasant → Yes
33. Classification Techniques
Decision Tree Identification Example
The resulting decision tree:
Outlook = Sunny → Yes
Outlook = Cloudy:
  Temp = Warm → Yes
  Temp = Pleasant → Yes
  Temp = Chilly → No
Outlook = Overcast:
  Temp = Pleasant → Yes
  Temp = Chilly → No
34. Classification Techniques
Decision Tree Identification
• Top-down technique for decision tree identification
• The decision tree created is sensitive to the order in which items are considered
• If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.
35. Other Classification Algorithms
Quinlan’s depth-first strategy builds the decision tree in a
depth-first fashion, by considering all possible tests that give a
decision and selecting the test that gives the best information
gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest) developed in the
QUEST project of IBM uses a top-down breadth-first strategy
to build a decision tree. At each level in the tree, an entropy
value of each node is calculated and nodes having the lowest
entropy values selected and expanded.
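The information gain that Quinlan-style builders maximise (and the entropy that SLIQ evaluates per node) can be sketched on the weather table of slide 29; the helper names are illustrative, and attribute index 0 is Outlook, 1 is Temp.

```python
from math import log2
from collections import Counter

# Weather rows from slide 29: (Outlook, Temp, Play?)
rows = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    """Entropy of Play? minus the weighted entropy after splitting on attr."""
    base = entropy([r[-1] for r in rows])
    values = {r[attr] for r in rows}
    remainder = sum(
        len(part) / len(rows) * entropy([r[-1] for r in part])
        for part in ([r for r in rows if r[attr] == v] for v in values)
    )
    return base - remainder

print(info_gain(rows, 0), info_gain(rows, 1))
```

On this table, splitting on Temp gives the higher gain, which is the kind of comparison a depth-first, gain-driven builder makes at each step.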
36. Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
37. Euclidean Distance for Tables
[Figure: the weather records plotted as points in a space with axes Outlook (Sunny, Cloudy, Overcast), Temp (Warm, Pleasant, Chilly) and Play (Play, Don’t Play); e.g. (Overcast, Chilly, Don’t Play) and (Cloudy, Pleasant, Play) are two such points.]
38. Clustering Techniques
General Strategy:
1. Draw a graph connecting items which are close to one another with edges.
2. Partition the graph into maximally connected subcomponents:
   1. Construct an MST for the graph
   2. Merge items that are connected by minimum-weight edges of the MST into a cluster
40. Clustering Techniques
Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, … xn, and threshold t:
1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   1. Find the nearest neighbour of xj
   2. Let the nearest neighbour be in cluster m
   3. If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
   4. j ← j+1
3. until j > n
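The loop above can be sketched on one-dimensional data; the points and the threshold t are illustrative.

```python
# Sketch of nearest-neighbour clustering on 1-D points: each new point
# joins its nearest neighbour's cluster unless it is farther than t away.
def nn_cluster(points, t):
    clusters = [[points[0]]]                 # x1 starts the first cluster
    for x in points[1:]:
        # Find the nearest already-clustered element and its cluster m
        m, d = min(
            ((ci, abs(x - y)) for ci, members in enumerate(clusters)
             for y in members),
            key=lambda p: p[1],
        )
        if d > t:
            clusters.append([x])             # too far away: new cluster
        else:
            clusters[m].append(x)            # else join cluster m
    return clusters

print(nn_cluster([1.0, 1.2, 5.0, 5.1, 9.7], t=1.0))
# [[1.0, 1.2], [5.0, 5.1], [9.7]]
```

Note that the result depends on the order in which elements are read, which is a known property of this algorithm.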
41. Clustering Techniques
Iterative partitional clustering:
Given n elements x1, x2, … xn, and k clusters, each with a center:
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroids for each of the clusters
3. Repeat the above two steps with the new centroids until the algorithm converges
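The two steps can be sketched on one-dimensional data (this is the k-means scheme); the points, initial centres and iteration cap are illustrative.

```python
# Sketch of iterative partitional clustering (k-means style) on 1-D data.
def partitional(points, centres, iters=10):
    clusters = []
    for _ in range(iters):
        # Step 1: assign each element to its closest cluster centre
        clusters = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Step 2: recompute centroids (empty clusters keep their old centre)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = partitional([1.0, 1.2, 5.0, 5.1, 9.7], [0.0, 5.0, 10.0])
print(centres)  # approx [1.1, 5.05, 9.7]
```

A fixed iteration count stands in for a proper convergence test (stop when the centroids no longer move).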
42. Mining Sequence Data
Characteristics of Sequence Data:
• Collection of data elements which are ordered sequences
• In a sequence, each item has an index associated with it
• A k-sequence is a sequence of length k. The support for a sequence s is the number of m-sequences (m ≥ k) which contain s as a subsequence
• Sequence data: transaction logs, DNA sequences, patient ailment history, …
43. Mining Sequence Data
Some Definitions:
• A sequence is a list of itemsets of finite length.
• Example:
  {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
  … the purchases of a single customer over time …
• The order of items within an itemset does not matter, but the order of itemsets matters
• A subsequence is a sequence with some itemsets deleted
44. Mining Sequence Data
Some Definitions:
• A sequence S’ = {a1, a2, …, am} is said to be contained
within another sequence S, if S contains a subsequence {b1, b2,
… bm} such that a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.
• Hence, {pen}{pencil}{ruler,pencil} is contained in
{pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
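The containment test defined above can be sketched as a greedy scan; the function name is illustrative.

```python
# Sketch of sequence containment: S' is contained in S if the itemsets
# of S' map, in order, into supersets among the itemsets of S.
def contained(sub, seq):
    j = 0
    for b in seq:
        if j < len(sub) and sub[j] <= b:   # a_j is a subset of b: advance
            j += 1
    return j == len(sub)

seq = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
       {"ink", "eraser"}, {"ruler", "pencil"}]
sub = [{"pen"}, {"pencil"}, {"ruler", "pencil"}]
print(contained(sub, seq))  # True
```

Greedily matching each itemset against the earliest possible superset is safe here, for the same reason it is safe in ordinary subsequence matching.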
45. Mining Sequence Data
Apriori Algorithm for Sequences:
1. L1 ← set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   1. Generate all candidate (k+1)-sequences
   2. Lk+1 ← set of all interesting (k+1)-sequences
   3. k ← k+1
4. done
46. Mining Sequence Data
Generating Candidate Sequences:
Given L1, L2, … Lk, candidate sequences of Lk+1 are generated as follows:
For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
47. Mining Sequence Data
Example: minsup = 0.5
Sequences:
abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde

Interesting 1-sequences: a, b, d, e
Candidate 2-sequences:
aa, ab, ad, ae, ba, bb, bd, be, da, db, dd, de, ea, eb, ed, ee
49. Mining Sequence Data
Language Inference:
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behavior.
Example input sequences: aabb, ababcac, abbac, …
[Figure: input set of sequences → output state machine]
50. Mining Sequence Data
• Inferring the syntax of a language given
its sentences
• Applications: discerning behavioural
patterns, emergent properties
discovery, collaboration modeling, …
• State machine discovery is the reverse
of state machine construction
• Discovery is “maximalist” in nature…
51. Mining Sequence Data
“Maximal” nature of language inference:
Given the sequences abc, aabc, aabbc, abbc:
[Figure: the “most general” state machine is a single state with self-loops on a, b and c; the “most specific” state machine spells out a separate path for each input sequence.]
52. Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Given a set of n sequences:
1. Create a state machine for the first sequence
2. for j ← 2 to n do
   1. Create a state machine for the jth sequence
   2. Merge this machine into the earlier machine as follows:
      1. Merge all halt states in the new state machine into the halt state of the existing state machine
      2. If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
3. done
53. Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
[Figure: state machines for the sequences aabcb, aac and aabc are merged step by step; paths sharing the same suffix into the halt state are collapsed into a single path.]
54. Mining Streaming Data
Characteristics of streaming data:
• Large data sequence
• No storage
• Often an infinite sequence
• Examples: Stock market quotes, streaming audio/video,
network traffic
55. Mining Streaming Data
Running mean:
Let n = number of items read so far,
avg = running average calculated so far.
On reading the next number num:
avg ← (n*avg + num) / (n+1)
n ← n+1
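The update above, as a short sketch; the class and method names are illustrative.

```python
# Sketch of the running-mean update: only n and avg are stored,
# never the stream itself.
class RunningMean:
    def __init__(self):
        self.n, self.avg = 0, 0.0

    def read(self, num):
        # avg <- (n*avg + num) / (n+1);  n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1

m = RunningMean()
for num in [2.0, 4.0, 9.0]:
    m.read(num)
print(m.avg)  # 5.0
```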
56. Mining Streaming Data
Running variance:
var = ∑(num-avg)² = ∑num² - 2*∑num*avg + ∑avg²
Let A = ∑num² of all numbers read so far
B = 2*∑num*avg of all numbers read so far
C = ∑avg² of all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far
57. Mining Streaming Data
Running variance:
On reading next number num:
avg ← (avg*n + num) / (n+1)
n ← n+1
A ← A + num²
B ← B + 2*avg*num
C ← C + avg²
var = A - B + C
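Because the B and C accumulators above mix averages taken at different times, a sketch that stores only n, ∑num and ∑num² and applies the identity with the final average is shown below; for long streams, Welford's algorithm is the numerically safer alternative. Names are illustrative.

```python
# Sketch of a streaming variance: store only n, the running sum and the
# running sum of squares, then var = sum(num^2) - (sum(num))^2 / n,
# which is the identity above evaluated with the final average.
class RunningVar:
    def __init__(self):
        self.n, self.s, self.sq = 0, 0.0, 0.0

    def read(self, num):
        self.n += 1
        self.s += num
        self.sq += num * num

    @property
    def var(self):
        """Sum of squared deviations from the mean, as on the slide."""
        return self.sq - self.s * self.s / self.n

v = RunningVar()
for num in [2.0, 4.0, 9.0]:
    v.read(num)
print(v.var)  # 26.0  (deviations -3, -1, 4 from avg 5)
```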
58. Mining Streaming Data
γ-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)
Let streaming data be in the form of “frames”, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (#occurrences of k) / (#elements in frame).
γ-Consistency for data element k is the “sustained” support for k over all frames read so far, with a “leakage” of (1-γ).
60. Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of enterprises
• Some typical DW queries:
  – Which item sells best in each region that has retail outlets?
  – Which advertising strategy is best for South India?
  – Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?
61. Data Warehousing
[Figure: OLTP sources (Order Processing, Inventory, Sales) feed through Data Cleaning into the Data Warehouse (OLAP).]
62. OLTP vs OLAP
Transactional Data (OLTP) | Analysis Data (OLAP)
Small or medium size databases | Very large databases
Transient data | Archival data
Frequent insertions and updates | Infrequent updates
Small query shadow | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries
63. Data Cleaning
• Performs logical transformation of transactional data to suit the data warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
64. Data Cleaning
[Figure: data cleaning maps transactional tables (Orders: Order_id, Cust_id, Price; Inventory: Prod_id, Price, Price_chng; Sales: Cust_id, Cust_prof, Tot_sales) into a data warehouse organized around the dimensions Customers, Products, Orders, Inventory and Time.]
69. References
• R. Agrawal, R. Srikant: “Fast Algorithms for Mining
Association Rules”, Proc. of the 20th Int'l Conference on
Very Large Databases, Santiago, Chile, Sept. 1994.
• R. Agrawal, R. Srikant, ``Mining Sequential Patterns'', Proc.
of the Int'l Conference on Data Engineering (ICDE), Taipei,
Taiwan, March 1995.
• R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R.
Srikant: "The Quest Data Mining System", Proc. of the 2nd
Int'l Conference on Knowledge Discovery in Databases and
Data Mining, Portland, Oregon, August, 1996.
• Surajit Chaudhuri, Umesh Dayal. An Overview of Data
Warehousing and OLAP Technology. ACM SIGMOD Record.
26(1), March 1997.
• Jennifer Widom. Research Problems in Data Warehousing.
Proc. of Int’l Conf. On Information and Knowledge
Management, 1995.
70. References
• A. Shoshani. OLAP and Statistical Databases: Similarities and
Differences. Proc. of ACM PODS 1997.
• Panos Vassiliadis, Timos Sellis. A Survey on Logical Models
for OLAP Databases. ACM SIGMOD Record
• M. Gyssens, Laks VS Lakshmanan. A Foundation for Multi-
Dimensional Databases. Proc of VLDB 1997, Athens, Greece.
• Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions
Based on Consistent Patterns. Proc. of CoopIS 1999,
Edinburgh, UK.
• Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral
Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000,
Como, Italy.