Data mining, data warehousing and knowledge discovery
1. Data Mining, Data Warehousing and Knowledge Discovery
Basic Algorithms and Concepts
Srinath Srinivasa
IIIT Bangalore
sri@iiitb.ac.in
2. Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
– Tabular data mining
– Association, Classification and Clustering
– Sequence data mining
– Streaming data mining
• Data Warehousing concepts
3. Why Data Mining?
From a managerial perspective:
• Analyzing trends
• Wealth generation
• Security
• Strategic decision making
4. Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
• No query…
• …but an “Interestingness criteria”
5. Data Mining
Data + Interestingness criteria = Hidden patterns
6. Data Mining
Data + Interestingness criteria = Hidden patterns
(a mining task is characterised by the type of patterns sought)
7. Data Mining
Data + Interestingness criteria = Hidden patterns
(and also by the type of data and the type of interestingness criteria)
9. Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal
data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
10. Data Mining vs Statistical Inference
Statistics:
Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
11. Data Mining vs Statistical Inference
Data mining:
Data → Mining Algorithm (based on interestingness) → Pattern discovery (model, rule, hypothesis)
12. Data Mining Concepts
Associations and Item-sets:
An association is a rule of the form: if X then Y. It is denoted as X → Y.
Example:
If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X, then X and Y are called an “interesting item-set”.
Example:
People buying school uniforms in June also buy school bags
(people buying school bags in June also buy school uniforms)
13. Data Mining Concepts
Support and Confidence:
The support for a rule X → Y is the fraction of all transactions in which both X and Y occur.
The confidence of a rule X → Y is the fraction of the transactions containing X that also contain Y.
14. Data Mining Concepts
Support and Confidence (Example):

Transactions:
Bag, Uniform, Crayons
Books, Bag, Uniform
Bag, Uniform, Pencil
Bag, Pencil, Books
Uniform, Crayons, Bag
Bag, Pencil, Books
Crayons, Uniform, Bag
Books, Crayons, Bag
Uniform, Crayons, Pencil
Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
15. Mining for Frequent Item-sets
The Apriori Algorithm:
Given minimum required support s as interestingness criterion:
1. Search for all individual elements (1-element item-sets) that have a minimum support of s
2. Repeat
   1. From the i-element item-sets found in the previous step, search for all (i+1)-element item-sets that have a minimum support of s
   2. This becomes the set of all frequent (i+1)-element item-sets that are interesting
3. Until item-set size reaches maximum
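As a sketch, the level-wise search above can be written in Python over the transactions of slide 14; the function and variable names are illustrative, not part of the original algorithm description.

```python
# A minimal sketch of the Apriori level-wise search described above.
# Transactions are the school-supplies baskets from the running example.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(minsup):
    """Return {item-set: support} for all item-sets with support >= minsup."""
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items
    level = [s for s in (frozenset([i]) for i in items) if support(s) >= minsup]
    frequent = {s: support(s) for s in level}
    k = 1
    while level:
        # Candidate (k+1)-item-sets: unions of frequent k-item-sets
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = [c for c in candidates if support(c) >= minsup]
        frequent.update((c, support(c)) for c in level)
        k += 1
    return frequent

frequent = apriori(0.3)
print(frequent[frozenset({"Bag", "Uniform", "Crayons"})])  # 0.3
```

With minsup = 0.3 this reproduces the item-sets listed on the next two slides; the candidate-by-union step is a simple (over-generating) variant of the join used in the original paper.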
16. Mining for Frequent Item-sets
The Apriori Algorithm (Example):
Let minimum support = 0.3, over the transactions of slide 14.
Interesting 1-element item-sets:
{Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets:
{Bag,Uniform}, {Bag,Crayons}, {Bag,Pencil}, {Bag,Books}, {Uniform,Crayons}, {Uniform,Pencil}, {Pencil,Books}
17. Mining for Frequent Item-sets
The Apriori Algorithm (Example):
Let minimum support = 0.3
Interesting 3-element item-sets:
{Bag,Uniform,Crayons}
18. Mining for Association Rules
Association rules are of the form A → B, which are directional…
Association rule mining requires two thresholds: minsup and minconf
19. Mining for Association Rules
Mining association rules using Apriori
General Procedure:
1. Use Apriori to generate frequent itemsets of different sizes
2. At each iteration, divide each frequent itemset X into two parts, LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X)/support(LHS)
4. Discard all rules whose confidence is less than minconf.
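Steps 2 to 4 above can be sketched as follows, reusing the transactions of slide 14; the function names are illustrative.

```python
from itertools import combinations

# Sketch of rule generation from one frequent item-set: split X into
# LHS -> RHS and keep rules with confidence >= minconf.
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"}, {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_from(x, minconf):
    """All rules LHS -> RHS obtained by splitting frequent item-set x."""
    rules = []
    for r in range(1, len(x)):
        for lhs in map(frozenset, combinations(sorted(x), r)):
            conf = support(x) / support(lhs)  # confidence = sup(X)/sup(LHS)
            if conf >= minconf:
                rules.append((set(lhs), set(x) - lhs, conf))
    return rules

rules = rules_from(frozenset({"Bag", "Uniform", "Crayons"}), minconf=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```

On this data, only the two rules with confidence 0.75 survive a minconf of 0.7.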
20. Mining for Association Rules
Mining association rules using Apriori
Example:
The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3.
This can be divided into the following rules:
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}
21. Mining for Association Rules
Mining association rules using Apriori
Confidence for these rules is as follows:
{Bag} → {Uniform, Crayons}: 0.375
{Bag, Uniform} → {Crayons}: 0.6
{Bag, Crayons} → {Uniform}: 0.75
{Uniform} → {Bag, Crayons}: 0.428
{Uniform, Crayons} → {Bag}: 0.75
{Crayons} → {Bag, Uniform}: 0.6
If minconf is 0.7, then we have discovered the following rules…
22. Mining for Association Rules
Mining association rules using Apriori
People who buy a school bag and a set of crayons are likely to buy a school uniform.
People who buy a school uniform and a set of crayons are likely to buy a school bag.
23. Generalized Association Rules
Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases.

Bill No. | Date | Item
15563 | 23.10.2003 | Books
15563 | 23.10.2003 | Crayons
15564 | 23.10.2003 | Uniform
15564 | 23.10.2003 | Crayons

24. Generalized Association Rules
A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields.
25. Generalized Association Rules
A GROUP BY over Bill No. would show frequent buying patterns across different customers.
A GROUP BY over Date would show frequent buying patterns across different days.
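The grouping step can be sketched in plain Python over the rows of the table above; the column layout and helper name are illustrative.

```python
from collections import defaultdict

# Sketch: derive transactions (item-sets) from individual purchase rows
# by grouping over a chosen column (0 = Bill No., 1 = Date).
rows = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def transactions_by(rows, key):
    """Group purchased items into item-sets, keyed by the given column."""
    groups = defaultdict(set)
    for bill_no, date, item in rows:
        groups[(bill_no, date)[key]].add(item)
    return dict(groups)

print(transactions_by(rows, 0))  # per bill: buying patterns across customers
print(transactions_by(rows, 1))  # per date: buying patterns across days
```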
26. Classification and Clustering
Given a set of data elements:
Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
Clustering groups data elements into different groups based on the similarity between elements within a single group.
27. Classification Techniques
Decision Tree Identification
Classification problem: Weather → Play(Yes, No)

Outlook | Temp | Play?
Sunny | 30 | Yes
Overcast | 15 | No
Sunny | 16 | Yes
Cloudy | 27 | Yes
Overcast | 25 | Yes
Overcast | 17 | No
Cloudy | 17 | No
Cloudy | 35 | Yes
28. Classification Techniques
Hunt’s method for decision tree identification:
Given N element types and m decision classes:
1. for i ← 1 to N do
   1. Add element i to the (i-1)-element item-sets from the previous iteration
   2. Identify the set of decision classes for each item-set
   3. If an item-set has only one decision class, then that item-set is done; remove it from subsequent iterations
2. done
29. Classification Techniques
Decision Tree Identification Example

Outlook | Temp | Play?
Sunny | Warm | Yes
Overcast | Chilly | No
Sunny | Chilly | Yes
Cloudy | Pleasant | Yes
Overcast | Pleasant | Yes
Overcast | Chilly | No
Cloudy | Chilly | No
Cloudy | Warm | Yes

Splitting on Outlook:
Sunny → Yes
Cloudy → Yes/No
Overcast → Yes/No
31. Classification Techniques
Decision Tree Identification Example
Refining the Cloudy branch on Temp:
Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes
32. Classification Techniques
Decision Tree Identification Example
Refining the Overcast branch on Temp:
Overcast, Chilly → No
Overcast, Pleasant → Yes
33. Classification Techniques
Decision Tree Identification Example
The resulting decision tree:
Outlook = Sunny → Yes
Outlook = Cloudy:
  Temp = Warm → Yes
  Temp = Pleasant → Yes
  Temp = Chilly → No
Outlook = Overcast:
  Temp = Pleasant → Yes
  Temp = Chilly → No
34. Classification Techniques
Decision Tree Identification
• Top-down technique for decision tree identification
• The decision tree created is sensitive to the order in which items are considered
• If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.
35. Other Classification Algorithms
Quinlan’s depth-first strategy builds the decision tree in a
depth-first fashion, by considering all possible tests that give a
decision and selecting the test that gives the best information
gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest) developed in the
QUEST project of IBM uses a top-down breadth-first strategy
to build a decision tree. At each level in the tree, an entropy
value of each node is calculated and nodes having the lowest
entropy values selected and expanded.
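The information gain that Quinlan-style builders maximise (and the entropy that SLIQ evaluates per node) can be sketched on the weather table of slide 29; the helper names are illustrative, and attribute index 0 is Outlook, 1 is Temp.

```python
from math import log2
from collections import Counter

# Weather rows from slide 29: (Outlook, Temp, Play?)
rows = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    """Entropy of Play? minus the weighted entropy after splitting on attr."""
    base = entropy([r[-1] for r in rows])
    values = {r[attr] for r in rows}
    remainder = sum(
        len(part) / len(rows) * entropy([r[-1] for r in part])
        for part in ([r for r in rows if r[attr] == v] for v in values)
    )
    return base - remainder

print(info_gain(rows, 0), info_gain(rows, 1))
```

On this table, splitting on Temp gives the higher gain, which is the kind of comparison a depth-first, gain-driven builder makes at each step.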
36. Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
37. Euclidean Distance for Tables
[Figure: the weather records plotted as points in a space with axes Outlook (Sunny, Cloudy, Overcast), Temp (Warm, Pleasant, Chilly) and Play (Play, Don’t Play); e.g. (Overcast, Chilly, Don’t Play) and (Cloudy, Pleasant, Play) are two such points.]
38. Clustering Techniques
General Strategy:
1. Draw a graph connecting items which are close to one another with edges.
2. Partition the graph into maximally connected subcomponents:
   1. Construct an MST for the graph
   2. Merge items that are connected by minimum-weight edges of the MST into a cluster
40. Clustering Techniques
Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, … xn, and threshold t:
1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   1. Find the nearest neighbour of xj
   2. Let the nearest neighbour be in cluster m
   3. If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
   4. j ← j+1
3. until j > n
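The loop above can be sketched on one-dimensional data; the points and the threshold t are illustrative.

```python
# Sketch of nearest-neighbour clustering on 1-D points: each new point
# joins its nearest neighbour's cluster unless it is farther than t away.
def nn_cluster(points, t):
    clusters = [[points[0]]]                 # x1 starts the first cluster
    for x in points[1:]:
        # Find the nearest already-clustered element and its cluster m
        m, d = min(
            ((ci, abs(x - y)) for ci, members in enumerate(clusters)
             for y in members),
            key=lambda p: p[1],
        )
        if d > t:
            clusters.append([x])             # too far away: new cluster
        else:
            clusters[m].append(x)            # else join cluster m
    return clusters

print(nn_cluster([1.0, 1.2, 5.0, 5.1, 9.7], t=1.0))
# [[1.0, 1.2], [5.0, 5.1], [9.7]]
```

Note that the result depends on the order in which elements are read, which is a known property of this algorithm.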
41. Clustering Techniques
Iterative partitional clustering:
Given n elements x1, x2, … xn, and k clusters, each with a center:
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroids for each of the clusters
3. Repeat the above two steps with the new centroids until the algorithm converges
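The two steps can be sketched on one-dimensional data (this is the k-means scheme); the points, initial centres and iteration cap are illustrative.

```python
# Sketch of iterative partitional clustering (k-means style) on 1-D data.
def partitional(points, centres, iters=10):
    clusters = []
    for _ in range(iters):
        # Step 1: assign each element to its closest cluster centre
        clusters = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Step 2: recompute centroids (empty clusters keep their old centre)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = partitional([1.0, 1.2, 5.0, 5.1, 9.7], [0.0, 5.0, 10.0])
print(centres)  # approx [1.1, 5.05, 9.7]
```

A fixed iteration count stands in for a proper convergence test (stop when the centroids no longer move).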
42. Mining Sequence Data
Characteristics of Sequence Data:
• Collection of data elements which are ordered sequences
• In a sequence, each item has an index associated with it
• A k-sequence is a sequence of length k. The support for a sequence s is the number of m-sequences (m ≥ k) which contain s as a subsequence
• Sequence data: transaction logs, DNA sequences, patient ailment history, …
43. Mining Sequence Data
Some Definitions:
• A sequence is a list of itemsets of finite length.
• Example:
  {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
  … the purchases of a single customer over time …
• The order of items within an itemset does not matter, but the order of itemsets matters
• A subsequence is a sequence with some itemsets deleted
44. Mining Sequence Data
Some Definitions:
• A sequence S’ = {a1, a2, …, am} is said to be contained
within another sequence S, if S contains a subsequence {b1, b2,
… bm} such that a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.
• Hence, {pen}{pencil}{ruler,pencil} is contained in
{pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
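The containment test defined above can be sketched as a greedy scan; the function name is illustrative.

```python
# Sketch of sequence containment: S' is contained in S if the itemsets
# of S' map, in order, into supersets among the itemsets of S.
def contained(sub, seq):
    j = 0
    for b in seq:
        if j < len(sub) and sub[j] <= b:   # a_j is a subset of b: advance
            j += 1
    return j == len(sub)

seq = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
       {"ink", "eraser"}, {"ruler", "pencil"}]
sub = [{"pen"}, {"pencil"}, {"ruler", "pencil"}]
print(contained(sub, seq))  # True
```

Greedily matching each itemset against the earliest possible superset is safe here, for the same reason it is safe in ordinary subsequence matching.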
45. Mining Sequence Data
Apriori Algorithm for Sequences:
1. L1 ← set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   1. Generate all candidate (k+1)-sequences
   2. Lk+1 ← set of all interesting (k+1)-sequences
   3. k ← k+1
4. done
46. Mining Sequence Data
Generating Candidate Sequences:
Given L1, L2, … Lk, candidate sequences of Lk+1 are generated as follows:
For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
47. Mining Sequence Data
Example: minsup = 0.5
Sequences:
abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde

Interesting 1-sequences: a, b, d, e
Candidate 2-sequences:
aa, ab, ad, ae, ba, bb, bd, be, da, db, dd, de, ea, eb, ed, ee
49. Mining Sequence Data
Language Inference:
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behavior.
Example input sequences: aabb, ababcac, abbac, …
[Figure: input set of sequences → output state machine]
50. Mining Sequence Data
• Inferring the syntax of a language given
its sentences
• Applications: discerning behavioural
patterns, emergent properties
discovery, collaboration modeling, …
• State machine discovery is the reverse
of state machine construction
• Discovery is “maximalist” in nature…
51. Mining Sequence Data
“Maximal” nature of language inference:
Given the sequences abc, aabc, aabbc, abbc:
[Figure: the “most general” state machine is a single state with self-loops on a, b and c; the “most specific” state machine spells out a separate path for each input sequence.]
52. Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Given a set of n sequences:
1. Create a state machine for the first sequence
2. for j ← 2 to n do
   1. Create a state machine for the jth sequence
   2. Merge this machine into the earlier machine as follows:
      1. Merge all halt states in the new state machine into the halt state of the existing state machine
      2. If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
3. done
53. Mining Sequence Data
“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
[Figure: state machines for the sequences aabcb, aac and aabc are merged step by step; paths sharing the same suffix into the halt state are collapsed into a single path.]
54. Mining Streaming Data
Characteristics of streaming data:
• Large data sequence
• No storage
• Often an infinite sequence
• Examples: Stock market quotes, streaming audio/video,
network traffic
55. Mining Streaming Data
Running mean:
Let n = number of items read so far,
avg = running average calculated so far.
On reading the next number num:
avg ← (n*avg + num) / (n+1)
n ← n+1
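The update above, as a short sketch; the class and method names are illustrative.

```python
# Sketch of the running-mean update: only n and avg are stored,
# never the stream itself.
class RunningMean:
    def __init__(self):
        self.n, self.avg = 0, 0.0

    def read(self, num):
        # avg <- (n*avg + num) / (n+1);  n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1

m = RunningMean()
for num in [2.0, 4.0, 9.0]:
    m.read(num)
print(m.avg)  # 5.0
```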
56. Mining Streaming Data
Running variance:
var = ∑(num-avg)² = ∑num² - 2*∑num*avg + ∑avg²
Let A = ∑num² of all numbers read so far
B = 2*∑num*avg of all numbers read so far
C = ∑avg² of all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far
57. Mining Streaming Data
Running variance:
On reading next number num:
avg ← (avg*n + num) / (n+1)
n ← n+1
A ← A + num²
B ← B + 2*avg*num
C ← C + avg²
var = A - B + C
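Because the B and C accumulators above mix averages taken at different times, a sketch that stores only n, ∑num and ∑num² and applies the identity with the final average is shown below; for long streams, Welford's algorithm is the numerically safer alternative. Names are illustrative.

```python
# Sketch of a streaming variance: store only n, the running sum and the
# running sum of squares, then var = sum(num^2) - (sum(num))^2 / n,
# which is the identity above evaluated with the final average.
class RunningVar:
    def __init__(self):
        self.n, self.s, self.sq = 0, 0.0, 0.0

    def read(self, num):
        self.n += 1
        self.s += num
        self.sq += num * num

    @property
    def var(self):
        """Sum of squared deviations from the mean, as on the slide."""
        return self.sq - self.s * self.s / self.n

v = RunningVar()
for num in [2.0, 4.0, 9.0]:
    v.read(num)
print(v.var)  # 26.0  (deviations -3, -1, 4 from avg 5)
```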
58. Mining Streaming Data
γ-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999)
Let streaming data be in the form of “frames”, where each frame comprises one or more data elements.
Support for data element k within a frame is defined as (#occurrences of k) / (#elements in frame).
γ-Consistency for data element k is the “sustained” support for k over all frames read so far, with a “leakage” of (1-γ).
60. Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of enterprises
• Some typical DW queries:
  – Which item sells best in each region that has retail outlets?
  – Which advertising strategy is best for South India?
  – Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?
61. Data Warehousing
[Figure: OLTP sources (Order Processing, Inventory, Sales) feed through Data Cleaning into the Data Warehouse (OLAP).]
62. OLTP vs OLAP
Transactional Data (OLTP) | Analysis Data (OLAP)
Small or medium size databases | Very large databases
Transient data | Archival data
Frequent insertions and updates | Infrequent updates
Small query shadow | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries
63. Data Cleaning
• Performs logical transformation of transactional data to suit the data warehouse
• Model of operations → model of enterprise
• Usually a semi-automatic process
64. Data Cleaning
[Figure: data cleaning maps transactional tables (Orders: Order_id, Cust_id, Price; Inventory: Prod_id, Price, Price_chng; Sales: Cust_id, Cust_prof, Tot_sales) into a data warehouse organized around the dimensions Customers, Products, Orders, Inventory and Time.]
69. References
• R. Agrawal, R. Srikant: “Fast Algorithms for Mining
Association Rules”, Proc. of the 20th Int'l Conference on
Very Large Databases, Santiago, Chile, Sept. 1994.
• R. Agrawal, R. Srikant, ``Mining Sequential Patterns'', Proc.
of the Int'l Conference on Data Engineering (ICDE), Taipei,
Taiwan, March 1995.
• R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R.
Srikant: "The Quest Data Mining System", Proc. of the 2nd
Int'l Conference on Knowledge Discovery in Databases and
Data Mining, Portland, Oregon, August, 1996.
• Surajit Chaudhuri, Umesh Dayal. An Overview of Data
Warehousing and OLAP Technology. ACM SIGMOD Record.
26(1), March 1997.
• Jennifer Widom. Research Problems in Data Warehousing.
Proc. of Int’l Conf. On Information and Knowledge
Management, 1995.
70. References
• A. Shoshani. OLAP and Statistical Databases: Similarities and
Differences. Proc. of ACM PODS 1997.
• Panos Vassiliadis, Timos Sellis. A Survey on Logical Models
for OLAP Databases. ACM SIGMOD Record
• M. Gyssens, Laks VS Lakshmanan. A Foundation for Multi-
Dimensional Databases. Proc of VLDB 1997, Athens, Greece.
• Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions
Based on Consistent Patterns. Proc. of CoopIS 1999,
Edinburgh, UK.
• Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral
Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000,
Como, Italy.