Data Mining, Data
Warehousing and Knowledge
        Discovery
  Basic Algorithms and Concepts


          Srinath Srinivasa
           IIIT Bangalore
           sri@iiitb.ac.in
Overview
• Why Data Mining?
• Data Mining concepts
• Data Mining algorithms
  –   Tabular data mining
  –   Association, Classification and Clustering
  –   Sequence data mining
  –   Streaming data mining
• Data Warehousing concepts
Why Data Mining
    From a managerial perspective:

    • Strategic decision making
    • Wealth generation
    • Analyzing trends
    • Security
Data Mining
• Look for hidden patterns and trends in
  data that are not immediately apparent
  from summarizing the data

• No Query…

• …But an “Interestingness criteria”
Data Mining

   Data  +  Interestingness criteria  =  Hidden patterns

Data Mining

   Data  +  Interestingness criteria  =  Hidden patterns
                                         (type of patterns)
Data Mining

   Data            +  Interestingness criteria            =  Hidden patterns
   (type of data)     (type of interestingness criteria)
Type of Data
• Tabular                  (Ex: Transaction data)
    – Relational
    – Multi-dimensional
• Spatial                 (Ex: Remote sensing data)
• Temporal                (Ex: Log information)
    – Streaming       (Ex: multimedia, network traffic)
    – Spatio-temporal (Ex: GIS)
•   Tree               (Ex: XML data)
•   Graphs             (Ex: WWW, BioMolecular data)
•   Sequence           (Ex: DNA, activity logs)
•   Text, Multimedia …
Type of Interestingness
•   Frequency
•   Rarity
•   Correlation
•   Length of occurrence   (for sequence and temporal
    data)
•   Consistency
•   Repeating / periodicity
•   “Abnormal” behavior
•   Other patterns of interestingness…
Data Mining vs Statistical Inference
Statistics:

     Conceptual model (hypothesis)
        → Statistical reasoning
        → “Proof” (validation of hypothesis)
Data Mining vs Statistical Inference
Data mining:

     Data
        → Mining algorithm based on interestingness
        → Pattern (model, rule, hypothesis) discovery
Data Mining Concepts
Associations and Item-sets:

An association is a rule of the form: if X then Y.
It is denoted as X → Y.
Example:
         If India wins in cricket, sales of sweets go up.


If both X → Y and Y → X hold, then X and Y are called
an “interesting item-set”.
Example:
       People buying school uniforms in June also buy school bags
       (People buying school bags in June also buy school uniforms)
Data Mining Concepts
Support and Confidence:

The support for a rule R is the fraction of all transactions in
which every item of R occurs.


The confidence of a rule X → Y is the fraction of the transactions
containing X that also contain Y, that is, support(X ∪ Y) / support(X).
Data Mining Concepts
 Support and Confidence:

  Bag       Uniform   Crayons
  Books     Bag       Uniform
  Bag       Uniform   Pencil
  Bag       Pencil    Books
  Uniform   Crayons   Bag
  Bag       Pencil    Books
  Crayons   Uniform   Bag
  Books     Crayons   Bag
  Uniform   Crayons   Pencil
  Pencil    Uniform   Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
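The two figures above can be checked with a few lines of Python. This is a minimal sketch: the names TRANSACTIONS, support and confidence are illustrative choices, not part of the original slides, and each table row is treated as one transaction.

```python
# The ten transactions from the table above, one set per row.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Confidence of lhs -> rhs: count(lhs and rhs together) / count(lhs)."""
    both = set(lhs) | set(rhs)
    return count(both, transactions) / count(lhs, transactions)


print(support({"Bag", "Uniform"}, TRANSACTIONS))
print(confidence({"Bag"}, {"Uniform"}, TRANSACTIONS))
```

Working with counts rather than pre-divided supports keeps the ratios free of floating-point rounding surprises.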
Mining for Frequent Item-sets
The Apriori Algorithm:

Given minimum required support s as interestingness criterion:
1. Search for all individual elements (1-element item-sets) that
   have a minimum support of s
2. Repeat
   1. From the results of the previous search for i-element
      item-sets, search for all (i+1)-element item-sets that have a
      minimum support of s
   2. This becomes the set of all frequent (i+1)-element
      item-sets that are interesting
3. Until the item-set size reaches its maximum, or no new frequent
   item-sets are found
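The level-wise search described above can be sketched in Python as follows. This is an illustrative sketch, not the author's implementation: the function name apriori, the join-then-prune candidate generation, and the TRANSACTIONS data (the ten rows of the running school-supplies example) are assumptions made for the example.

```python
from itertools import combinations

# Assumed data: the ten transactions of the running example.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def apriori(transactions, minsup):
    """Return {size: {item-set: support}} for all frequent item-sets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-element item-sets.
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: v for s, v in current.items() if v >= minsup}

    frequent = {}
    size = 1
    while current:
        frequent[size] = current
        size += 1
        # Join: unions of two frequent sets that have exactly `size` items.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune: every (size-1)-element subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        # Keep only the candidates that reach the minimum support.
        scored = {c: support(c) for c in candidates}
        current = {c: v for c, v in scored.items() if v >= minsup}
    return frequent


result = apriori(TRANSACTIONS, 0.3)
for size, itemsets in result.items():
    print(size, sorted(sorted(s) for s in itemsets))
```

The prune step relies on the fact that an item-set can only be frequent if all of its subsets are frequent, which is what lets each level be built from the previous one.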
Mining for Frequent Item-sets
 The Apriori Algorithm: (Example)

Using the ten transactions from the Support and Confidence example,
let minimum support = 0.3.

Interesting 1-element item-sets:
   {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}

Interesting 2-element item-sets:
   {Bag,Uniform} {Bag,Crayons} {Bag,Pencil} {Bag,Books}
   {Uniform,Crayons} {Uniform,Pencil} {Pencil,Books}
Mining for Frequent Item-sets
 The Apriori Algorithm: (Example)

With minimum support = 0.3 on the same ten transactions:

Interesting 3-element item-sets:
   {Bag,Uniform,Crayons}
Mining for Association Rules
Association rules are of the form

   A → B

which are directional.

Association rule mining requires two thresholds:

   minsup and minconf
Mining for Association Rules
 Mining association rules using apriori

General Procedure:
1.   Use apriori to generate frequent itemsets of different sizes
2.   At each iteration divide each frequent itemset X into two parts,
     LHS and RHS. This represents a rule of the form LHS → RHS
3.   The confidence of such a rule is support(X)/support(LHS)
4.   Discard all rules whose confidence is less than minconf.
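Steps 2 to 4 of the procedure above can be sketched as a small Python function. The names rules_from_itemset and count, and the TRANSACTIONS list (the ten rows of the running example), are hypothetical and chosen only for illustration. Confidence is computed from raw counts, which is equivalent to support(X)/support(LHS).

```python
from itertools import combinations

# Assumed data: the ten transactions of the running example.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, transactions, minconf):
    """Return (lhs, rhs, confidence) for each split of `itemset` with
    confidence >= minconf."""
    itemset = frozenset(itemset)
    whole = count(itemset, transactions)
    found = []
    # Every non-empty proper subset is tried as the left-hand side.
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = whole / count(lhs, transactions)
            if conf >= minconf:
                found.append((lhs, itemset - lhs, conf))
    return found


for lhs, rhs, conf in rules_from_itemset(
        {"Bag", "Uniform", "Crayons"}, TRANSACTIONS, 0.7):
    print(sorted(lhs), "->", sorted(rhs), conf)
```

The function handles one frequent itemset at a time. In a full miner it would be called for every itemset that apriori returns.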
Mining for Association Rules
 Mining association rules using apriori
Example:

The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3.

This can be divided into the following rules:
   {Bag} → {Uniform, Crayons}
   {Bag, Uniform} → {Crayons}
   {Bag, Crayons} → {Uniform}
   {Uniform} → {Bag, Crayons}
   {Uniform, Crayons} → {Bag}
   {Crayons} → {Bag, Uniform}
Mining for Association Rules
 Mining association rules using apriori
The confidences of these rules are as follows:

   {Bag} → {Uniform, Crayons}       0.375
   {Bag, Uniform} → {Crayons}       0.6
   {Bag, Crayons} → {Uniform}       0.75
   {Uniform} → {Bag, Crayons}       0.429
   {Uniform, Crayons} → {Bag}       0.75
   {Crayons} → {Bag, Uniform}       0.6   (0.3/0.5, since {Crayons} has support 0.5)

If minconf is 0.7, then we have discovered the following rules…
Mining for Association Rules
 Mining association rules using apriori
People who buy a school bag and a set of crayons are likely to buy
a school uniform.

People who buy a school uniform and a set of crayons are likely to
buy a school bag.

The rule {Crayons} → {Bag, Uniform} does not qualify: crayons appear
in 5 of the 10 transactions and only 3 of those also contain a bag and
a uniform, giving a confidence of 0.6, which is below minconf.
Generalized Association Rules
Since customers can buy any number of items in one transaction,
the transaction relation would be in the form of a list of individual
purchases.


Bill No.          Date             Item
15563             23.10.2003       Books
15563             23.10.2003       Crayons
15564             23.10.2003       Uniform
15564             23.10.2003       Crayons
Generalized Association Rules
A transaction for the purposes of data mining is obtained by
performing a GROUP BY of the table over various fields.



Bill No.         Date             Item
15563            23.10.2003       Books
15563            23.10.2003       Crayons
15564            23.10.2003       Uniform
15564            23.10.2003       Crayons
Generalized Association Rules
A GROUP BY over Bill No. would show frequent buying patterns
across different customers.
A GROUP BY over Date would show frequent buying patterns
across different days.

       Bill No.       Date           Item
       15563          23.10.2003     Books
       15563          23.10.2003     Crayons
       15564          23.10.2003     Uniform
       15564          23.10.2003     Crayons
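The effect of grouping the purchase list by different fields can be illustrated with a short Python sketch. The ROWS list copies the four rows shown above; the function name group_items and the tuple layout (bill number, date, item) are assumptions made for the example.

```python
from collections import defaultdict

# The purchase list from the slide: (bill number, date, item).
ROWS = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]


def group_items(rows, key_index):
    """Collect the items of all rows sharing the same value in column
    `key_index`; each group is one transaction for mining purposes."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)


print(group_items(ROWS, 0))  # one transaction per bill
print(group_items(ROWS, 1))  # one transaction per day
```

Grouping by bill number yields two transactions, while grouping by date merges all four purchases into a single transaction, so the choice of grouping field decides what a "frequent pattern" means.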
Classification and Clustering
Given a set of data elements:

   Classification maps each data element to one of a set of
   pre-determined classes based on the difference among
   data elements belonging to different classes

   Clustering groups data elements into different groups
   based on the similarity between elements within a single
   group
Classification Techniques
Decision Tree Identification

Classification problem: Weather → Play (Yes, No)

Outlook       Temp      Play?
Sunny         30        Yes
Overcast      15        No
Sunny         16        Yes
Cloudy        27        Yes
Overcast      25        Yes
Overcast      17        No
Cloudy        17        No
Cloudy        35        Yes
Classification Techniques
Hunt’s method for decision tree identification:

Given N element types and m decision classes:
1. For i ← 1 to N do
   1. Add element i to the (i-1)-element item-sets from the
      previous iteration
   2. Identify the set of decision classes for each item-set
   3. If an item-set has only one decision class, then that
      item-set is done; remove that item-set from subsequent
      iterations
2. done
Classification Techniques
Decision Tree Identification Example

Outlook    Temp     Play?
Sunny      Warm     Yes
Overcast   Chilly   No
Sunny      Chilly   Yes
Cloudy     Pleasant Yes
Overcast   Pleasant Yes
Overcast   Chilly   No
Cloudy     Chilly   No
Cloudy     Warm     Yes

Decision classes after considering Outlook:
   Sunny      Yes
   Cloudy     Yes/No
   Overcast   Yes/No
Classification Techniques
Decision Tree Identification Example

Adding Temp to the undecided item-set Cloudy:
   Cloudy, Warm       Yes
   Cloudy, Chilly     No
   Cloudy, Pleasant   Yes
Classification Techniques
Decision Tree Identification Example

Adding Temp to the undecided item-set Overcast:
   Overcast, Warm       (no rows in the table)
   Overcast, Chilly     No
   Overcast, Pleasant   Yes
Classification Techniques
Decision Tree Identification Example

Yes/No
   Cloudy: Yes/No
      Warm: Yes
      Chilly: No
      Pleasant: Yes
   Sunny: Yes
   Overcast: Yes/No
      Chilly: No
      Pleasant: Yes
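The tree above can be reproduced with a short Python sketch of the attribute-by-attribute procedure. This is one possible reading of the method, under the assumption that attributes are considered in column order (Outlook, then Temp); the names ROWS and identify_rules are hypothetical.

```python
from collections import defaultdict

# The example table: (Outlook, Temp, Play?).
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]


def identify_rules(rows, n_attributes):
    """Map attribute-value prefixes to a decision, one attribute at a time."""
    decided = {}          # prefix tuple -> its single decision class
    pending = list(rows)  # rows whose prefix is still ambiguous
    for i in range(1, n_attributes + 1):
        # Collect the decision classes seen for each i-attribute prefix.
        classes = defaultdict(set)
        for row in pending:
            classes[row[:i]].add(row[-1])
        # A prefix with exactly one class is done and leaves later rounds.
        for prefix, seen in classes.items():
            if len(seen) == 1:
                decided[prefix] = next(iter(seen))
        pending = [row for row in pending if row[:i] not in decided]
    return decided, pending


decided, pending = identify_rules(ROWS, 2)
for prefix, decision in decided.items():
    print(prefix, "->", decision)
```

Any rows left in pending after the last attribute correspond to item-sets that never reach a single decision class.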
Classification Techniques
Decision Tree Identification Example

• Top down technique for decision tree identification

• Decision tree created is sensitive to the order in which
items are considered

• If an N-item-set does not result in a clear decision,
classification classes have to be modeled by rough sets.
Other Classification Algorithms

Quinlan’s depth-first strategy builds the decision tree in a
depth-first fashion, by considering all possible tests that give a
decision and selecting the test that gives the best information
gain. It hence eliminates tests that are inconclusive.

SLIQ (Supervised Learning in Quest), developed in the
QUEST project of IBM, uses a top-down breadth-first strategy
to build a decision tree. At each level in the tree, an entropy
value of each node is calculated, and the nodes having the lowest
entropy values are selected and expanded.
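Both strategies rest on entropy. As a minimal sketch, the following Python computes the entropy of the Play? column and the information gain of splitting on each attribute, using the weather table from the decision tree example. The names ROWS, entropy and information_gain are illustrative assumptions, not taken from either algorithm's published code.

```python
from collections import Counter
from math import log2

# The example table: (Outlook, Temp, Play?).
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attribute_index):
    """Entropy of the labels minus the weighted entropy after splitting
    the rows on the attribute in column `attribute_index`."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attribute_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attribute_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


print("gain(Outlook) =", information_gain(ROWS, 0))
print("gain(Temp)    =", information_gain(ROWS, 1))
```

On this table Temp gives the larger gain, so a gain-driven builder would test Temp first.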
Clustering Techniques
Clustering partitions the data set into clusters or equivalence
classes.

Similarity among members of a class is greater than similarity
among members across classes.


Similarity measures: Euclidean distance or other
application-specific measures.
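Euclidean Distance for Tables
[Figure: rows plotted as points on a grid with Outlook (Sunny, Cloudy,
Overcast) on one axis and Temp (Warm, Pleasant, Chilly) on the other,
e.g. (Cloudy, Pleasant, Play) and (Overcast, Chilly, Don’t Play); the
points fall into a “Play” region and a “Don’t Play” region.]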
Clustering Techniques
General Strategy:

1. Draw a graph connecting items which are close to one
   another with edges.

2. Partition the graph into maximally connected
   subcomponents.
   1. Construct an MST for the graph
   2. Merge items that are connected by the minimum
       weight of the MST into a cluster
Clustering Techniques
Clustering types:

Hierarchical clustering: Clusters are formed at different
   levels by merging clusters at a lower level


Partitional clustering: Clusters are formed at only one level
Clustering Techniques
Nearest Neighbour Clustering Algorithm:

Given n elements x1, x2, … xn, and a threshold t:
1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   1. Find the nearest neighbour of xj
   2. Let the nearest neighbour be in cluster m
   3. If distance to nearest neighbour > t, then create a new
       cluster and k ← k+1; else assign xj to cluster m
   4. j ← j+1
3. until j > n
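The steps above might be written in Python as shown below. The sketch assumes that elements are numeric tuples compared by Euclidean distance, that the nearest neighbour is searched only among elements already placed in a cluster, and that the first element starts the first cluster. The function name nearest_neighbour_clusters is hypothetical.

```python
from math import dist


def nearest_neighbour_clusters(points, t):
    """Return one cluster index per point, in input order."""
    labels = []  # labels[i] is the cluster index of points[i]
    k = 0        # number of clusters created so far
    for j, point in enumerate(points):
        if j == 0:
            # Assumption: the first element starts the first cluster.
            labels.append(k)
            k += 1
            continue
        # Nearest neighbour among the elements already clustered.
        nearest = min(range(j), key=lambda i: dist(point, points[i]))
        if dist(point, points[nearest]) > t:
            labels.append(k)  # too far away: open a new cluster
            k += 1
        else:
            labels.append(labels[nearest])
    return labels


points = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
print(nearest_neighbour_clusters(points, 1.5))
```

The result depends on the order in which the elements are read, since each element only sees the ones before it.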
Clustering Techniques
Iterative partitional clustering:

Given n elements x1, x2, … xn, and k clusters, each with a
   center.
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the
   centroid of each cluster
3. Repeat the above two steps with the new centroids until
   the algorithm converges
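The three steps above correspond closely to the familiar k-means procedure. A minimal Python sketch follows, assuming numeric tuples, Euclidean distance, caller-supplied initial centers, and convergence defined as "the centers stop moving". The name iterative_partition and the max_rounds safety cap are assumptions added for the example.

```python
from math import dist


def iterative_partition(points, centers, max_rounds=100):
    """Return (final centers, clusters) after iterating to convergence."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_rounds):
        # Step 1: assign each element to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: recompute centroids (an empty cluster keeps its center).
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centers)
        ]
        # Step 3: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers, clusters = iterative_partition(points, [(0, 0), (10, 10)])
print(centers)
print(clusters)
```

The outcome depends on the initial centers, so in practice the procedure is often run from several starting points.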
Mining Sequence Data
Characteristics of Sequence Data:

• Collection of data elements which are ordered sequences

• In a sequence, each item has an index associated with it

• A k-sequence is a sequence of length k. Support for a k-sequence
s is the number of m-sequences (m >= k) in the data which contain s
as a subsequence; it is often quoted as a fraction of all sequences

• Sequence data: transaction logs, DNA sequences, patient
ailment history, …
Mining Sequence Data
Some Definitions:

• A sequence is a list of itemsets of finite length.
• Example:
    • {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
    • … the purchases of a single customer over time…

• The order of items within an itemset does not matter, but the
order of itemsets matters
• A subsequence is a sequence with some itemsets deleted
Mining Sequence Data
Some Definitions:

• A sequence S’ = {a1, a2, …, am} is said to be contained
within another sequence S, if S contains a subsequence {b1, b2,
… bm} such that a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm.

• Hence, {pen}{pencil}{ruler,pencil} is contained in
{pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
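The containment test defined above can be sketched in Python with sequences represented as lists of sets. The function name is_contained is a hypothetical choice; the greedy left-to-right matching is one standard way to test for such a subsequence.

```python
def is_contained(small, big):
    """True if `small` is contained in `big`: some subsequence of `big`
    has itemsets that are supersets of the itemsets of `small`, in order."""
    position = 0
    for itemset in small:
        # Take the earliest remaining itemset of `big` that covers it.
        while position < len(big) and not set(itemset) <= set(big[position]):
            position += 1
        if position == len(big):
            return False
        position += 1
    return True


history = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
           {"ink", "eraser"}, {"ruler", "pencil"}]
pattern = [{"pen"}, {"pencil"}, {"ruler", "pencil"}]
print(is_contained(pattern, history))
```

Matching each itemset as early as possible never rules out a later match, which is why a single left-to-right pass is enough.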
Mining Sequence Data
Apriori Algorithm for Sequences:

1. L1 ← Set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   1. Generate all candidate (k+1)-sequences
   2. Lk+1 ← Set of all interesting (k+1)-sequences
   3. k ← k+1
4. done
Mining Sequence Data
Generating Candidate Sequences:

Given L1, L2, … Lk, candidate sequences of Lk+1 are generated
   as follows:

For each sequence s in Lk, concatenate s with all new 1-
   sequences found while generating Lk-1
Mining Sequence Data
Example:
Sequences:
   abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde

minsup = 0.5

Interesting 1-sequences:
   a, b, d, e

Candidate 2-sequences:
   aa, ab, ad, ae
   ba, bb, bd, be
   da, db, dd, de
   ea, eb, ed, ee
Mining Sequence Data
Example:
minsup = 0.5 (same ten sequences)

Interesting 2-sequences:
   ab, bd

Candidate 3-sequences:
   aba, abb, abd, abe,
   aab, bab, dab, eab,
   bda, bdb, bdd, bde,
   bbd, dbd, ebd.

Interesting 3-sequences = {}
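The worked example can be replayed with a short Python sketch. It assumes that each character is a single-item element, that a pattern is supported by a sequence when its symbols occur in order with gaps allowed, and that candidates are formed by adding one interesting symbol at either end of an interesting sequence. The names SEQUENCES, contains, support and mine_sequences are hypothetical.

```python
# The ten sequences of the example.
SEQUENCES = ["abcde", "bdae", "aebd", "be", "eabda",
             "aaaa", "baaa", "cbdb", "abbab", "abde"]


def contains(sequence, pattern):
    """True if the symbols of `pattern` occur in `sequence` in order
    (not necessarily adjacent)."""
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the match.
    return all(symbol in remaining for symbol in pattern)


def support(pattern, sequences):
    """Fraction of sequences that contain `pattern`."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def mine_sequences(sequences, minsup):
    """Return [L1, L2, ...], the interesting sequences of each length."""
    symbols = sorted({ch for s in sequences for ch in s})
    ones = [x for x in symbols if support(x, sequences) >= minsup]
    level, levels = ones, []
    while level:
        levels.append(level)
        # Candidates: extend by one interesting symbol at the end or front.
        candidates = ({s + x for s in level for x in ones}
                      | {x + s for s in level for x in ones})
        level = sorted(c for c in candidates
                       if support(c, sequences) >= minsup)
    return levels


for k, level in enumerate(mine_sequences(SEQUENCES, 0.5), start=1):
    print(k, level)
```

With gaps allowed, this reading reproduces the interesting 1-sequences and 2-sequences listed in the example and finds no interesting 3-sequences.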
Mining Sequence Data
Language Inference:

Given a set of sequences, consider each sequence as the
behavioural trace of a machine, and infer the machine that can
display the given sequence as behavior.

   Input set of sequences (aabb, ababcac, abbac, …)  →  Output state machine
Mining Sequence Data
• Inferring the syntax of a language given
  its sentences
• Applications: discerning behavioural
  patterns, emergent properties
  discovery, collaboration modeling, …
• State machine discovery is the reverse
  of state machine construction
• Discovery is “maximalist” in nature…
Mining Sequence Data
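“Maximal” nature of language inference:

Input sequences: abc, aabc, aabbc, abbc

[Figure: two state machines that both accept the input. The “most
general” state machine is a single state with a self-loop labelled
a,b,c. The “most specific” state machine is a larger machine built
from individual a, b and c transitions.]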
Mining Sequence Data
“Shortest-run Generalization”      (Srinivasa and Spiliopoulou 2000)

Given a set of n sequences:
1. Create a state machine for the first sequence
2. for j ← 2 to n do
   1. Create a state machine for the jth sequence
   2. Merge this sequence into the earlier sequence as follows:
       1. Merge all halt states in the new state machine to the
           halt state in the existing state machine
       2. If two or more paths to the halt state share the same
           suffix, merge the suffixes together into a single path
3. Done
Mining Sequence Data
“Shortest-run Generalization”        (Srinivasa and Spiliopoulou 2000)


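Input sequences: aabcb, aac, aabc

[Figure: the state machine a → a → b → c → b built from aabcb, followed
by the machines obtained after merging in aac and then aabc.]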
Mining Streaming Data
Characteristics of streaming data:

• Large data sequence

• No storage

• Often an infinite sequence

• Examples: Stock market quotes, streaming audio/video,
network traffic
Mining Streaming Data
Running mean:

Let n = number of items read so far,

   avg = running average calculated so far.


On reading the next number num:

     avg ← (n*avg + num) / (n+1)
       n ← n+1
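The update rule above translates directly into Python. The class name RunningMean is a hypothetical choice for this sketch; only n and avg are stored, so no past items need to be kept.

```python
class RunningMean:
    """Keeps the average of a stream without storing the stream."""

    def __init__(self):
        self.n = 0      # number of items read so far
        self.avg = 0.0  # running average calculated so far

    def update(self, num):
        # avg <- (n*avg + num) / (n+1), then n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg


mean = RunningMean()
for value in [2, 4, 9]:
    mean.update(value)
print(mean.avg)
```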
Mining Streaming Data
Running variance:

      var = ∑(num - avg)²

          = ∑num² - 2*∑num*avg + ∑avg²

Let A = ∑num² of all numbers read so far
    B = 2*∑num*avg of all numbers read so far
    C = ∑avg² of all numbers read so far
  avg = average of numbers read so far
    n = number of numbers read so far
Mining Streaming Data
Running variance:

On reading the next number num:

avg ← (avg*n + num) / (n+1)
 n ← n+1

A ← A + num²
B ← B + 2*avg*num
C ← C + avg²

var = A - B + C
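As a hedged sketch of the same idea, the Python below keeps ∑num and ∑num² as running totals and substitutes the current average into the expansion ∑(num - avg)² = ∑num² - 2*avg*∑num + n*avg² whenever a value is requested. This is a variation on the bookkeeping above, not the slide's exact procedure: it applies the latest average to every term, so the expansion holds exactly. Dividing by n to obtain a population variance is an added assumption, as is the class name RunningVariance.

```python
class RunningVariance:
    """Sum of squared deviations of a stream, from running totals only."""

    def __init__(self):
        self.n = 0           # number of numbers read so far
        self.total = 0.0     # sum of num
        self.total_sq = 0.0  # A = sum of num squared

    def update(self, num):
        self.n += 1
        self.total += num
        self.total_sq += num * num

    def sum_sq_dev(self):
        """Sum of (num - avg)^2 using the current average for every term."""
        avg = self.total / self.n
        return self.total_sq - 2 * avg * self.total + self.n * avg * avg

    def variance(self):
        """Population variance: sum of squared deviations divided by n."""
        return self.sum_sq_dev() / self.n


rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.update(value)
print(rv.sum_sq_dev(), rv.variance())
```

Subtracting two large, nearly equal totals can lose precision on long streams of large values, so this form is best suited to illustration.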
Mining Streaming Data
γ-Consistency:    (Srinivasa and Spiliopoulou, CoopIS 1999)

Let streaming data be in the form of “frames” where each
frame comprises one or more data elements.

Support for data element k within a frame is defined as
(#occurrences of k)/(#elements in frame)

γ-Consistency for data element k is the “sustained” support
for k over all frames read so far, with a “leakage” of (1- γ)
Mining Streaming Data
γ-Consistency:   (Srinivasa and Spiliopoulou, CoopIS 1999)
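[Figure: the level for data element k receives γ*sup(k) from each
incoming frame and is subject to a leakage labelled (1-γ).]

       level_t(k) = (1-γ)*level_{t-1}(k) + γ*sup(k)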
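The update formula above can be applied frame by frame, as in the following Python sketch. It assumes that every level starts at 0 and that an element absent from a frame has support 0 in that frame. The function names frame_support and update_levels are hypothetical.

```python
from collections import Counter


def frame_support(frame):
    """Support of each element in one frame: occurrences / frame size."""
    counts = Counter(frame)
    return {k: c / len(frame) for k, c in counts.items()}


def update_levels(levels, frame, gamma):
    """Apply level_t(k) = (1-gamma)*level_{t-1}(k) + gamma*sup(k)
    to every element seen so far."""
    sup = frame_support(frame)
    keys = set(levels) | set(sup)
    return {k: (1 - gamma) * levels.get(k, 0.0) + gamma * sup.get(k, 0.0)
            for k in keys}


levels = {}
for frame in ["aab", "ab"]:  # each string is one frame of elements
    levels = update_levels(levels, frame, gamma=0.5)
print(levels)
```

Only one number per element is stored, which suits a stream that cannot be kept in full.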
Data Warehousing
• A platform for online analytical processing (OLAP)
• Warehouses collect transactional data from several
  transactional databases and organize them in a fashion
  amenable to analysis
• Also called “data marts”
• A critical component of the decision support system (DSS) of
  enterprises
• Some typical DW queries:
   – Which item sells best in each region that has retail outlets?
   – Which advertising strategy is best for South India?
   – Which (age_group/occupation) in South India likes fast
      food, and which (age_group/occupation) likes to cook?
Data Warehousing
[Figure: OLTP sources (Order Processing, Inventory, Sales) feed through
Data Cleaning into the Data Warehouse (OLAP).]
OLTP vs OLAP
  Transactional Data (OLTP)        Analysis Data (OLAP)
  Small or medium size databases   Very large databases
  Transient data                   Archival data
  Frequent insertions and updates  Infrequent updates
  Small query shadow               Very large query shadow
  Normalization important to       De-normalization important to
  handle updates                   handle queries
Data Cleaning
• Performs logical transformation of
  transactional data to suit the data
  warehouse
• Model of operations → model of
  enterprise
• Usually a semi-automatic process
Data Cleaning
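[Figure: operational tables are transformed into the warehouse model.
Source tables: Orders (Order_id, Price, Cust_id), Inventory (Prod_id,
Price, Price_chng), Sales (Cust_id, Cust_prof, Tot_sales).
Data Warehouse: Customers, Products, Orders, Inventory, Price, Time.]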
Multi-dimensional Data Model
[Figure: a multi-dimensional space whose axes include Price, Products,
Orders, Customers and Time (Jan’01, Jun’01, Jan’02, Jun’02).]
Some MDBMS Operations
• Roll-up
  – Collapse dimensions
• Drill-down
  – Add dimensions
• Vector-distance operations (ex:
  clustering)
• Vector space browsing
Star Schema

[Figure: a central fact table linked to four surrounding dimension
tables (each labelled “Dim Tbl_1” in the slide).]
WWW Based References
•   http://www.kdnuggets.com/
•   http://www.megaputer.com/
•   http://www.almaden.ibm.com/cs/quest/index.html
•   http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
•   http://www.cs.su.oz.au/~thierry/ckdd.html
•   http://www.dwinfocenter.org/
•   http://datawarehouse.itoolbox.com/
•   http://www.knowledgestorm.com/
•   http://www.bitpipe.com/
•   http://www.dw-institute.com/
•   http://www.datawarehousing.com/
References
•    R. Agrawal, R. Srikant: ``Fast Algorithms for Mining
     Association Rules'', Proc. of the 20th Int'l Conference on
     Very Large Databases, Santiago, Chile, Sept. 1994.
•    R. Agrawal, R. Srikant, ``Mining Sequential Patterns'', Proc.
     of the Int'l Conference on Data Engineering (ICDE), Taipei,
     Taiwan, March 1995.
•    R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R.
     Srikant: "The Quest Data Mining System", Proc. of the 2nd
     Int'l Conference on Knowledge Discovery in Databases and
     Data Mining, Portland, Oregon, August, 1996.
•    Surajit Chaudhuri, Umesh Dayal. An Overview of Data
     Warehousing and OLAP Technology. ACM SIGMOD Record.
     26(1), March 1997.
•    Jennifer Widom. Research Problems in Data Warehousing.
     Proc. of Int’l Conf. On Information and Knowledge
     Management, 1995.
References
•    A. Shoshani. OLAP and Statistical Databases: Similarities and
     Differences. Proc. of ACM PODS 1997.
•    Panos Vassiliadis, Timos Sellis. A Survey on Logical Models
     for OLAP Databases. ACM SIGMOD Record
•    M. Gyssens, Laks VS Lakshmanan. A Foundation for Multi-
     Dimensional Databases. Proc of VLDB 1997, Athens, Greece.
•    Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions
     Based on Consistent Patterns. Proc. of CoopIS 1999,
     Edinburgh, UK.
•    Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral
     Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000,
     Como, Italy.

Mais conteúdo relacionado

Último

Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxElton John Embodo
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsRommel Regala
 

Data mining, data warehousing and knowledge discovery

  • 1. Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts Srinath Srinivasa IIIT Bangalore sri@iiitb.ac.in
  • 2. Overview • Why Data Mining? • Data Mining concepts • Data Mining algorithms – Tabular data mining – Association, Classification and Clustering – Sequence data mining – Streaming data mining • Data Warehousing concepts
  • 3. Why Data Mining From a managerial perspective: Analyzing trends Wealth generation Security Strategic decision making
  • 4. Data Mining • Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data • No Query… • …But “interestingness criteria”
  • 5. Data Mining: Data + Interestingness criteria = Hidden patterns
  • 6. Data Mining: Data + Interestingness criteria = Hidden patterns (annotation on the result: type of patterns)
  • 7. Data Mining: Data + Interestingness criteria = Hidden patterns (annotations on the inputs: type of data, type of interestingness criteria)
  • 8. Type of Data • Tabular (Ex: Transaction data) – Relational – Multi-dimensional • Spatial (Ex: Remote sensing data) • Temporal (Ex: Log information) – Streaming (Ex: multimedia, network traffic) – Spatio-temporal (Ex: GIS) • Tree (Ex: XML data) • Graphs (Ex: WWW, BioMolecular data) • Sequence (Ex: DNA, activity logs) • Text, Multimedia …
  • 9. Type of Interestingness • Frequency • Rarity • Correlation • Length of occurrence (for sequence and temporal data) • Consistency • Repeating / periodicity • “Abnormal” behavior • Other patterns of interestingness…
  • 10. Data Mining vs Statistical Inference Statistics: Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
  • 11. Data Mining vs Statistical Inference Data mining: Data → Mining Algorithm Based on Interestingness → Pattern (model, rule, hypothesis) discovery
  • 12. Data Mining Concepts Associations and Item-sets: An association is a rule of the form: if X then Y. It is denoted as X → Y. Example: If India wins in cricket, sales of sweets go up. For any rule, if both X → Y and Y → X hold, then X and Y are called an “interesting item-set”. Example: People buying school uniforms in June also buy school bags (and people buying school bags in June also buy school uniforms).
  • 13. Data Mining Concepts Support and Confidence: The support for a rule R is the ratio of the number of transactions in which R occurs to the total number of transactions. The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
  • 14. Data Mining Concepts Support and Confidence: Support for {Bag, Uniform} = 5/10 = 0.5. Confidence for Bag → Uniform = 5/8 = 0.625.
      Transactions:
      Bag, Uniform, Crayons
      Books, Bag, Uniform
      Bag, Uniform, Pencil
      Bag, Pencil, Books
      Uniform, Crayons, Bag
      Bag, Pencil, Books
      Crayons, Uniform, Bag
      Books, Crayons, Bag
      Uniform, Crayons, Pencil
      Pencil, Uniform, Books
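The two ratios on this slide can be checked with a few lines of Python. This is a minimal sketch: the names `TRANSACTIONS`, `count`, `support` and `confidence` are illustrative and not from the slides, and each transaction is modelled as a set of item names.

```python
# The ten transactions listed on the slide, one set of items per transaction.
TRANSACTIONS = [set(t.split()) for t in (
    "Bag Uniform Crayons", "Books Bag Uniform", "Bag Uniform Pencil",
    "Bag Pencil Books", "Uniform Crayons Bag", "Bag Pencil Books",
    "Crayons Uniform Bag", "Books Crayons Bag", "Uniform Crayons Pencil",
    "Pencil Uniform Books",
)]


def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Share of the transactions containing lhs that also contain rhs."""
    both = set(lhs) | set(rhs)
    return count(both, transactions) / count(lhs, transactions)


print(support({"Bag", "Uniform"}, TRANSACTIONS))        # 0.5
print(confidence({"Bag"}, {"Uniform"}, TRANSACTIONS))   # 0.625
```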
  • 15. Mining for Frequent Item-sets The Apriori Algorithm: Given minimum required support s as interestingness criterion:
      1. Search for all individual elements (1-element item-sets) that have a minimum support of s
      2. Repeat
         1. From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s
         2. This becomes the set of all frequent (i+1)-element item-sets that are interesting
      3. Until item-set size reaches maximum
  • 16. Mining for Frequent Item-sets The Apriori Algorithm: (Example) Let minimum support = 0.3 (transactions as on slide 14).
      Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
      Interesting 2-element item-sets: {Bag,Uniform} {Bag,Crayons} {Bag,Pencil} {Bag,Books} {Uniform,Crayons} {Uniform,Pencil} {Pencil,Books}
  • 17. Mining for Frequent Item-sets The Apriori Algorithm: (Example) Let minimum support = 0.3 (transactions as on slide 14).
      Interesting 3-element item-sets: {Bag,Uniform,Crayons}
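The level-wise search in the Apriori slides can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function name `apriori` and the join-and-prune candidate generation are assumptions about how the "search for all (i+1)-element item-sets" step might be realized.

```python
from itertools import combinations

# The ten transactions from the worked example.
TRANSACTIONS = [frozenset(t.split()) for t in (
    "Bag Uniform Crayons", "Books Bag Uniform", "Bag Uniform Pencil",
    "Bag Pencil Books", "Uniform Crayons Bag", "Bag Pencil Books",
    "Crayons Uniform Bag", "Books Crayons Bag", "Uniform Crayons Pencil",
    "Pencil Uniform Books",
)]


def apriori(transactions, minsup):
    """Return a list of levels; levels[i] holds the frequent (i+1)-item-sets."""
    n = len(transactions)

    def frequent(candidates):
        # Keep only candidates whose support reaches minsup.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= minsup}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    levels = []
    while level:
        levels.append(level)
        size = len(next(iter(level))) + 1
        # Join: unions of two frequent item-sets that have exactly `size` items.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune: every (size-1)-element subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        level = frequent(candidates)
    return levels


levels = apriori(TRANSACTIONS, 0.3)
for i, level in enumerate(levels, start=1):
    print(i, sorted(sorted(s) for s in level))
```

On the example data the third level contains only {Bag, Uniform, Crayons}, and the search stops because no 4-element candidate can be formed.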
  • 18. Mining for Association Rules Association rules are of the form A → B, which are directional… Association rule mining requires two thresholds: minsup and minconf (transactions as on slide 14).
  • 19. Mining for Association Rules Mining association rules using apriori General Procedure:
      1. Use apriori to generate frequent itemsets of different sizes
      2. At each iteration divide each frequent itemset X into two parts LHS and RHS. This represents a rule of the form LHS → RHS
      3. The confidence of such a rule is support(X)/support(LHS)
      4. Discard all rules whose confidence is less than minconf.
  • 20. Mining for Association Rules Mining association rules using apriori Example: The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. This can be divided into the following rules:
      {Bag} → {Uniform, Crayons}
      {Bag, Uniform} → {Crayons}
      {Bag, Crayons} → {Uniform}
      {Uniform} → {Bag, Crayons}
      {Uniform, Crayons} → {Bag}
      {Crayons} → {Bag, Uniform}
  • 21. Mining for Association Rules Mining association rules using apriori Confidence for these rules (transactions as on slide 14):
      {Bag} → {Uniform, Crayons}      0.375
      {Bag, Uniform} → {Crayons}      0.6
      {Bag, Crayons} → {Uniform}      0.75
      {Uniform} → {Bag, Crayons}      0.428
      {Uniform, Crayons} → {Bag}      0.75
      {Crayons} → {Bag, Uniform}      0.6 (3 of the 5 transactions containing Crayons)
      If minconf is 0.7, then we have discovered the following rules…
  • 22. Mining for Association Rules Mining association rules using apriori
      People who buy a school bag and a set of crayons are likely to buy school uniform.
      People who buy school uniform and a set of crayons are likely to buy a school bag.
      The rule “people who buy just a set of crayons are likely to buy a school bag and school uniform as well” falls short of minconf, since its confidence is 0.6.
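The rule-generation procedure (split a frequent itemset into LHS and RHS, compute support(X)/support(LHS), discard rules below minconf) can be sketched as below. The helper names `count` and `rules_from` are illustrative. On the ten example transactions Crayons occurs five times, so this sketch reports 3/5 = 0.6 for {Crayons} → {Bag, Uniform}.

```python
from itertools import combinations

# The ten transactions from the worked example.
TRANSACTIONS = [frozenset(t.split()) for t in (
    "Bag Uniform Crayons", "Books Bag Uniform", "Bag Uniform Pencil",
    "Bag Pencil Books", "Uniform Crayons Bag", "Bag Pencil Books",
    "Crayons Uniform Bag", "Books Crayons Bag", "Uniform Crayons Pencil",
    "Pencil Uniform Books",
)]


def count(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def rules_from(itemset, minconf):
    """All rules LHS -> RHS from one frequent itemset with confidence >= minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            # support(X)/support(LHS) reduces to a ratio of counts.
            conf = count(itemset) / count(lhs)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


for lhs, rhs, conf in rules_from({"Bag", "Uniform", "Crayons"}, 0.7):
    print(sorted(lhs), "->", sorted(rhs), conf)
```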
  • 23. Generalized Association Rules Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases.
      Bill No.   Date         Item
      15563      23.10.2003   Books
      15563      23.10.2003   Crayons
      15564      23.10.2003   Uniform
      15564      23.10.2003   Crayons
  • 24. Generalized Association Rules A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields (same table as slide 23).
  • 25. Generalized Association Rules A GROUP BY over Bill No. would show frequent buying patterns across different customers. A GROUP BY over Date would show frequent buying patterns across different days (same table as slide 23).
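The effect of the two GROUP BY choices can be illustrated in plain Python. This is a sketch only: `PURCHASES` holds the four rows from the slide as tuples, and `group_items` is a hypothetical helper that plays the role of the GROUP BY.

```python
from collections import defaultdict

# (bill number, date, item) rows from the slide.
PURCHASES = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]


def group_items(rows, key_index):
    """Collect the items of all rows that share the same value in column key_index."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)


by_bill = group_items(PURCHASES, 0)   # one transaction per bill
by_date = group_items(PURCHASES, 1)   # one transaction per day
print(by_bill)
print(by_date)
```

Grouping by bill yields two transactions, while grouping by date merges all four purchases into a single transaction for that day.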
  • 26. Classification and Clustering Given a set of data elements: Classification maps each data element to one of a set of pre-determined classes based on the difference among data elements belonging to different classes Clustering groups data elements into different groups based on the similarity between elements within a single group
  • 27. Classification Techniques Decision Tree Identification Classification problem: Weather → Play(Yes,No)
      Outlook    Temp   Play?
      Sunny      30     Yes
      Overcast   15     No
      Sunny      16     Yes
      Cloudy     27     Yes
      Overcast   25     Yes
      Overcast   17     No
      Cloudy     17     No
      Cloudy     35     Yes
  • 28. Classification Techniques Hunt’s method for decision tree identification: Given N element types and m decision classes:
      1. For i ← 1 to N do
         1. Add element i to the (i-1)-element item-sets from the previous iteration
         2. Identify the set of decision classes for each item-set
         3. If an item-set has only one decision class, then that item-set is done; remove that item-set from subsequent iterations
      2. done
  • 29. Classification Techniques Decision Tree Identification Example
      Outlook    Temp       Play?
      Sunny      Warm       Yes
      Overcast   Chilly     No
      Sunny      Chilly     Yes
      Cloudy     Pleasant   Yes
      Overcast   Pleasant   Yes
      Overcast   Chilly     No
      Cloudy     Chilly     No
      Cloudy     Warm       Yes
      Splitting on Outlook: Sunny → Yes; Cloudy → Yes/No; Overcast → Yes/No
  • 30. Classification Techniques Decision Tree Identification Example (same table and first-level split as slide 29)
  • 31. Classification Techniques Decision Tree Identification Example Splitting Cloudy on Temp: Cloudy, Warm → Yes; Cloudy, Chilly → No; Cloudy, Pleasant → Yes
  • 32. Classification Techniques Decision Tree Identification Example Splitting Overcast on Temp: Overcast, Chilly → No; Overcast, Pleasant → Yes (the table has no Overcast, Warm rows)
  • 33. Classification Techniques Decision Tree Identification Example Resulting tree:
      Root (Yes/No)
        Sunny → Yes
        Cloudy (Yes/No): Warm → Yes; Pleasant → Yes; Chilly → No
        Overcast (Yes/No): Chilly → No; Pleasant → Yes
  • 34. Classification Techniques Decision Tree Identification Example • Top down technique for decision tree identification • Decision tree created is sensitive to the order in which items are considered • If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.
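The tree from the example can be reproduced with a short recursive sketch. Note the assumptions: `build_tree` is a hypothetical helper, it recurses depth-first rather than iterating over item-set sizes as in the method's description, and the attribute order (Outlook, then Temp) is fixed by the caller, which is exactly the order sensitivity the slide mentions.

```python
# (Outlook, Temp, Play?) rows from the example table.
ROWS = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]


def build_tree(rows, attrs):
    """Split rows on the attribute columns in attrs, in order, until pure."""
    classes = {r[-1] for r in rows}
    if len(classes) == 1:
        return classes.pop()        # only one decision class: this branch is done
    if not attrs:
        return sorted(classes)      # no attribute left: decision stays ambiguous
    first, rest = attrs[0], attrs[1:]
    tree = {}
    for value in sorted({r[first] for r in rows}):
        subset = [r for r in rows if r[first] == value]
        tree[value] = build_tree(subset, rest)
    return tree


tree = build_tree(ROWS, [0, 1])     # column 0 = Outlook, column 1 = Temp
print(tree)
```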
  • 35. Other Classification Algorithms Quinlan’s depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive. SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value of each node is calculated, and the nodes having the lowest entropy values are selected and expanded.
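As an illustration of the entropy and information-gain measures mentioned here, the sketch below evaluates both candidate tests on the eight-row Outlook/Temp/Play table from the decision tree example. The function names are illustrative, and the standard Shannon entropy with base-2 logarithms is assumed.

```python
from collections import Counter
from math import log2

# (Outlook, Temp, Play?) rows from the decision tree example.
ROWS = [
    ("Sunny", "Warm", "Yes"), ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"), ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"), ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"), ("Cloudy", "Warm", "Yes"),
]


def entropy(rows):
    """Shannon entropy (in bits) of the class label in the last column."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(rows, column):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in {r[column] for r in rows}:
        subset = [r for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gain_outlook = information_gain(ROWS, 0)
gain_temp = information_gain(ROWS, 1)
print(round(gain_outlook, 3), round(gain_temp, 3))
```

On this table the test on Temp yields the larger gain, so a gain-driven strategy would test Temp first.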
  • 36. Clustering Techniques Clustering partitions the data set into clusters or equivalence classes. Similarity among members of a class is greater than similarity among members across classes. Similarity measures: Euclidean distance or other application-specific measures.
  • 37. Euclidean Distance for Tables [Figure: tuples plotted as points in a space with axes Outlook (Sunny, Cloudy, Overcast), Temp (Warm, Pleasant, Chilly) and Play (Play, Don’t Play); example points (Overcast, Chilly, Don’t Play) and (Cloudy, Pleasant, Play)]
  • 38. Clustering Techniques General Strategy: 1. Draw a graph connecting items which are close to one another with edges. 2. Partition the graph into maximally connected subcomponents. 1. Construct an MST for the graph 2. Merge items that are connected by the minimum weight of the MST into a cluster
  • 39. Clustering Techniques Clustering types: Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level Partitional clustering: Clusters are formed at only one level
  • 40. Clustering Techniques Nearest Neighbour Clustering Algorithm: Given n elements x1, x2, … xn, and threshold t:
      1. j ← 1, k ← 1, Clusters = {}
      2. Repeat
         1. Find the nearest neighbour of xj
         2. Let the nearest neighbour be in cluster m
         3. If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
         4. j ← j+1
      3. until j > n
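A minimal sketch of the nearest neighbour procedure follows. Two assumptions are made that the slide leaves open: the nearest neighbour is searched only among elements that already belong to a cluster, and the first element opens cluster 1. Points are tuples and distance is Euclidean (`math.dist`, Python 3.8+).

```python
import math


def nearest_neighbour_clusters(points, t):
    """Return one cluster label per point, in input order."""
    labels = []
    k = 0                                   # number of clusters created so far
    for j, p in enumerate(points):
        if j == 0:
            k = 1
            labels.append(k)                # first element starts cluster 1
            continue
        # Index of the nearest element among those already clustered.
        m = min(range(j), key=lambda i: math.dist(p, points[i]))
        if math.dist(p, points[m]) > t:
            k += 1
            labels.append(k)                # too far away: open a new cluster
        else:
            labels.append(labels[m])        # join the neighbour's cluster
    return labels


points = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
print(nearest_neighbour_clusters(points, t=2))
```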
  • 41. Clustering Techniques Iterative partitional clustering: Given n elements x1, x2, … xn, and k clusters, each with a center. 1. Assign each element to its closest cluster center 2. After all assignments have been made, compute the centroid of each cluster 3. Repeat the above two steps with the new centroids until the algorithm converges
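The assign-then-recompute loop can be sketched as below. This is an illustrative version under stated assumptions: initial centers are supplied by the caller, squared Euclidean distance is used, convergence means the centroids stop changing, and an empty cluster simply keeps its previous center. The function name `iterative_partition` and the `max_iter` cap are not from the slides.

```python
def iterative_partition(points, centers, max_iter=100):
    """Return (final centers, clusters) for points given as equal-length tuples."""
    clusters = []
    for _ in range(max_iter):
        # Step 1: assign each element to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # Step 2: recompute the centroid of each cluster.
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 3: stop once the centroids no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


centers, clusters = iterative_partition(
    [(1, 1), (1, 2), (8, 8), (9, 8)], [(0, 0), (10, 10)])
print(centers)
print(clusters)
```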
  • 42. Mining Sequence Data Characteristics of Sequence Data: • Collection of data elements which are ordered sequences • In a sequence, each item has an index associated with it • A k-sequence is a sequence of length k. Support for a k-sequence j is the number of m-sequences (m ≥ k) which contain j as a subsequence • Sequence data: transaction logs, DNA sequences, patient ailment history, …
  • 43. Mining Sequence Data Some Definitions: • A sequence is a list of itemsets of finite length. • Example: • {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil} • … the purchases of a single customer over time… • The order of items within an itemset does not matter, but the order of itemsets matters • A subsequence is a sequence with some itemsets deleted
  • 44. Mining Sequence Data Some Definitions: • A sequence S’ = {a1, a2, …, am} is said to be contained within another sequence S, if S contains a subsequence {b1, b2, … bm} such that a1 ⊆ b1, a2 ⊆ b2, …, am ⊆ bm. • Hence, {pen}{pencil}{ruler,pencil} is contained in {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
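The containment test can be written as a single left-to-right scan. This is a sketch: `is_contained` is an illustrative name, sequences are modelled as lists of Python sets, and each itemset of the smaller sequence is matched against the earliest remaining itemset of the larger one that is a superset of it.

```python
def is_contained(small, big):
    """True if small is contained in big in the sense defined above."""
    pos = 0
    for itemset in small:
        # Skip itemsets of big that do not cover the current itemset.
        while pos < len(big) and not set(itemset) <= set(big[pos]):
            pos += 1
        if pos == len(big):
            return False            # ran out of itemsets in big
        pos += 1                    # each itemset of big is used at most once
    return True


S = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
     {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], S))   # True
print(is_contained([{"pencil"}, {"pen"}], S))                        # False
```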
  • 45. Mining Sequence Data Apriori Algorithm for Sequences:
      1. L1 ← Set of all interesting 1-sequences
      2. k ← 1
      3. while Lk is not empty do
         1. Generate all candidate (k+1)-sequences
         2. Lk+1 ← Set of all interesting (k+1)-sequences
         3. k ← k+1
      4. done
  • 46. Mining Sequence Data Generating Candidate Sequences: Given L1, L2, … Lk, candidate sequences of Lk+1 are generated as follows: For each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1
  • 47. Mining Sequence Data Example: minsup = 0.5
      Sequences: abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde
      Interesting 1-sequences: a, b, d, e
      Candidate 2-sequences: aa, ab, ad, ae, ba, bb, bd, be, da, db, dd, de, ea, eb, ed, ee
  • 48. Mining Sequence Data Example: minsup = 0.5 (sequences as on slide 47)
      Interesting 2-sequences: ab, bd
      Candidate 3-sequences: aba, abb, abd, abe, aab, bab, dab, eab, bda, bdb, bdd, bde, bbd, dbd, ebd
      Interesting 3-sequences = {}
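The example can be replayed with the sketch below. Assumptions: each sequence is a string of single-character items, a pattern is supported by a sequence if it occurs as a (not necessarily contiguous) subsequence, and candidates are formed by appending or prepending an interesting 1-sequence to each interesting k-sequence. All function names are illustrative.

```python
# The ten sequences from the example.
SEQUENCES = ["abcde", "bdae", "aebd", "be", "eabda",
             "aaaa", "baaa", "cbdb", "abbab", "abde"]


def occurs_in(pattern, sequence):
    """True if pattern is a subsequence of sequence (gaps allowed)."""
    remaining = iter(sequence)
    # `ch in remaining` consumes the iterator up to the first match.
    return all(ch in remaining for ch in pattern)


def interesting(candidates, sequences, minsup):
    """Candidates supported by at least a minsup fraction of the sequences."""
    n = len(sequences)
    return {c for c in candidates
            if sum(occurs_in(c, s) for s in sequences) / n >= minsup}


def mine_sequences(sequences, minsup):
    """Return [L1, L2, ...], the interesting k-sequences for each k."""
    symbols = {ch for s in sequences for ch in s}
    l1 = interesting(symbols, sequences, minsup)
    levels, current = [], l1
    while current:
        levels.append(current)
        candidates = ({s + x for s in current for x in l1}
                      | {x + s for s in current for x in l1})
        current = interesting(candidates, sequences, minsup)
    return levels


levels = mine_sequences(SEQUENCES, 0.5)
print([sorted(level) for level in levels])
```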
  • 49. Mining Sequence Data Language Inference: Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequence as behavior. [Figure: input set of sequences (aabb, ababcac, abbac, …) → output state machine]
  • 50. Mining Sequence Data • Inferring the syntax of a language given its sentences • Applications: discerning behavioural patterns, emergent properties discovery, collaboration modeling, … • State machine discovery is the reverse of state machine construction • Discovery is “maximalist” in nature…
  • 51. Mining Sequence Data “Maximal” nature of language inference: [Figure: input sequences abc, aabc, aabbc, abbc, shown with a “most general” state machine (transitions labelled a, b, c) and a “most specific” state machine (a separate path of a, b, c transitions per input sequence)]
  • 52. Mining Sequence Data “Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000) Given a set of n sequences:
      1. Create a state machine for the first sequence
      2. for j ← 2 to n do
         1. Create a state machine for the jth sequence
         2. Merge this state machine into the existing state machine as follows:
            1. Merge all halt states in the new state machine to the halt state in the existing state machine
            2. If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
      3. Done
  • 53. Mining Sequence Data “Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000) [Figure: step-by-step merging of the state machines for the sequences aabcb, aac and aabc]
  • 54. Mining Streaming Data Characteristics of streaming data: • Large data sequence • No storage • Often an infinite sequence • Examples: Stock market quotes, streaming audio/video, network traffic
  • 55. Mining Streaming Data Running mean: Let n = number of items read so far, avg = running average calculated so far. On reading the next number num: avg ← (n*avg + num) / (n+1); n ← n+1
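The update rule translates directly into code. In this sketch the class name `RunningMean` is illustrative; only `n` and `avg` are stored, so no past items need to be kept.

```python
class RunningMean:
    """Keeps the mean of a stream without storing the stream."""

    def __init__(self):
        self.n = 0          # number of items read so far
        self.avg = 0.0      # running average so far

    def update(self, num):
        # avg <- (n*avg + num) / (n+1), then n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg


mean = RunningMean()
for value in (2, 4, 6):
    mean.update(value)
print(mean.avg)   # 4.0
```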
  • 56. Mining Streaming Data Running variance: var = ∑(num − avg)² = ∑num² − 2*avg*∑num + ∑avg². Let A = ∑num² of all numbers read so far, B = 2*avg*∑num of all numbers read so far, C = ∑avg² = n*avg² over all numbers read so far, avg = average of numbers read so far, n = number of numbers read so far
  • 57. Mining Streaming Data Running variance: On reading next number num: avg ← (avg*n + num) / (n+1); n ← n+1; A ← A + num²; B ← 2*n*avg² (since ∑num = n*avg); C ← n*avg²; var = A − B + C = A − n*avg². This is the sum of squared deviations from the current mean; divide by n to obtain the variance.
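A streaming variance needs only the count, the running mean and the running sum of squares. The sketch below uses the identity ∑(num − avg)² = ∑num² − n*avg² and reports the population variance (division by n), which is an assumption since the slides do not say which normalization is intended. The class name is illustrative; for long streams with large values, Welford's method is a numerically more stable alternative.

```python
class RunningVariance:
    """Keeps the mean and variance of a stream without storing the stream."""

    def __init__(self):
        self.n = 0            # number of items read so far
        self.avg = 0.0        # running mean
        self.sum_sq = 0.0     # A: sum of squares of all items read so far

    def update(self, num):
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        self.sum_sq += num * num

    def sum_squared_deviations(self):
        # sum((num - avg)^2) = sum(num^2) - n * avg^2
        return self.sum_sq - self.n * self.avg ** 2

    def variance(self):
        """Population variance of the items read so far."""
        return self.sum_squared_deviations() / self.n


rv = RunningVariance()
for value in (2, 4, 4, 4, 5, 5, 7, 9):
    rv.update(value)
print(rv.avg, rv.variance())
```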
  • 58. Mining Streaming Data γ-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999) Let streaming data be in the form of “frames”, where each frame comprises one or more data elements. Support for data element k within a frame is defined as (#occurrences of k)/(#elements in frame). γ-Consistency for data element k is the “sustained” support for k over all frames read so far, with a “leakage” of (1-γ)
  • 59. Mining Streaming Data γ-Consistency: (Srinivasa and Spiliopoulou, CoopIS 1999) level_t(k) = (1-γ)*level_{t-1}(k) + γ*sup(k) [Figure: the level for k, with inflow γ*sup(k) and leakage (1-γ)]
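The level update can be sketched as a one-line recurrence applied per frame. The function names are illustrative, frames are modelled as lists of elements, and the starting level of 0 before the first frame is an assumption that the slides do not state.

```python
def frame_support(frame, k):
    """(#occurrences of k) / (#elements in frame)."""
    return frame.count(k) / len(frame)


def consistency_levels(frames, k, gamma, initial=0.0):
    """Level of element k after each frame: (1-gamma)*previous + gamma*support."""
    level = initial
    history = []
    for frame in frames:
        level = (1 - gamma) * level + gamma * frame_support(frame, k)
        history.append(level)
    return history


frames = [["a", "b"], ["a", "a"], ["b", "b"]]
print(consistency_levels(frames, "a", gamma=0.5))
```

A larger γ makes the level follow the latest frame more closely, while a smaller γ makes it react more slowly to changes in support.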
  • 60. Data Warehousing • A platform for online analytical processing (OLAP) • Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis • Also called “data marts” • A critical component of the decision support system (DSS) of enterprises • Some typical DW queries: – Which item sells best in each region that has retail outlets? – Which advertising strategy is best for South India? – Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?
  • 61. Data Warehousing [Figure: OLTP sources (Order Processing, Inventory, Sales) → Data Cleaning → Data Warehouse (OLAP)]
  • 62. OLTP vs OLAP
      Transactional Data (OLTP)                      Analysis Data (OLAP)
      Small or medium size databases                 Very large databases
      Transient data                                 Archival data
      Frequent insertions and updates                Infrequent updates
      Small query shadow                             Very large query shadow
      Normalization important to handle updates      De-normalization important to handle queries
  • 63. Data Cleaning • Performs logical transformation of transactional data to suit the data warehouse • Model of operations → model of enterprise • Usually a semi-automatic process
  • 64. Data Cleaning [Figure: transactional tables Orders, Inventory and Sales, with fields Order_id, Price, Cust_id, Prod_id, Price_chng, Cust_prof and Tot_sales, mapped into the Data Warehouse with Customers, Products, Orders, Inventory, Price and Time]
  • 65. Multi-dimensional Data Model [Figure: data cube with axes Products, Orders, Customers and Time (Jan’01, Jun’01, Jan’02, Jun’02), labelled Price]
  • 66. Some MDBMS Operations • Roll-up – Add dimensions • Drill-down – Collapse dimensions • Vector-distance operations (ex: clustering) • Vector space browsing
  • 67. Star Schema [Figure: a central fact table linked to four surrounding dimension tables]
  • 68. WWW Based References • http://www.kdnuggets.com/ • http://www.megaputer.com/ • http://www.almaden.ibm.com/cs/quest/index.html • http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html • http://www.cs.su.oz.au/~thierry/ckdd.html • http://www.dwinfocenter.org/ • http://datawarehouse.itoolbox.com/ • http://www.knowledgestorm.com/ • http://www.bitpipe.com/ • http://www.dw-institute.com/ • http://www.datawarehousing.com/
  • 69. References • R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. • R. Agrawal, R. Srikant. Mining Sequential Patterns. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995. • R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. The Quest Data Mining System. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996. • Surajit Chaudhuri, Umesh Dayal. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997. • Jennifer Widom. Research Problems in Data Warehousing. Proc. of Int'l Conf. on Information and Knowledge Management, 1995.
  • 70. References • A. Shoshani. OLAP and Statistical Databases: Similarities and Differences. Proc. of ACM PODS 1997. • Panos Vassiliadis, Timos Sellis. A Survey on Logical Models for OLAP Databases. ACM SIGMOD Record. • M. Gyssens, Laks V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. Proc. of VLDB 1997, Athens, Greece. • Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions Based on Consistent Patterns. Proc. of CoopIS 1999, Edinburgh, UK. • Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000, Como, Italy.