2. Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Labels: Databases; Data Cleaning; Data Integration; Data Warehouse; Selection & Transformation; Task-relevant Data; Data Mining; Pattern Evaluation]
3. Steps of a KDD Process
Data Cleaning
Handles noisy, inconsistent, and incomplete data
Missing Values
Noisy data
Binning, Clustering etc.
Inconsistencies
Tools, functional dependencies
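The missing-value and binning steps above can be sketched as follows. This is a minimal illustration, assuming numeric data with `None` marking a missing value; the function names and the sample prices are hypothetical, and mean imputation and bin-mean smoothing are only two of several possible choices.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Equal-depth binning: sort the values, split them into bins of
    bin_size, and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical sample data
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
filled = fill_missing([1, None, 3])
```

Smoothing by bin means trades detail for noise reduction; smaller bins keep more of the original variation.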
4. Data Integration
Schema Integration
Entity Identification problem
Redundancy
Correlation Analysis
Data Selection
Select only the task-relevant data
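The correlation analysis listed under redundancy can be sketched with the Pearson correlation coefficient: two attributes whose coefficient is close to +1 or -1 carry nearly the same information, so one of them may be redundant after integration. The attribute names below are hypothetical, and the sketch assumes numeric attributes with non-zero variance.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes from two sources: the same price in two units
price_usd = [10, 20, 30, 40]
price_cents = [1000, 2000, 3000, 4000]
r = pearson(price_usd, price_cents)  # close to 1.0, so one attribute is redundant
```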
5. Data Transformation
Transform or consolidate data
Smoothing, Normalization, Feature Construction
Data Reduction - Compression
Data Mining
Intelligent methods are applied to extract patterns
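As an illustration of the normalization step listed under data transformation, here is a minimal sketch of two common schemes, min-max scaling and z-score standardization. The slides do not say which scheme is intended; the function names are hypothetical, and the input is assumed to contain at least two distinct values.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

def z_score(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample data
incomes = [10, 20, 30]
scaled = min_max(incomes)
standardized = z_score(incomes)
```

Min-max scaling keeps the original ordering inside a fixed range, while z-scores are less sensitive to a single extreme minimum or maximum.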
6. Pattern Evaluation
Interestingness Measures
Knowledge Presentation
Visualization
7. Data Mining Functionalities
Descriptive
Characterize general properties of the data
Predictive
Performs inference
Mining
In parallel
At various granularities
8. Data Mining Functionalities
Concept/class description
Association Analysis
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
9. Concept/ Class Description
Data can be associated with classes or concepts
Computers, Printers
BigSpenders Vs BudgetSpenders
Class / Concept Description
Classes and concepts can be summarized in concise and precise terms
Data Characterization
Data Discrimination
10. Data Characterization
Summarization of the general characteristics
Data collected and aggregated
OLAP roll up operation
Attribute Oriented Induction
Results – Charts, cubes, rules
Example
Characteristics of Customers
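The roll-up operation mentioned above aggregates a measure to a coarser level of a dimension. A minimal sketch, assuming rows stored as dictionaries; the `roll_up` helper, the column names, and the sales figures are all hypothetical.

```python
from collections import defaultdict

def roll_up(rows, group_key, measure):
    """Sum a measure over all rows that share the same value of group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[measure]
    return dict(totals)

# Hypothetical sales records at city level
sales = [
    {"city": "Chicago", "country": "USA", "amount": 120.0},
    {"city": "New York", "country": "USA", "amount": 80.0},
    {"city": "Toronto", "country": "Canada", "amount": 50.0},
]

# Roll up from the city level to the country level
by_country = roll_up(sales, "country", "amount")
```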
11. Data Discrimination
Compare the target class with contrasting classes
Classes may be user-specified
Examples:
Products whose sales increased Vs decreased
Regular Shoppers Vs Occasional Shoppers
Output includes Comparative measures
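A comparative measure of this kind can be as simple as the difference between class means on one attribute. The sketch below assumes rows stored as dictionaries; the `discriminate` helper, the attribute name, and the shopper data are hypothetical.

```python
from statistics import mean

def discriminate(rows, class_key, target, contrast, attribute):
    """Compare the mean of one attribute between a target class
    and a contrasting class."""
    target_mean = mean(r[attribute] for r in rows if r[class_key] == target)
    contrast_mean = mean(r[attribute] for r in rows if r[class_key] == contrast)
    return {
        target: target_mean,
        contrast: contrast_mean,
        "difference": target_mean - contrast_mean,
    }

# Hypothetical data: regular vs. occasional shoppers
shoppers = [
    {"group": "regular", "visits_per_month": 8},
    {"group": "regular", "visits_per_month": 10},
    {"group": "occasional", "visits_per_month": 1},
    {"group": "occasional", "visits_per_month": 3},
]
comparison = discriminate(shoppers, "group", "regular", "occasional", "visits_per_month")
```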
12. Association Analysis
Discovery of association rules
Form: X ⇒ Y
Multi-dimensional
age(X, “20…29”) ∧ income(X, “20K…25K”) ⇒ buys(X, “Laptop”)
Single Dimensional
buys(X, “Laptop”) ⇒ buys(X, “Software”)
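Rules of the form X ⇒ Y are commonly ranked by support and confidence. These two measures are not named on the slide, so treat the following as a hedged sketch for the single-dimensional rule buys(X, "Laptop") ⇒ buys(X, "Software"); the transaction data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent):
    support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data
transactions = [
    {"Laptop", "Software"},
    {"Laptop", "Software", "Mouse"},
    {"Laptop"},
    {"Printer"},
]
rule_support = support(transactions, {"Laptop", "Software"})
rule_confidence = confidence(transactions, {"Laptop"}, {"Software"})
```

Here two of the four transactions contain both items, and two of the three laptop purchases also include software.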
13. Classification and Prediction
Classification
Finds models that describe and differentiate classes or concepts
Predicts class
Training data
Models: rules, decision trees, neural networks (NN), formulae
Preceded by relevance analysis (to eliminate irrelevant attributes)
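To make the idea of learning a rule-based model from training data concrete, here is a minimal one-rule learner: for each value of a single attribute it predicts the majority class seen in training. This is a deliberately simple stand-in, not the method of the slides; the attribute names and training rows are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(rows, attribute, label):
    """Learn one rule per attribute value: predict the majority class
    observed for that value in the training data."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(rules, value, default=None):
    """Return the learned class for value, or default for unseen values."""
    return rules.get(value, default)

# Hypothetical labeled training data
training = [
    {"income": "high", "spender": "BigSpender"},
    {"income": "high", "spender": "BigSpender"},
    {"income": "high", "spender": "BudgetSpender"},
    {"income": "low", "spender": "BudgetSpender"},
    {"income": "low", "spender": "BudgetSpender"},
]
rules = train_one_rule(training, "income", "spender")
predicted = predict(rules, "high")
```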
14. Classification and Prediction
Prediction
Derived model is used for prediction
Data value prediction
Class label prediction (Classification)
Trend identification
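Data value prediction can be illustrated with a least-squares line fitted to past observations. This is one possible model among many, shown under the assumption of a single numeric predictor; the function names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_value(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: sales per quarter
quarters = [1, 2, 3, 4]
sales = [2.0, 4.0, 6.0, 8.0]
model = fit_line(quarters, sales)
next_quarter = predict_value(model, 5)
```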
15. Cluster Analysis
Unsupervised
Class labels are missing in the training set
Maximize Intra-class similarity
Minimize Inter-class similarity
Hierarchy of classes
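Grouping unlabeled values so that members of a group are similar to each other and dissimilar to other groups can be sketched with k-means on one-dimensional data. K-means is one clustering technique among several and is not named on the slide; the starting centers and the sample values are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Basic k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center when a cluster ends up empty
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabeled data and starting centers
values = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(values, centers=[1, 12])
```

The result depends on the starting centers, so in practice the procedure is often run from several initializations.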
16. Outlier Analysis
Objects that do not comply with the general behavior
Noise Vs Rare events
Fraud detection
Statistical tests
Deviation based methods
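A simple statistical test for outliers flags values that lie more than a chosen number of standard deviations from the mean. The sketch below assumes roughly bell-shaped numeric data; the threshold of 2.0 and the sample amounts are hypothetical choices.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious entry
amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = z_score_outliers(amounts)
```

Whether a flagged value is noise or a rare event of interest, such as fraud, still has to be judged in context.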