2. Data Warehouse and Data Mining Chapter 5 2
Chapter Objectives
• Determine an appropriate data mining strategy for a specific problem.
• Know about several data mining techniques and how each technique builds a generalized model to represent data.
• Understand how a confusion matrix is used to help evaluate supervised learner models.
Chapter Objectives
• Understand basic techniques for evaluating supervised learner models with numeric output.
• Know how measuring lift can be used to compare the performance of several competing supervised learner models.
• Understand basic techniques for evaluating unsupervised learner models.
Data Mining Strategies
Classification is probably the best understood of all data mining strategies. Classification tasks have three common characteristics:
• Learning is supervised.
• The dependent variable is categorical.
• The emphasis is on building models able to assign new instances to one of a set of well-defined classes.
Data Mining Strategies
Some example classification tasks include the following:
• Determine those characteristics that differentiate individuals who have suffered a heart attack from those who have not.
• Develop a profile of a “successful” person.
• Determine if a credit card purchase is fraudulent.
• Classify a car loan applicant as a good or a poor credit risk.
• Develop a profile to differentiate female and male stroke victims.
Chapter Summary
Data mining strategies include classification, estimation, prediction, unsupervised clustering, and market basket analysis. Classification and estimation strategies are similar in that each is used to build models that determine a current, rather than a future, outcome. However, the output of a classification strategy is categorical, whereas the output of an estimation strategy is numeric.
Chapter Summary
A predictive strategy differs from a classification or estimation strategy in that it is used to design models for predicting future outcome rather than current behavior.
Unsupervised clustering strategies are employed to discover hidden concept structures in data as well as to locate atypical data instances.
The purpose of market basket analysis is to find interesting relationships among retail products. Discovered relationships can be used to design promotions, arrange shelf or catalog items, or develop cross-marketing strategies.
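One simple way to look for relationships among retail products is to count how often pairs of items appear in the same purchase. The sketch below is only an illustration under that assumption: the function name `pair_counts` and the basket data are hypothetical, and raw co-occurrence counting is just one possible notion of an "interesting" relationship, not necessarily the one the chapter has in mind.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical purchase data, made up for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "milk", "eggs"],
    ["butter", "eggs"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs that co-occur far more often than others are candidates for joint promotions or adjacent shelf placement.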
Chapter Summary
A data mining technique applies a data mining strategy to a set of data. Data mining techniques are defined by an algorithm and a knowledge structure. Common features that distinguish the various techniques are whether learning is supervised or unsupervised and whether their output is categorical or numeric.
Chapter Summary
Familiar supervised data mining techniques include decision tree methods, production rule generators, neural networks, and statistical methods. Association rules are a favorite technique for marketing applications. Clustering techniques employ some measure of similarity to group instances into disjoint partitions. Clustering methods are frequently used to help determine a best set of input attributes for building supervised learner models.
Chapter Summary
Performance evaluation is probably the most critical of all the steps in the data mining process. Supervised model evaluation is often performed using a training/test set scenario. Supervised models with numeric output can be evaluated by computing average absolute or average squared error differences between computed and desired outcome.
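The error measures for numeric output can be sketched in a few lines. This is a minimal illustration, assuming the model's test-set predictions are already available as a list. The function name `error_metrics` and the sample values are hypothetical.

```python
import math


def error_metrics(actual, predicted):
    """Return (mean absolute error, mean squared error, root mean squared error)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)   # average absolute difference
    mse = sum(d * d for d in diffs) / len(diffs)    # average squared difference
    return mae, mse, math.sqrt(mse)


# Hypothetical test-set values, made up for illustration.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

mae, mse, rmse = error_metrics(actual, predicted)
print(mae, mse, rmse)
```

Squaring penalizes large errors more heavily than small ones, so the two averages can rank competing models differently.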
Chapter Summary
Marketing applications that focus on mass mailings are interested in developing models for increasing response rates to promotions. A marketing application measures the goodness of a model by its ability to lift response rate thresholds to levels well above those achieved by naïve (mass) mailing strategies.
Unsupervised models support some measure of cluster quality that can be used for evaluative purposes. Supervised learning can also be employed to evaluate the quality of the clusters formed by an unsupervised model.
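As a rough illustration of lift in a mailing scenario, the sketch below takes lift to be the response rate within the model-selected sample divided by the response rate across the whole population. The function name `lift` and all counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def lift(sample_responders, sample_size, population_responders, population_size):
    """Response rate in the selected sample divided by the population response rate."""
    if sample_size <= 0 or population_size <= 0 or population_responders <= 0:
        raise ValueError("sizes and population responders must be positive")
    p_sample = sample_responders / sample_size
    p_population = population_responders / population_size
    return p_sample / p_population


# Hypothetical figures: a mass mailing to 100,000 people draws 1,000 responses,
# while the 20,000 people selected by the model include 540 responders.
model_lift = lift(540, 20_000, 1_000, 100_000)
print(round(model_lift, 2))
```

A lift near 1 means the model selects responders no better than a mass mailing would. Larger values indicate that mailing only to the selected sample concentrates responders.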
Key Terms
• Classification. A supervised learning strategy where the output attribute is categorical. The emphasis is on building models able to assign new instances to one of a set of well-defined classes.
• Association rule. A production rule whose consequent may contain multiple conditions and attribute relationships. An output attribute in one association rule can be an input attribute in another rule.
• Confusion matrix. A matrix used to summarize the results of a supervised classification. Entries along the main diagonal represent the total number of correct classifications. Entries other than those on the main diagonal represent classification errors.
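A confusion matrix can be built directly from paired actual and predicted class labels. The following is a minimal sketch, with rows taken as actual classes and columns as predicted classes (an assumed layout, since the opposite orientation is also in use). The function names and the credit-risk labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(matrix):
    """Fraction of instances on the main diagonal (correct classifications)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical test-set labels, made up for illustration.
classes = ["good", "poor"]
actual = ["good", "good", "poor", "poor", "good", "poor"]
predicted = ["good", "poor", "poor", "good", "good", "poor"]

matrix = confusion_matrix(actual, predicted, classes)
for label, row in zip(classes, matrix):
    print(label, row)
print(accuracy(matrix))
```

The off-diagonal cells show which kind of error the model makes, which matters when the two mistakes carry different costs.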
Key Terms
• Data mining strategy. An outline of an approach for problem solution.
• Data mining technique. One or more algorithms together with an associated knowledge structure.
• Dependent variable. A variable whose value is determined by a combination of one or more independent variables.
• Estimation. A supervised learning strategy where the output attribute is numeric. Emphasis is on determining current rather than future outcome.
Key Terms
• Independent variable. An input attribute used for building supervised or unsupervised learner models.
• Lift. The probability of class C_i given a sample taken from population P, divided by the probability of C_i given the entire population P.
• Lift chart. A graph that displays the performance of a data mining model as a function of sample size.
• Linear regression. A supervised learning technique that generalizes numeric data as a linear equation. The equation defines the value of an output attribute as a linear sum of weighted input attribute values.
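To make the "linear sum of weighted input attribute values" concrete, the sketch below fits a line to a single input attribute by ordinary least squares and then evaluates the weighted sum for a new instance. Least squares is one common way to choose the weights and is assumed here. The function names `fit_line` and `predict` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(weights, bias, inputs):
    """Output attribute as a constant plus a weighted sum of input attributes."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
estimate = predict([slope], intercept, [5.0])
print(slope, intercept, estimate)
```

With several input attributes the same `predict` form applies, but the weights would normally be found with a matrix-based solver rather than the one-variable formulas used here.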
Key Terms
• Market basket analysis. A data mining strategy that attempts to find interesting relationships among retail products.
• Mean absolute error. For a set of training or test set instances, the mean absolute error is the average absolute difference between the predicted output and the actual output.
• Mean squared error. For a set of training or test set instances, the mean squared error is the average of the squared differences between the predicted output and the actual output.
• Neural network. A set of interconnected nodes designed to imitate the functioning of the human brain.
Key Terms
• Outliers. Atypical data instances.
• Prediction. A supervised learning strategy designed to determine future outcome.
• Root mean squared error. The square root of the mean squared error.
• Rule Maker. A supervised learner model for generating production rules from data.
• Statistical regression. A supervised learning technique that generalizes numerical data as a mathematical equation. The equation defines the value of an output attribute as a sum of weighted input attribute values.