2. DATA MINING PROCESSES - STANDARD PROCESSES
CRISP-DM
Cross-Industry Standard Process for Data Mining
SEMMA
Specific to SAS
3. Cross-Industry Standard Process for Data Mining (CRISP-DM)
Provides an overview of the life cycle of a data mining project.
Six phases:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Phases of the CRISP-DM Process Model
4. CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
5. 1 BUSINESS UNDERSTANDING
Includes:
Determining business objectives
A managerial need for new knowledge
What types of customers are interested in each of our products?
What are the typical profiles of our customers, and how much value does each of them provide to us?
Assessing the current situation
Establishing data mining goals
Developing a project plan including a budget
6. 2 DATA UNDERSTANDING
Select the data
Three important issues
Set up a concise and clear description of the problem
Identify the relevant data for the problem description
The selected variables should be independent of each other (how strictly depends on the method)
Types of data
Demographic data – income, education, gender, etc.
Socio-graphic data – hobbies, club memberships, etc.
Transactional data – sales records, credit card spending, etc.
Quantitative data – numerical values
Qualitative data – contains nominal and ordinal data
7. SCALES
Nominal – no order between data points, e.g. gender
Ordinal – order between data points, e.g. ranking results
Interval – order between data points and equal distances between measurements, but no true zero point
Ratio – an interval scale with a true zero point, e.g. sales have doubled: 1 million the previous month, 2 million this month
Question: Is the Likert scale an ordinal or
interval scale?
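The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below is only an illustration: the sales figures follow the slide, while the temperature values and the helper `celsius_to_fahrenheit` are assumptions added here to stand in for an interval scale.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a Celsius reading to Fahrenheit (illustrative interval scale)."""
    return c * 9.0 / 5.0 + 32.0


# Ratio scale: sales have a true zero, so "doubled" survives a change of unit.
sales_prev, sales_now = 1_000_000.0, 2_000_000.0           # currency units
ratio_sales = sales_now / sales_prev
ratio_sales_cents = (sales_now * 100) / (sales_prev * 100)  # same data in cents

# Interval scale: Celsius has no true zero, so a ratio depends on the unit used.
t1, t2 = 10.0, 20.0                                         # assumed readings
ratio_celsius = t2 / t1
ratio_fahrenheit = celsius_to_fahrenheit(t2) / celsius_to_fahrenheit(t1)

print("sales ratio:", ratio_sales, "in cents:", ratio_sales_cents)
print("temperature ratio in C:", ratio_celsius, "in F:", ratio_fahrenheit)
```

The sales ratio is the same in both units, whereas the temperature ratio changes with the unit, which is why "twice as much" is only meaningful on a ratio scale.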
8. 3 DATA PREPARATION
Clean data for better quality
Convert data to be consistent
Treatment of missing values
Redundant data
Determine the data types:
In SPSS Modeler the following data types are used
RANGE Numeric values (integer, real)
FLAG Binary (yes/no, 0/1)
SET Data with multiple distinct values (string)
TYPELESS For other types of data
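As a rough illustration of how these four categories could be assigned, the sketch below guesses a category for a column of raw values. It is not SPSS Modeler's actual logic: the function name `infer_measure` and its rules (for example, treating any column with exactly two distinct values as FLAG) are assumptions made for this example.

```python
def infer_measure(values):
    """Guess 'FLAG', 'RANGE', 'SET' or 'TYPELESS' for a list of raw values.

    Hypothetical helper; missing values (None or empty string) are ignored.
    """
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "TYPELESS"                      # nothing to base a decision on
    if len(set(present)) == 2:
        return "FLAG"                          # binary, e.g. yes/no or 0/1
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in present):
        return "RANGE"                         # numeric (integer or real)
    if all(isinstance(v, str) for v in present):
        return "SET"                           # several distinct string values
    return "TYPELESS"                          # mixed or unsupported content


columns = {
    "churned": ["yes", "no", "yes", None],
    "income": [31000, 45500.5, 27000],
    "region": ["north", "south", "east", "west"],
    "notes": [1, "a", 2.0],
}
for name, values in columns.items():
    print(name, "->", infer_measure(values))
```

The column names and values are invented for the demonstration; in practice the analyst would still review each inferred type by hand.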
9. 4 MODELING
Data treatment
Training set, validation set, test set
Data mining techniques
Association
Classification
Clustering-segmentation
Predictions
Sequential Patterns
Similar Time Sequences
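The training, validation and test sets mentioned under data treatment can be produced with a simple random split. The following is a minimal sketch using only the Python standard library; the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration, not values given in the slides.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets.

    The test set receives whatever remains after the first two shares.
    """
    if train <= 0 or validation < 0 or train + validation >= 1:
        raise ValueError("train and validation shares must leave room for a test set")
    shuffled = list(rows)                      # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = int(round(len(shuffled) * train))
    n_val = int(round(len(shuffled) * validation))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


rows = list(range(100))                        # stand-in for 100 data records
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

The model is fitted on the training set, tuned on the validation set, and judged once on the test set, so that the reported performance is not biased by the tuning.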
10. 5 EVALUATION
How to recognize the business value from the knowledge discovered
A puzzle to be solved between data analysts, business
analysts and decision makers
Which visualization tool to use
Pie charts, histograms, box plots, scatter plots, self-organizing maps
12. SEMMA (BY THE SAS INSTITUTE)
Sample
Explore
Modify
Model
Assess
See http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html
13. AN APPLICATION EXAMPLE (CRISP-DM)
To predict which customers would become insolvent early enough for the firm to take preventive action
Billing period was 2 months
Customers used their phone for 4 weeks
Received bill about 1 week later
Payment was due 30 days after receiving the bill
Action taken if the bill was not paid within 14 days after the due date
Phone disconnected if bill exceeded a certain amount
Hypothesis: Customers change their calling behaviour before becoming insolvent
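The billing timeline described above can be laid out with date arithmetic, which shows how late the firm learns about a non-payment. This is a sketch only: the function name `billing_timeline` and the start date are hypothetical, and the intervals are read directly from the slide (4 weeks of use, bill about 1 week later, payment due 30 days after the bill, action 14 days after the due date).

```python
from datetime import date, timedelta


def billing_timeline(period_start: date) -> dict:
    """Return the key dates of one billing cycle, per the intervals on the slide."""
    usage_end = period_start + timedelta(weeks=4)       # 4 weeks of phone use
    bill_received = usage_end + timedelta(weeks=1)      # bill about 1 week later
    payment_due = bill_received + timedelta(days=30)    # due 30 days after the bill
    action_date = payment_due + timedelta(days=14)      # action 14 days after due date
    return {
        "usage_end": usage_end,
        "bill_received": bill_received,
        "payment_due": payment_due,
        "action_date": action_date,
    }


start = date(2024, 1, 1)                                # hypothetical start date
timeline = billing_timeline(start)
for name, day in timeline.items():
    print(f"{name}: {day} (day {(day - start).days})")
```

Under these assumptions payment falls due 65 days after the start of use, roughly the 2-month billing period, and action follows only on day 79, which is why an earlier prediction from calling behaviour is valuable.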
14. EXAMPLE CONT.
Data: 100 000 customers
17 month period
Discriminant analysis, decision trees and neural networks were used
2066 cases
46 initial variables
Costs were allocated to misclassification errors
Final result:
89.8 % of the test data correctly classified, with a cost function value of 360 € compared to 14 580 € in the first run.
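Allocating costs to misclassification errors means that the share of correctly classified cases and the total cost are evaluated side by side. The sketch below shows one way this could be computed; the function name `evaluate`, the labels and the two unit costs are hypothetical and are not the figures used in the study.

```python
def evaluate(actual, predicted, cost_false_negative, cost_false_positive):
    """Return (accuracy, total_cost) for boolean labels, True meaning insolvent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    correct = false_negatives = false_positives = 0
    for a, p in zip(actual, predicted):
        if a == p:
            correct += 1
        elif a:
            false_negatives += 1     # insolvent customer predicted as solvent
        else:
            false_positives += 1     # solvent customer predicted as insolvent
    accuracy = correct / len(actual)
    total_cost = (false_negatives * cost_false_negative
                  + false_positives * cost_false_positive)
    return accuracy, total_cost


# Hypothetical labels and unit costs, for illustration only.
actual = [True, True, False, False, False]
predicted = [True, False, False, True, False]
accuracy, total_cost = evaluate(actual, predicted,
                                cost_false_negative=10, cost_false_positive=1)
print(f"accuracy = {accuracy:.1%}, cost = {total_cost}")
```

With unequal costs, two models with the same accuracy can have very different total costs, so the cost function rather than accuracy alone drives model selection.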