This document provides an overview of key aspects of data preparation and processing for data mining. It discusses the importance of domain expertise in understanding data. The goals of data preparation are identified as cleaning missing, noisy, and inconsistent data; integrating data from multiple sources; transforming data into appropriate formats; and reducing data through feature selection, sampling, and discretization. Common techniques for each step are outlined at a high level, such as binning, clustering, and regression for handling noisy data. The document emphasizes that data preparation is crucial and can require 70-80% of the effort for effective real-world data mining.
Data preparation and processing – Chapter 2
1. Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology – Khan Younis
Development of computer systems
2016
Chapter 2 – Lecture 1
2. Outline
Introduction
Domain Expert
Goal identification and Data Understanding
Data Cleaning
Missing values
Noisy Data
Inconsistent Data
Data Integration
Data Transformation
Data Reduction
Feature Selection
Sampling
Discretization
4. The real-world databases typically used in data
mining may have millions of records and thousands of
variables. They are often noisy, with missing and
inconsistent values.
Data quality is a key issue in data mining, so data
preparation is a necessary step for serious, effective,
real-world data mining.
Introduction
5. To increase the accuracy of the mining, we have to
perform data preprocessing.
Otherwise: garbage in => garbage out.
Data preparation is estimated to take 70-80% of the
time and effort.
Introduction
6. Domain Expertise
Data quality expert: “We found these strange records
in your database after running sophisticated
algorithms!”
Domain Experts: “Oh, those apples - we put them
in the same baskets as oranges because there are too
few apples to bother. Not a big deal. We knew that
already.”
7. Domain Expertise
Domain Expertise is important for understanding the
data, the problem and interpreting the results.
“The counter resets to 0 if the number of calls exceeds N”.
“The missing values are represented by 0, but the default billed
amount is 0 too.”
Insufficient Domain Expertise is a primary cause of
poor Data Quality – the data become unusable.
8. Goal Identification
To obtain the highest benefit from data mining, there
must be a clear statement of the business objectives.
The first and most important step in any targeting-
model project is to establish a clear goal and develop a
process to achieve that goal.
9. Goal Identification
Examples of goals for a business company are:
You want to attract new customers.
You want to avoid high-risk customers.
You want to understand the characteristics of your current customers.
You want to make your unprofitable customers more profitable.
You want to retain your profitable customers.
You want to win back your lost customers.
You want to improve customer satisfaction.
You want to increase sales.
You want to reduce expenses.
10. Data Understanding
Starts with an initial data collection and proceeds with
activities that aim to get familiar with the data, to
identify data quality problems, and to discover first
insights into the data.
11. Data Understanding
Data Understanding: Relevance:
What data is available for the task?
Is this data relevant?
Is additional relevant data available?
How much historical data is available?
Who is the data expert?
12. Data Understanding
Data Understanding: Quantity
Number of instances (records)
Rule of thumb: 5,000 or more desired
if less, results are less reliable;
Number of attributes (fields)
Rule of thumb: for each field, 10 or more instances
If more fields, use feature reduction and selection
Number of targets
Rule of thumb: >100 for each class
if very unbalanced, use stratified sampling
14. Data Cleaning
An example dataset (columns are attributes, rows are objects/records):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     (missing)       125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        -95K            Yes
6    No      Married         60K             No
7    Yes     (missing)       220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Note the dirty entries: records 1 and 7 are missing Marital Status, and record 5 has an impossible negative income.
15. Data Cleaning
Real-world data tends to be incomplete, noisy and
inconsistent.
Data Cleaning Steps
Missing values
Noisy Data
Inconsistent Data
16. Missing values
A missing value (Mv) is an empty cell in the table
that represents a dataset.
[Figure: a dataset table of instances (rows) by attributes (columns), with one empty cell marked “?”]
17. Dealing with missing values
1. Ignore records with missing values:
This is usually done when the class label is missing.
This method is not effective, unless the record contains
several attributes with missing values.
18. Dealing with missing values
2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be
feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing values with the same constant, such as
“unknown”. Although this method is simple, it is not
recommended, because mining results built on “unknown”
values are not interesting.
19. Dealing with missing values
4. Use the attribute mean to fill missing values:
For example, if the mean of the income attribute is 28,000,
use this value to replace the missing incomes.
5. Use the attribute mean for all samples belonging to the
same class:
For example, if classifying customers according to credit risk,
replace the missing value with the mean income of the
customers in the same credit-risk category as the given
record.
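As a minimal sketch of method 5 (the field names, records, and helper below are hypothetical, not from the slides), the class means are computed from the known values first and then used to fill the gaps:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records; None marks a missing income.
records = [
    {"income": 25000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 31000, "risk": "high"},
    {"income": None,  "risk": "high"},
    {"income": 28000, "risk": "low"},
]

# Compute the mean income per credit-risk class from the known values only.
by_class = defaultdict(list)
for r in records:
    if r["income"] is not None:
        by_class[r["risk"]].append(r["income"])
class_mean = {cls: mean(vals) for cls, vals in by_class.items()}

# Replace each missing income with the mean of its class (method 5);
# using the mean of all known incomes instead would give method 4.
for r in records:
    if r["income"] is None:
        r["income"] = class_mean[r["risk"]]

print(records)  # the two gaps become 26500 (low) and 31000 (high)
```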
20. Dealing with missing values
6. Use an advanced method,
such as the k-nearest neighbors algorithm or a decision
tree, to predict the missing value from the other values.
21. Dealing with missing values
k nearest neighbors Approach
Compute the k nearest neighbors and assign a value
from them.
22. Dealing with missing values
k nearest neighbors Approach
For nominal values, use the most common value
among the neighbors.
For numerical values, use the average value.
Indeed, we need to define a proximity measure
between instances, such as the Euclidean distance.
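A minimal sketch of this k-NN approach, assuming purely numeric attributes, Euclidean distance, and no other missing values (the function name and data are ours, not from the slides): the k complete records closest to the incomplete one vote, by average for numeric values or by majority for nominal ones.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_impute(data, row, col, k=3, nominal=False):
    """Fill data[row][col] from the k nearest rows that have that column."""
    # Distance is computed over all columns except the one being imputed.
    def features(r):
        return [v for j, v in enumerate(r) if j != col]

    candidates = [(dist(features(data[row]), features(r)), r[col])
                  for i, r in enumerate(data)
                  if i != row and r[col] is not None]
    neighbors = [value for _, value in sorted(candidates)[:k]]
    if nominal:
        return Counter(neighbors).most_common(1)[0][0]  # most common value
    return sum(neighbors) / len(neighbors)              # average value

# Example: impute a missing income from the 2 most similar ages.
rows = [[25, 30000], [27, 32000], [26, None], [50, 80000], [52, 82000]]
rows[2][1] = knn_impute(rows, row=2, col=1, k=2)
print(rows[2])  # [26, 31000.0]: income averaged from the two nearest neighbors
```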
24. Data preparation and processing
Chapter 2 – Lecture 2
27. Noise is a random error in a measured variable.
Noisy data is meaningless data.
Any data that has been received, stored or changed
in such a manner that it cannot be read or used by the
program that originally created it can be described as noisy.
Noisy Data
28. Sources of noisy data:
1. Data entry problems.
2. Faulty data collection instruments.
3. Data transmission errors.
Noisy Data
29. Binning method
Clustering
Combined computer and human inspections
Regression
How to handle noisy data?
30. How to handle noisy data?
Binning method:
1. Sort the data.
2. Partition it into equal-frequency groups.
3. Smooth each group by its mean, its median, or its
boundaries.
31. How to handle noisy data?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-frequency) groups:
-G1: 4, 8, 9, 15
-G2: 21, 21, 24, 25
-G3: 26, 28, 29, 34
Smoothing by bin means:
-G1: 9, 9, 9, 9
-G2: 23, 23, 23, 23
-G3: 29, 29, 29, 29
Smoothing by bin boundaries:
-G1: 4, 4, 4, 15
-G2: 21, 21, 25, 25
-G3: 26, 26, 26, 34
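The slide's example can be reproduced with a short sketch (the function name is ours): it sorts the prices, splits them into equal-frequency groups of four, and replaces each group by its rounded mean.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort, split into bins, replace by the bin mean."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_values = data[start:start + bin_size]
        bin_mean = round(sum(bin_values) / len(bin_values))
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, bin_size=4))
# [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29], matching the slide
```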
32. How to handle noisy data?
Clustering: Outliers may be detected by clustering,
where similar values are organized into groups; values
that fall outside these clusters may be considered
outliers.
33. How to handle noisy data?
Combined computer and human inspections: suspicious
values are detected by the computer and then checked
by a human.
34. How to handle noisy data?
Regression: Data can be smoothed by fitting the
data to a function, such as a regression line; see the
sketch below.
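For instance, a minimal sketch of smoothing by linear regression with NumPy (the data here are made up): each noisy y value is replaced by the value predicted by the fitted line.

```python
import numpy as np

# Made-up noisy measurements that are roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line y ~ slope*x + intercept
smoothed = slope * x + intercept            # replace each value by its fitted value
print(smoothed.round(2))
```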
35. Inconsistent Data
Data which is inconsistent with our models should
be dealt with.
Common sense can also be used to detect such kinds
of inconsistency:
The same name occurring differently within an application.
Different names that are actually the same (Dennis vs.
Denis).
Inappropriate values (males being pregnant, or a
negative age).
A rating that was “1, 2, 3” is now “A, B, C”.
Differences between duplicate records.
36. Inconsistent Data
We want to transform all dates to the same format internally.
Some systems accept dates in many formats,
e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
Dates are transformed internally to a standard value.
Frequently, just the year (YYYY) is sufficient;
for more detail, we may need the month, the day, the hour,
etc.
Representing a date as YYYYMM or YYYYMMDD can be OK.
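A minimal sketch of such internal normalization to YYYYMMDD, using Python's datetime and the three formats from the slide (the function name is ours):

```python
from datetime import datetime

# Input patterns for "Sep 24, 2003", "9/24/03" and "24.09.03" respectively.
KNOWN_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

def to_yyyymmdd(text):
    """Try each known format and return the date as a standard YYYYMMDD string."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

for raw in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(to_yyyymmdd(raw))  # all three print 20030924
```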
38. Data Integration
Combines data from multiple sources into a coherent
store.
Increasingly, data mining projects require data
from more than one data source,
such as multiple databases, data warehouses, flat
files and historical data.
39. Data Integration
Data is stored in many systems across the enterprise
and outside the enterprise.
The sources of data fall into two categories:
Internal sources that are generated through enterprise
activities, such as databases, historical data, Web sites
and warehouses.
External sources, such as credit bureaus, phone
companies and demographic information.
40. Data Integration
Data Warehouse: a structure that links information
from two or more databases.
A data warehouse brings data from different data
sources into a central repository.
It performs some data integration, clean-up and
summarization, and distributes the information to data
marts.
43. Data preparation and processing
Chapter 2 – Lecture 3
47. Definition 1: Transform the data into a form
appropriate for the given data mining method.
Definition 2: Data transformation is the process of
converting data or information from one format to
another, usually from the format of a source system
into the required format of a new destination system.
Data Transformation
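As one common example of such a transformation (min-max normalization is our illustration; the slides do not single it out), numeric values can be rescaled into [0, 1] so that attributes with large ranges do not dominate distance-based mining methods:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [30000, 45000, 60000, 90000]
print(min_max_normalize(incomes))  # [0.0, 0.25, 0.5, 1.0]
```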
52. Data preparation and processing
Chapter 2 – Lecture 4
57. A warehouse may store terabytes of data, so complex
data analysis/mining may take a very long time to run
on the complete data set.
Data reduction: Obtains a reduced representation of
the data set that is much smaller in volume yet
produces the same (or almost the same) analytical
results.
Data Reduction
58. The choice of data representation, and selection,
reduction or transformation of features is probably the
most important issue that determines the quality of a
data-mining solution.
Data Reduction
59. The three basic operations in a data-reduction
process are:
Delete a column (feature selection).
Delete a row (sampling).
Reduce the number of values in a column
(discretization).
Data Reduction
60. Feature Selection
We want to choose features (attributes) that are
relevant to our data-mining application in order to
achieve maximum performance with the minimum
measurement and processing effort.
61. Feature Selection
1. Redundant features
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid.
62. Feature Selection
2. Irrelevant features
Contain no information that is useful for the data
mining task at hand.
E.g., a student's ID is often irrelevant to the task of
predicting students' GPA.
63. Feature Selection
3. Selecting the Most Relevant Fields
If there are too many fields, select the subset that is most
relevant.
The top N fields can be selected using some computed score.
What is a good N?
Rule of thumb: keep the top 50 fields.
64. Feature Selection
Two types of feature selection:
Unsupervised: reduce the fields without using the class label.
Supervised: select fields with respect to the class label.
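A minimal sketch of both types (the scoring choices are ours: variance for the unsupervised filter, absolute correlation with the label for the supervised one; statistics.correlation needs Python 3.10+):

```python
import statistics

def low_variance_filter(columns, threshold=1e-6):
    """Unsupervised: drop near-constant fields without looking at the label."""
    return {name: vals for name, vals in columns.items()
            if statistics.pvariance(vals) > threshold}

def top_n_by_correlation(columns, label, n):
    """Supervised: keep the n fields most correlated (in absolute value) with the label."""
    scores = {name: abs(statistics.correlation(vals, label))
              for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

fields = {"age": [25, 40, 33, 51], "id": [1, 1, 1, 1], "income": [30, 90, 50, 100]}
label = [0, 1, 0, 1]
kept = low_variance_filter(fields)           # removes the constant "id" field
print(top_n_by_correlation(kept, label, 1))  # ['income'], the most relevant field
```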
65. Sampling
Sampling: obtaining a small sample s to represent
the whole data set N.
It allows a mining algorithm to run with a complexity
that is potentially sub-linear in the size of the data.
66. Sampling
Key principle: Choose a representative subset of the
data.
Simple random sampling may perform very poorly in
the presence of skew, which motivates adaptive
sampling methods, e.g., stratified sampling.
68. Types of Sampling
Sampling without replacement:
Once an object is selected, it is removed from the population.
Sampling with replacement:
A selected object is not removed from the population.
Stratified sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
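A minimal sketch of the three variants with Python's random module (the population and strata here are made up):

```python
import random

random.seed(0)                      # reproducible illustration
population = list(range(100))

# Without replacement: a selected object is removed from the pool.
no_repl = random.sample(population, k=10)

# With replacement: the same object may be drawn more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified: draw approximately the same percentage from each partition.
strata = {"low_risk": list(range(0, 70)), "high_risk": list(range(70, 100))}
fraction = 0.1
stratified = [obj for members in strata.values()
              for obj in random.sample(members, k=round(len(members) * fraction))]

print(no_repl, with_repl, stratified, sep="\n")
```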
73. Discretization
Discretization, also called “binning”, is very useful for
generating a summary of the data.
It does not use the class information.
Suppose we have the following set of values for the
attribute AGE: 0, 4, 12, 16, 16, 18, 24, 26, 28.
Two possible ways in which binning can be applied are
equal-width binning and equal-frequency binning.
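A minimal sketch of both binnings on the AGE values from the slide (the function names are ours); each function returns the bin index assigned to every value:

```python
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Give each bin (roughly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

print(equal_width_bins(ages, 3))      # [0, 0, 1, 1, 1, 1, 2, 2, 2]
print(equal_frequency_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```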