Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 2 – Lecture 1
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
 Real-world databases used in data mining may have
millions of records and thousands of variables. They are
typically noisy and contain missing and inconsistent
values.
Data quality is a key issue in data mining, so data
preparation is a necessary step for serious, effective,
real-world data mining.
Introduction
To increase the accuracy of mining, data must first be
preprocessed.
Otherwise, garbage in => garbage out.
Data preparation is estimated to take 70-80% of the
time and effort.
Introduction
Domain Expertise
 Data quality expert: “We found these strange records
in your database after running sophisticated
algorithms!”
 Domain Experts: “Oh, those apples - we put them
in the same baskets as oranges because there are too
few apples to bother. Not a big deal. We knew that
already.”
Domain Expertise
Domain Expertise is important for understanding the
data, the problem and interpreting the results.
“The counter resets to 0 if the number of calls exceeds N”.
“The missing values are represented by 0, but the default billed
amount is 0 too.”
Insufficient domain expertise is a primary cause of
poor data quality: the data become unusable.
Goal Identification
 To obtain the highest benefit from data mining, there
must be a clear statement of the business objectives.
 The first and most important step in any targeting-
model project is to establish a clear goal and develop a
process to achieve that goal.
Goal Identification
 Examples of goals for a business:
 You want to attract new customers.
 You want to avoid high-risk customers.
 You want to understand the characteristics of your current customers.
 You want to make your unprofitable customers more profitable.
 You want to retain your profitable customers.
 You want to win back your lost customers.
 You want to improve customer satisfaction.
 You want to increase sales.
 You want to reduce expenses.
Data Understanding
 Starts with an initial data collection and proceeds with
activities to get familiar with the data, to
identify data quality problems, and to discover first
insights into the data.
Data Understanding
Data Understanding: Relevance:
 What data is available for the task?
 Is this data relevant?
 Is additional relevant data available?
 How much historical data is available?
 Who is the data expert ?
Data Understanding
Data Understanding: Quantity
 Number of instances (records)
 Rule of thumb: 5,000 or more desired
 if less, results are less reliable;
 Number of attributes (fields)
 Rule of thumb: for each field, 10 or more instances
 If more fields, use feature reduction and selection
 Number of targets
 Rule of thumb: >100 for each class
 if very unbalanced, use stratified sampling
Data Cleaning
[Figure: the data preparation steps: Goal identification & Data Understanding, Data Cleaning, Data Integration, Data Transformation, Data Reduction]
Data Cleaning
Example data set (rows are objects, columns are attributes):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     (missing)       125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        -95k            Yes
6    No      Married         60K             No
7    Yes     (missing)       220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Cleaning
 Real-world data tends to be incomplete, noisy and
inconsistent.
 Data Cleaning Steps
 Missing values
 Noisy Data
 Inconsistent Data
Missing values
 A missing value (Mv) is an empty cell in the table
that represents a dataset.
[Figure: a table with instances as rows and attributes as columns; a "?" marks a missing cell]
Dealing with missing values
1. Ignore records with missing values:
 This is usually done when the class label is missing.
 This method is not effective, unless the record contains
several attributes with missing values.
Dealing with missing values
2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be
feasible for a large data set with many missing values.
3. Fill in the missing value with a global constant:
Replace all missing values with the same constant, such as
“unknown”. Although this method is simple, it is not
recommended, because results built on “unknown” values are not
“interesting”.
Dealing with missing values
4. Use the attribute mean to fill missing values:
For example, if the mean of the income attribute is 28000,
use this value to replace the missing income values.
5. Use the attribute mean of all samples belonging to the
same class:
For example, if classifying customers according to credit risk,
replace the missing value with the mean income of
customers in the same credit risk category as the given
record.
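Methods 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record layout (a list of dicts with `None` for a missing value), the function names and the sample customers are all assumptions made for the example, and the second function assumes every class has at least one known value.

```python
from collections import defaultdict

def fill_with_mean(records, attr):
    """Method 4: replace missing values of attr with the overall mean."""
    known = [r[attr] for r in records if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in records]

def fill_with_class_mean(records, attr, class_attr):
    """Method 5: replace missing values with the mean of the record's own class."""
    groups = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            groups[r[class_attr]].append(r[attr])
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    return [
        {**r, attr: means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in records
    ]

# Hypothetical customers; None marks a missing income.
customers = [
    {"risk": "low", "income": 30000},
    {"risk": "low", "income": 34000},
    {"risk": "low", "income": None},
    {"risk": "high", "income": 20000},
    {"risk": "high", "income": None},
]

by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
```

With this sample the overall mean is 28000, while the class means are 32000 for low-risk and 20000 for high-risk customers, so the two methods fill the same gap with different values.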
Dealing with missing values
6. Use an advanced method,
such as k-nearest neighbors or a decision
tree, to predict the missing value from the other values.
Dealing with missing values
k nearest neighbors Approach
Compute the k nearest neighbors and assign a value
from them.
Dealing with missing values
k nearest neighbors Approach
 For nominal values, use the most common value
among all neighbors.
 For numerical values use the average value.
 This requires a proximity measure
between instances, such as Euclidean distance.
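The k nearest neighbors approach can be sketched as follows. This is a simplified illustration under stated assumptions: records are dicts, the attributes used for the distance are numeric and never missing, no scaling is applied before computing the Euclidean distance, and the names `knn_impute` and `rows` are made up for the example.

```python
import math
from collections import Counter

def knn_impute(records, target, features, k=3):
    """Fill missing `target` values from the k nearest complete records."""
    complete = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        point = [r[f] for f in features]
        # Euclidean distance on the chosen features (assumed numeric and complete).
        neighbors = sorted(
            complete,
            key=lambda c: math.dist(point, [c[f] for f in features]),
        )[:k]
        values = [n[target] for n in neighbors]
        if all(isinstance(v, (int, float)) for v in values):
            estimate = sum(values) / len(values)             # numerical: average
        else:
            estimate = Counter(values).most_common(1)[0][0]  # nominal: most common
        filled.append({**r, target: estimate})
    return filled

# Hypothetical records; the last one is missing both income and status.
rows = [
    {"age": 25, "years": 2, "income": 30.0, "status": "Single"},
    {"age": 27, "years": 4, "income": 34.0, "status": "Single"},
    {"age": 29, "years": 6, "income": 38.0, "status": "Married"},
    {"age": 55, "years": 30, "income": 90.0, "status": "Married"},
    {"age": 26, "years": 3, "income": None, "status": None},
]

income_filled = knn_impute(rows, "income", ["age", "years"], k=3)
status_filled = knn_impute(rows, "status", ["age", "years"], k=3)
```

In practice the features are usually rescaled first, so that an attribute with a large range does not dominate the distance.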
Next:
Data Cleaning: Noisy Data
Chapter 2 – Lecture 2
Introduction
 Noise is a random error in a measured variable.
 Noisy data is meaningless data.
 Any data that has been received, stored or changed
in such a manner that it cannot be read or used by the
program that originally created it can be described as
noisy.
Noisy Data
 Sources of noisy data:
1. Data entry problems.
2. Faulty data collection instruments.
3. Data transmission problems.
How to handle noisy data ?
 Binning method
 Clustering
 Combined computer and human inspection
 Regression
How to handle noisy data ?
 Binning method:
1. Sort the data.
2. Partition it into equal-frequency groups.
3. Smooth each group by its mean, its median or its
boundaries.
How to handle noisy data ?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-frequency) groups:
-G1: 4, 8, 9, 15
-G2: 21, 21, 24, 25
-G3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
-G1: 9, 9, 9, 9
-G2: 23, 23, 23, 23
-G3: 29, 29, 29, 29
Smoothing by bin boundaries:
-G1: 4, 4, 4, 15
-G2: 21, 21, 25, 25
-G3: 26, 26, 26, 34
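The worked example above can be reproduced with a short Python sketch. The function names are illustrative. Two choices here are assumptions not stated on the slide: bin means are rounded to the nearest integer, and a value exactly halfway between the two boundaries goes to the lower one.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins groups of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
groups = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(groups)            # [[9]*4, [23]*4, [29]*4]
by_boundaries = smooth_by_boundaries(groups)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The sketch assumes at least as many values as bins, so that no bin is empty.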
How to handle noisy data ?
Clustering: Outliers may be detected by clustering,
where similar values are organized into groups. Values
that fall outside the set of clusters may be considered
outliers.
How to handle noisy data ?
 Combined computer and human inspection: Outliers
may be identified by having the computer detect suspicious
values, which are then checked by a human.
How to handle noisy data ?
 Regression: Data can be smoothed by fitting the
data to a function.
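As one possible instance of regression smoothing, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the fitted one. The choice of a linear function, the function names and the sample points are assumptions made for illustration; other functions can be fitted in the same spirit.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx            # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each noisy y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements scattered around a straight line.
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

For these points the fitted line is approximately y = 1.94x + 1.09.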
Inconsistent Data
 Data that is inconsistent with our models should
be dealt with.
 Common sense can also be used to detect this kind
of inconsistency:
 The same name occurring differently in an application.
 Different names that appear to be the same (Dennis vs.
Denis).
 Inappropriate values (males being pregnant, or a
negative age).
 A changed coding scheme (rating was “1, 2, 3”, now “A, B, C”).
 Differences between duplicate records.
Inconsistent Data
 We want to transform all dates to the same format internally
 Some systems accept dates in many formats
 e.g. “Sep 24, 2003”, “9/24/03”, “24.09.03”, etc.
 dates are transformed internally to a standard value
 Frequently, just the year (YYYY) is sufficient
 For more details, we may need the month, the day, the hour,
etc
 Representing date as YYYYMM or YYYYMMDD can be OK.
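A minimal sketch of such a normalization, using Python's standard `datetime` module, is shown here. The list of accepted formats is an assumption built from the three examples above. In particular, “9/24/03” is read as month/day/year, two-digit years follow the standard library's default century rule, and month abbreviations are assumed to be English.

```python
from datetime import datetime

# Formats tried in order; taken from the three example spellings.
KNOWN_FORMATS = ("%b %d, %Y", "%m/%d/%y", "%d.%m.%y")

def to_yyyymmdd(text):
    """Convert a date written in one of the known formats to YYYYMMDD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

raw_dates = ["Sep 24, 2003", "9/24/03", "24.09.03"]
normalized = [to_yyyymmdd(d) for d in raw_dates]  # all three become "20030924"
```

Raising an error on an unknown format, rather than guessing, keeps ambiguous dates visible so that a domain expert can decide how to read them.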
Data Integration
[Figure: the data preparation steps: Goal identification & Data Understanding, Data Cleaning, Data Integration, Data Transformation, Data Reduction]
Data Integration
 Combines data from multiple sources into a coherent
store.
 Increasingly, data mining projects require data
from more than one source, such as multiple databases,
data warehouses, flat files and historical data.
Data Integration
 Data is stored in many systems across the enterprise
and outside it.
 The sources of data fall into two categories:
 Internal sources that are generated through enterprise
activities, such as databases, historical data, Web sites
and warehouses.
 External sources, such as credit bureaus, phone
companies and demographic information.
Data Integration
 Data warehouse: a structure that links information
from two or more databases.
 A data warehouse brings data from different data
sources into a central repository.
 It performs some data integration, clean-up and
summarization, and distributes the information to data
marts.
Data Integration
Next:
Data Transformation
Chapter 2 – Lecture 3
Introduction
Data Transformation
 Definition 1: Transform the data into a form
appropriate for a given data mining method.
 Definition 2: Data transformation is the process of
converting data or information from one format to
another, usually from the format of a source system
into the required format of a new destination system.
Data Transformation
 Methods include:
 Smoothing
 Aggregation
 Generalization
 Normalization (min-max)
Data Transformation
Methods of Data Transformation
 Normalization: the attributes are scaled so as to
fall within a small specified range, such as -1.0 to 1.0.
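Min-max normalization maps a value v from the attribute's observed range [min, max] to a target range [new_min, new_max] using v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal Python sketch follows; the function name and the handling of a constant attribute are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Constant attribute: every value maps to the lower bound (a convention).
        return [new_min for _ in values]
    return [
        (v - old_min) / span * (new_max - new_min) + new_min
        for v in values
    ]

unit_range = min_max_normalize([0, 5, 10])                 # [0.0, 0.5, 1.0]
symmetric_range = min_max_normalize([10, 20, 30], -1.0, 1.0)  # [-1.0, 0.0, 1.0]
```

Because the minimum and maximum come from the data, a single extreme outlier compresses all other values into a narrow part of the target range.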
Next:
Data Reduction
Chapter 2 – Lecture 4
Introduction
[Figure: the data preparation steps: Goal identification and Data Understanding, Data Cleaning, Data Integration, Data Transformation, Data Reduction]
Data Reduction
Data Reduction (Selection)
 A warehouse may store terabytes of data, so complex
data analysis/mining may take a very long time to run
on the complete data set.
 Data reduction: obtains a reduced representation of
the data set that is much smaller in volume yet
produces the same (or almost the same) analytical
results.
Data Reduction
 The choice of data representation, and selection,
reduction or transformation of features is probably the
most important issue that determines the quality of a
data-mining solution.
Data Reduction
 The three basic operations in a data-reduction
process are:
 Delete a column (feature selection).
 Delete a row (sampling).
 Reduce the number of values in a column
(Discretization).
Data Reduction
Feature Selection
 We want to choose features (attributes) that are
relevant to our data-mining application in order to
achieve maximum performance with the minimum
measurement and processing effort.
Feature Selection
1. Redundant features
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid.
Feature Selection
2. Irrelevant features
 Contain no information that is useful for the data
mining task at hand.
E.g., students' ID is often irrelevant to the task of
predicting students' GPA.
Feature Selection
3. Selecting Most Relevant Fields
 If there are too many fields, select a subset that is most
relevant.
 The top N fields can be selected using some relevance score.
 What is a good N?
 Rule of thumb -- keep top 50 fields
Feature Selection
 Two types of feature selection:
 Unsupervised: Reduce fields without knowing class label.
 Supervised: Select fields with respect to class label.
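One simple unsupervised way to spot redundant features is to drop any column that is almost perfectly correlated with a column already kept. The sketch below uses the Pearson correlation for this. The 0.95 threshold, the function names and the sample columns are assumptions made for illustration, and the slides do not prescribe this particular technique.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)  # assumes neither column is constant

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with any kept column."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: sales tax is a fixed 10% of the purchase price.
columns = {
    "price": [10.0, 20.0, 30.0, 40.0],
    "sales_tax": [1.0, 2.0, 3.0, 4.0],
    "quantity": [3.0, 1.0, 4.0, 2.0],
}
selected = drop_redundant(columns)  # ["price", "quantity"]
```

Here `sales_tax` duplicates the information in `price`, matching the redundant-feature example, while `quantity` is uncorrelated with `price` and is kept. No class label is used, so this is an unsupervised selection.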
Sampling
 Sampling: obtaining a small sample s to represent
the whole data set N.
 Allows a mining algorithm to run with a complexity that is
potentially sub-linear in the size of the data.
Sampling
 Key principle: Choose a representative subset of the
data.
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling.
Sampling
[Figure: the same data set shown at three sample sizes: 8000 points, 2000 points and 500 points]
Types of Sampling
 Sampling without replacement:
 Once an object is selected, it is removed from the population.
 Sampling with replacement
 A selected object is not removed from the population.
 Stratified sampling:
 Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
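The first two sampling types map directly onto Python's standard `random` module. The wrapper functions and the `seed` parameter in this sketch are assumptions added so that the example is repeatable.

```python
import random

def sample_without_replacement(population, n, seed=None):
    """Each object can be selected at most once."""
    return random.Random(seed).sample(population, n)

def sample_with_replacement(population, n, seed=None):
    """A selected object stays in the population and may be drawn again."""
    return random.Random(seed).choices(population, k=n)

population = list(range(100))
unique_draws = sample_without_replacement(population, 10, seed=0)
repeated_draws = sample_with_replacement(population, 150, seed=0)
```

Sampling without replacement can never return more objects than the population holds, whereas sampling with replacement can, as the 150 draws from 100 objects show.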
Types of Sampling
[Figures: raw data compared with a sample drawn without replacement, a sample drawn with replacement, and a cluster/stratified sample]
Types of Sampling
Stratified sample by Age (number of records per stratum):
Age          Raw data  Sample
Young        4         2
Middle-age   7         4
Senior       2         1
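Stratified sampling on the Age example (4 Young, 7 Middle-age and 2 Senior records) can be sketched as below. The function name, the 50% sampling fraction and the rule of rounding each stratum's share up are assumptions. They are chosen because they reproduce the sample sizes of 2, 4 and 1 shown in the example.

```python
import math
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Draw about the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        # Rounding up guarantees that small strata are still represented.
        n = math.ceil(len(members) * fraction)
        sample.extend(rng.sample(members, n))
    return sample

ages = ["Young"] * 4 + ["Middle-age"] * 7 + ["Senior"] * 2
picked = stratified_sample(ages, key=lambda a: a, fraction=0.5, seed=1)
counts = Counter(picked)  # Young: 2, Middle-age: 4, Senior: 1
```

A simple random sample of 7 records could miss the Senior group entirely, while the stratified sample always includes at least one record from it.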
Discretization
 Discretization, also called “binning”, is very useful
for generating a summary of data.
 It does not use the class information.
 Suppose we have the following set of values for the
attribute AGE: 0, 4, 12, 16, 16, 18, 24, 26, 28.
 Two possible ways in which binning can be applied
are equi-width binning and equi-frequency binning.
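The two binning schemes can be compared on the AGE values with a short sketch. Splitting into three bins is an assumption made for illustration, as are the function names. The equi-width version also assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
by_width = equal_width_bins(ages, 3)          # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
by_frequency = equal_frequency_bins(ages, 3)  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

Equi-width bins cover equal intervals of about 9.3 years but hold different numbers of values, while equi-frequency bins each hold three values but span intervals of different lengths.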
Next:
Practical Part

 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 

Data preparation and processing chapter 2

  • 1. Data preparation and processing Mahmoud Rafeek Alfarra http://mfarra.cst.ps University College of Science & Technology- Khan yonis Development of computer systems 2016 Chapter 2 – Lecture 1
  • 2. Outline  Introduction  Domain Expert  Goal identification and Data Understanding  Data Cleaning  Missing values  Noisy Data  Inconsistent Data  Data Integration  Data Transformation  Data Reduction  Feature Selection  Sampling  Discretization
  • 4. Introduction  Real-world databases used in data mining may have millions of records and thousands of variables. They are typically noisy and contain missing and inconsistent values.  Data quality is a key issue in data mining, so data preparation is a necessary step for serious, effective, real-world data mining.
  • 5. Introduction To increase the accuracy of mining, data preprocessing has to be performed. Otherwise: garbage in, garbage out. Data preparation is estimated to take 70-80% of the time and effort.
  • 6. Domain Expertise  Data quality expert: “We found these strange records in your database after running sophisticated algorithms!”  Domain Experts: “Oh, those apples - we put them in the same baskets as oranges because there are too few apples to bother. Not a big deal. We knew that already.”
  • 7. Domain Expertise Domain expertise is important for understanding the data and the problem, and for interpreting the results. “The counter resets to 0 if the number of calls exceeds N.” “The missing values are represented by 0, but the default billed amount is 0 too.” Insufficient domain expertise is a primary cause of poor data quality, which can leave the data unusable.
  • 8. Goal Identification  To obtain the highest benefit from data mining, there must be a clear statement of the business objectives.  The first and most important step in any targeting- model project is to establish a clear goal and develop a process to achieve that goal.
  • 9. Goal Identification  Examples of goals for a business:  You want to attract new customers.  You want to avoid high-risk customers.  You want to understand the characteristics of your current customers.  You want to make your unprofitable customers more profitable.  You want to retain your profitable customers.  You want to win back your lost customers.  You want to improve customer satisfaction.  You want to increase sales.  You want to reduce expenses.
  • 10. Data Understanding  Starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, and to discover first insights into the data.
  • 11. Data Understanding Data Understanding: Relevance:  What data is available for the task?  Is this data relevant?  Is additional relevant data available?  How much historical data is available?  Who is the data expert ?
  • 12. Data Understanding Data Understanding: Quantity  Number of instances (records)  Rule of thumb: 5,000 or more desired  if less, results are less reliable;  Number of attributes (fields)  Rule of thumb: for each field, 10 or more instances  If more fields, use feature reduction and selection  Number of targets  Rule of thumb: >100 for each class  if very unbalanced, use stratified sampling
  • 13. Data Cleaning Goal identification & Data Understanding Data Cleaning Data Integration Data Transformation Data Reduction
  • 14. Data Cleaning (example table: rows are objects, columns are attributes)
        Tid  Refund  Marital Status  Taxable Income  Cheat
        1    Yes     (empty)         125K            No
        2    No      Married         100K            No
        3    No      Single          70K             No
        4    Yes     Married         120K            No
        5    No      Divorced        -95k            Yes
        6    No      Married         60K             No
        7    Yes     (empty)         220K            No
        8    No      Single          85K             Yes
        9    No      Married         75K             No
        10   No      Single          90K             Yes
  • 15. Data Cleaning  Real-world data tends to be incomplete, noisy and inconsistent.  Data Cleaning Steps  Missing values  Noisy Data  Inconsistent Data
  • 16. Missing values  A missing value (MV) is an empty cell in the table that represents a dataset. [Figure: table of instances (rows) and attributes (columns) with the missing cell marked “?”]
  • 17. Dealing with missing values 1. Ignore records with missing values:  This is usually done when the class label is missing.  This method is not effective, unless the record contains several attributes with missing values.
  • 18. Dealing with missing values 2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible for a large data set with many missing values. 3. Fill in the missing value with a global constant: Replace all missing values with the same constant, such as “unknown”. Although this method is simple, it is not recommended, because results containing “unknown” values are not “interesting”.
  • 19. Dealing with missing values 4. Use the attribute mean to fill missing values: For example, if the mean of the attribute income is 28000, use this value to replace the missing values. 5. Use the attribute mean for all samples belonging to the same class: For example, if classifying customers according to credit risk, replace the missing value with the mean income of customers in the same credit risk category as the given record.
  • 20. Dealing with missing values 6. Use an advanced method, such as k-nearest neighbors or a decision tree, to predict the missing value from the other values.
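Methods 4 and 5 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the `income` list and the `risk` labels are made-up examples, and `None` is assumed to mark a missing value.

```python
from statistics import mean

def fill_with_mean(values):
    """Method 4: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Method 5: replace None with the mean of the observed values
    that share the record's class label."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    # Assumption: every class has at least one observed value.
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income with two missing entries, credit risk as class.
income = [20000, None, 36000, 50000, None, 70000]
risk = ["high", "high", "high", "low", "low", "low"]

overall = fill_with_mean(income)
per_class = fill_with_class_mean(income, risk)
```

The overall mean ignores the class, so both gaps receive the same value, while the class mean gives the high-risk and low-risk records different fill-in values.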
  • 21. Dealing with missing values k nearest neighbors Approach Compute the k nearest neighbors and assign a value from them.
  • 22. Dealing with missing values k nearest neighbors Approach  For nominal values, use the most common value among all neighbors.  For numerical values, use the average value.  This requires a proximity measure between instances, such as Euclidean distance.
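A minimal sketch of the k nearest neighbors approach, assuming that the neighbors are complete records described by numeric features and that Euclidean distance is the proximity measure. The function name `knn_fill` and the sample records are hypothetical.

```python
import math
from collections import Counter

def knn_fill(known, query, k=3):
    """Predict a missing value from the k nearest complete records.

    known: list of (features, value) pairs without missing entries.
    query: numeric feature tuple of the record whose value is missing.
    """
    # Sort complete records by Euclidean distance to the query.
    nearest = sorted(known, key=lambda fv: math.dist(fv[0], query))[:k]
    values = [v for _, v in nearest]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)          # numerical: average
    return Counter(values).most_common(1)[0][0]   # nominal: most common

# Hypothetical records: features are (age, years_as_customer).
known_income = [((25, 1), 30000), ((27, 1), 32000),
                ((26, 2), 34000), ((60, 5), 90000)]
known_status = [((25, 1), "Single"), ((27, 1), "Single"),
                ((26, 2), "Married"), ((60, 5), "Married")]

income_guess = knn_fill(known_income, (26, 1), k=3)
status_guess = knn_fill(known_status, (26, 1), k=3)
```

In practice the features would usually be normalized first, since an attribute with a large range would otherwise dominate the distance.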
  • 24. Data preparation and processing Mahmoud Rafeek Alfarra http://mfarra.cst.ps University College of Science & Technology- Khan yonis Development of computer systems 2016 Chapter 2 – Lecture 2
  • 27. Noisy Data  Noise is a random error in a measured variable.  Noisy data is meaningless data.  Any data that has been received, stored or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy.
  • 28. Noisy Data  Sources of noisy data: 1. Data entry problems. 2. Faulty data collection instruments. 3. Data transmission.
  • 29. How to handle noisy data?  Binning method  Clustering  Combined computer and human inspection  Regression
  • 30. How to handle noisy data ?  Binning method: 1. Sort data 2. Partition into equal-frequency groups. 3. One can smooth by group means, smooth by group median, smooth by group boundaries, etc.
  • 31. How to handle noisy data ? Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equal-frequency) groups: -G1: 4, 8, 9, 15 -G2: 21, 21, 24, 25 -G3: 26, 28, 29, 34 Smoothing by bin means: -G1: 9, 9, 9, 9 -G2: 23, 23, 23, 23 -G3: 29, 29, 29, 29 Smoothing by bin boundaries: -G1: 4, 4, 4, 15 -G2: 21, 21, 25, 25 -G3: 26, 26, 26, 34
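The price example above can be reproduced with a short Python sketch. The function names are illustrative, bin means are rounded to the nearest integer as on the slide, and a value exactly halfway between the two boundaries is assumed to go to the lower one.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins groups of equal size
    (assumes the number of values is divisible by n_bins)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
groups = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(groups)
by_bounds = smooth_by_boundaries(groups)
```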
  • 32. How to handle noisy data ? Clustering: Outliers may be detected by clustering, where similar values are organized into groups, values that fall outside the set of clusters may be considered outliers.
  • 33. How to handle noisy data ?  Combined computer and human inspection: Outliers may be identified by having the computer detect suspicious values, which are then checked by a human.
  • 34. How to handle noisy data ?  Regression: Data can be smoothed by fitting the data to a function.
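As one possible instance of smoothing by regression, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the fitted one. The choice of a linear function, the function name and the sample data are assumptions; the slides only say that the data is fitted to a function.

```python
def linear_smooth(xs, ys):
    """Fit y = intercept + slope * x by least squares and
    return the fitted (smoothed) y value for every x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx            # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]

# Hypothetical noisy measurements that roughly follow a line.
xs = [1, 2, 3, 4, 5]
noisy = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed = linear_smooth(xs, noisy)

# Data that already lies on a line is left unchanged by the fit.
exact = [2 * x + 1 for x in xs]
refit = linear_smooth(xs, exact)
```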
  • 35. Inconsistent Data  Data that is inconsistent with our models should be dealt with.  Common sense can also be used to detect this kind of inconsistency: The same name occurring differently in an application. Different names that appear to be the same (Dennis vs. Denis). Inappropriate values (males being pregnant, or a negative age). A coding change over time (rating was “1, 2, 3”, now rating is “A, B, C”). Differences between duplicate records.
  • 36. Inconsistent Data  We want to transform all dates to the same format internally  Some systems accept dates in many formats  e.g. “Sep 24, 2003” , 9/24/03, 24.09.03, etc  dates are transformed internally to a standard value  Frequently, just the year (YYYY) is sufficient  For more details, we may need the month, the day, the hour, etc  Representing date as YYYYMM or YYYYMMDD can be OK.
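The date formats listed above can be converted to YYYYMMDD with a sketch like the following. The list of accepted formats and their order are assumptions (slashes read as month/day/year, dots as day.month.year), and `%b` relies on English month abbreviations, which depends on the locale.

```python
from datetime import datetime

# Assumed input formats, tried in order.
FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

def to_yyyymmdd(text):
    """Convert a date string in one of the assumed formats to YYYYMMDD."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # this format does not match, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

standardized = [to_yyyymmdd(d) for d in ["Sep 24, 2003", "9/24/03", "24.09.03"]]
```

A string such as 03/04/05 is ambiguous between these conventions, so the source system's convention has to be known before converting.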
  • 37. Data Integration Goal identification & Data Understanding Data Cleaning Data Integration Data Transformation Data Reduction
  • 38. Data Integration  Combines data from multiple sources into a coherent store.  Increasingly, data mining projects require data from more than one data source, such as multiple databases, data warehouses, flat files and historical data.
  • 39. Data Integration  Data is stored in many systems across the enterprise and outside the enterprise. The sources of data fall into two categories:  Internal sources that are generated through enterprise activities, such as databases, historical data, Web sites and warehouses.  External sources, such as credit bureaus, phone companies and demographic information.
  • 40. Data Integration  Data warehouse: a structure that links information from two or more databases.  A data warehouse brings data from different data sources into a central repository.  It performs some data integration, clean-up and summarization, and distributes the information to data marts.
  • 43. Data preparation and processing Mahmoud Rafeek Alfarra http://mfarra.cst.ps University College of Science & Technology- Khan yonis Development of computer systems 2016 Chapter 2 – Lecture 3
  • 47. Data Transformation  Definition 1: Transform the data into a form appropriate for a given data mining method.  Definition 2: Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system.
  • 48.  Methods include:  Smoothing  Aggregation  Generalization  Normalization (min-max) Data Transformation
  • 49. Methods of Data Transformation  Normalization: The attributes are scaled so as to fall within a small specified range, such as -1.0 to 1.0.
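A minimal sketch of min-max normalization, which maps a value v to new_min + (v - min) / (max - min) * (new_max - new_min). The function name and the sample values are illustrative, and the default target range of -1.0 to 1.0 follows the example range mentioned above.

```python
def min_max(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so that min(values) maps to new_min
    and max(values) maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant attribute")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

raw = [10, 20, 30, 50]        # hypothetical attribute values
scaled = min_max(raw)         # rescaled to the range [-1.0, 1.0]
unit = min_max(raw, 0.0, 1.0) # rescaled to the range [0.0, 1.0]
```

Because the result depends on the observed minimum and maximum, a single extreme outlier compresses all other values into a narrow part of the range.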
  • 50. How to handle noisy data ?
  • 52. Data preparation and processing Mahmoud Rafeek Alfarra http://mfarra.cst.ps University College of Science & Technology- Khan yonis Development of computer systems 2016 Chapter 2 – Lecture 4
  • 54. Introduction Goal identification and Data Understanding Data Cleaning Data Integration Data Transformation Data Reduction
  • 57. Data Reduction  A warehouse may store terabytes of data: complex data analysis or mining may take a very long time to run on the complete data set.  Data reduction: Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
  • 58.  The choice of data representation, and selection, reduction or transformation of features is probably the most important issue that determines the quality of a data-mining solution. Data Reduction
  • 59.  The three basic operations in a data-reduction process are:  Delete a column (feature selection).  Delete a row (sampling).  Reduce the number of values in a column (Discretization). Data Reduction
  • 60. Feature Selection  We want to choose features (attributes) that are relevant to our data-mining application in order to achieve maximum performance with the minimum measurement and processing effort.
  • 61. Feature Selection 1. Redundant features  Duplicate much or all of the information contained in one or more other attributes  E.g., purchase price of a product and the amount of sales tax paid.
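One way to spot a redundant pair such as purchase price and sales tax is to measure how strongly the two attributes are correlated. The sketch below uses the Pearson correlation coefficient; the 0.95 threshold, the 8% tax rate and the sample data are assumptions for illustration, not part of the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def is_redundant(xs, ys, threshold=0.95):
    """Flag a pair of attributes whose absolute correlation is very high
    (threshold is an assumed cut-off)."""
    return abs(pearson(xs, ys)) > threshold

price = [10.0, 25.0, 40.0, 80.0, 120.0]
sales_tax = [0.08 * p for p in price]   # hypothetical 8% tax rate
units = [3, 1, 4, 1, 5]                 # hypothetical unrelated attribute

tax_redundant = is_redundant(price, sales_tax)
units_redundant = is_redundant(price, units)
```

Correlation only captures linear relationships, so a low value does not prove that two attributes carry independent information.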
  • 62. Feature Selection 2. Irrelevant features  Contain no information that is useful for the data mining task at hand. E.g., students' ID is often irrelevant to the task of predicting students' GPA.
  • 63. Feature Selection 3. Selecting Most Relevant Fields  If there are too many fields, select a subset that is most relevant. Can select top N fields using some computations. What is good N?  Rule of thumb -- keep top 50 fields
  • 64. Feature Selection  Two types of feature selection  Unsupervised: Reduce fields without knowing class label. Supervised: Select fields with respect to class label.
  • 65. Sampling  Sampling: Obtaining a small sample s to represent the whole data set N. It allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data.
  • 66. Sampling  Key principle: Choose a representative subset of the data.  Simple random sampling may have very poor performance in the presence of skew  Develop adaptive sampling methods, e.g., stratified sampling.
  • 67. Sampling 8000 points 2000 Points 500 Points Sample Size
  • 68. Types of Sampling  Sampling without replacement:  Once an object is selected, it is removed from the population.  Sampling with replacement  A selected object is not removed from the population.  Stratified sampling:  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
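The three types of sampling can be sketched with Python's `random` module. The function names, the 10% sampling fraction and the skewed example data (90 "No" records and 10 "Yes" records) are assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, n, rng):
    """Each object can be selected at most once."""
    return rng.sample(data, n)

def sample_with_replacement(data, n, rng):
    """A selected object stays in the population and may be drawn again."""
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, fraction, rng):
    """Partition data by key(item) and draw roughly the same
    fraction from every partition (at least one item each)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [("No", i) for i in range(90)] + [("Yes", i) for i in range(10)]

without = sample_without_replacement(data, 10, rng)
with_repl = sample_with_replacement(data, 10, rng)
strat = stratified_sample(data, key=lambda r: r[0], fraction=0.1, rng=rng)
```

With skewed data like this, a simple random sample of 10 records may contain no "Yes" record at all, whereas the stratified sample always keeps the 9:1 class proportion.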
  • 69. Types of Sampling(Sampling without replacement) Raw Data
  • 70. Types of Sampling(Sampling with replacement) Raw Data
  • 71. Types of Sampling Raw Data Cluster/Stratified Sample
  • 73. Discretization  Discretization is very useful for generating a summary of data, also called “binning”.  It does not use the class information.  Suppose we have the following set of values for the attribute - AGE : 0, 4, 12, 16, 16, 18, 24, 26, 28. Two possible ways in which Binning can be applied are: Equi-width binning or Equi-frequency binning .
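Both binning schemes can be sketched for the AGE values above. The choice of three bins is an assumption made for this illustration, and the function names are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width
    and assign each value to its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins   # assumes the values are not all identical
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value falls on the upper edge, so clamp its index.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups with (almost) the
    same number of values; any remainder goes into the last bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])
    return bins

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
by_width = equal_width_bins(ages, 3)
by_frequency = equal_frequency_bins(ages, 3)
```

Equal-width bins can end up with very different numbers of values, while equal-frequency bins keep the counts balanced at the cost of uneven interval widths.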