SlideShare uma empresa Scribd logo
1 de 20
By
C.Kayathri
Student at ANJAC
Sivakasi

1
Data Preprocessing


Definition



Why preprocess the data?



Data cleaning



Data integration and transformation



Data reduction



Summary
2
Definition


Data preprocessing is a data
mining technique that involves
transforming raw data into an
understandable format.

3
Why Data Preprocessing?


Data in the real world is dirty
incomplete: lacking attribute values, lacking

certain attributes of interest, or containing only
aggregate
data
○ e.g., occupation=“ ”

noisy: containing errors or outliers
○ e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or

names
○ e.g., Age=“42” Birthday=“03/07/1997”
○ e.g., Was rating “1,2,3”, now rating “A, B, C”
○ e.g., discrepancy between duplicate records
4
Why Is Data Preprocessing Important?


No quality data, no quality mining results!
Quality decisions must be based on quality data
○ e.g., duplicate or missing data may cause incorrect

or even misleading statistics.
Data warehouse needs consistent integration of

quality data


Data extraction, cleaning, and transformation
comprises the majority of the work of building a data
warehouse

5
Major Tasks in Data Preprocessing





Data cleaning
Data integration
Data transformation
Data reduction

6
Forms of Data Preprocessing

7
Data Cleaning


Importance
“Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data
warehousing”—DCI survey



Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration

8
Missing Data


Data is not always available
 E.g., many tuples have no recorded value for several

attributes, such as customer income in sales data


Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of

entry
 not register history or changes of the data


Missing data may need to be inferred.
9
Noisy Data





Noise: random error or variance in a measured
variable
Incorrect attribute values may due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which requires data cleaning
duplicate records
incomplete data
inconsistent data
10
How to Handle Noisy Data?






Binning
first sort data and partition into (equal-frequency)
bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
11
Simple Discretization Methods: Binning


Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,

the width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate

presentation
 Skewed data is not handled well


Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately

same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
12
Binning Methods for Data
Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34


13
Regression
y
Y1

y=x+1

Y1’

x

X1

14
Cluster Analysis

15
Data Integration







Data integration:
Combines data from multiple sources into a coherent
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different
scales, e.g., metric vs. British units

16
Data Transformation


Smoothing: remove noise from data



Aggregation: summarization, data cube construction



Generalization: concept hierarchy climbing



Normalization: scaled to fall within a small, specified
range
min-max normalization
z-score normalization
normalization by decimal scaling



Attribute/feature construction
New attributes constructed from the given ones
17
Data Reduction






Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run
on the complete data set
Data reduction
 Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the
same) analytical results
Data reduction strategies
 Data cube aggregation:
 Dimensionality reduction — e.g., remove unimportant
attributes
 Data Compression
 Numerosity reduction — e.g., fit data into models
 Discretization and concept hierarchy generation
18
Summary


Data preparation or preprocessing is a big issue for both
data warehousing and data mining



Descriptive data summarization is need for quality data
preprocessing



Data preparation includes
Data cleaning and data integration
Data reduction and feature selection



A lot a methods have been developed but data
preprocessing still an active area of research

19
an
Th

ou
y
k

20

Mais conteúdo relacionado

Mais procurados

03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
purnimatm
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Tony Nguyen
 

Mais procurados (20)

Data preprocess
Data preprocessData preprocess
Data preprocess
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 

Destaque

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Slideshare
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
suganmca14
 
Data pre processing
Data pre processingData pre processing
Data pre processing
pommurajopt
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ksamyMCA
 

Destaque (16)

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Adaptive pre-processing for streaming data
Adaptive pre-processing for streaming dataAdaptive pre-processing for streaming data
Adaptive pre-processing for streaming data
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning
 
Pre processing big data
Pre processing big dataPre processing big data
Pre processing big data
 
Advance Data Mining Project Report
Advance Data Mining Project ReportAdvance Data Mining Project Report
Advance Data Mining Project Report
 
Summer School on Big Data Information Visulisation
Summer School on Big Data Information Visulisation Summer School on Big Data Information Visulisation
Summer School on Big Data Information Visulisation
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data discretization
Data discretizationData discretization
Data discretization
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Normalization
NormalizationNormalization
Normalization
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Data Processing
Data ProcessingData Processing
Data Processing
 

Semelhante a Data preprocessing

Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
Shree Hari
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
saranya12345
 
Data preperation
Data preperationData preperation
Data preperation
Fraboni Ec
 
Data preparation
Data preparationData preparation
Data preparation
James Wong
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
ImXaib
 

Semelhante a Data preprocessing (20)

Data processing
Data processingData processing
Data processing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdf
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Data Mining
Data MiningData Mining
Data Mining
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
1.6.data preprocessing
1.6.data preprocessing1.6.data preprocessing
1.6.data preprocessing
 
Cs501 data preprocessingdw
Cs501 data preprocessingdwCs501 data preprocessingdw
Cs501 data preprocessingdw
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
 
Preprocess
PreprocessPreprocess
Preprocess
 

Último

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Último (20)

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 

Data preprocessing

  • 2. Data Preprocessing  Definition  Why preprocess the data?  Data cleaning  Data integration and transformation  Data reduction  Summary 2
  • 3. Definition  Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. 3
  • 4. Why Data Preprocessing?  Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data ○ e.g., occupation=“ ” noisy: containing errors or outliers ○ e.g., Salary=“-10” inconsistent: containing discrepancies in codes or names ○ e.g., Age=“42” Birthday=“03/07/1997” ○ e.g., Was rating “1,2,3”, now rating “A, B, C” ○ e.g., discrepancy between duplicate records 4
  • 5. Why Is Data Preprocessing Important?  No quality data, no quality mining results! Quality decisions must be based on quality data ○ e.g., duplicate or missing data may cause incorrect or even misleading statistics. Data warehouse needs consistent integration of quality data  Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse 5
  • 6. Major Tasks in Data Preprocessing     Data cleaning Data integration Data transformation Data reduction 6
  • 7. Forms of Data Preprocessing 7
  • 8. Data Cleaning  Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data cleaning is the number one problem in data warehousing”—DCI survey  Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration 8
  • 9. Missing Data  Data is not always available  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data  Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data  Missing data may need to be inferred. 9
  • 10. Noisy Data    Noise: random error or variance in a measured variable Incorrect attribute values may due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention Other data problems which requires data cleaning duplicate records incomplete data inconsistent data 10
  • 11. How to Handle Noisy Data?     Binning first sort data and partition into (equal-frequency) bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Regression smooth by fitting the data into regression functions Clustering detect and remove outliers Combined computer and human inspection detect suspicious values and check by human (e.g., deal with possible outliers) 11
  • 12. Simple Discretization Methods: Binning  Equal-width (distance) partitioning  Divides the range into N intervals of equal size: uniform grid  if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N.  The most straightforward, but outliers may dominate presentation  Skewed data is not handled well  Equal-depth (frequency) partitioning  Divides the range into N intervals, each containing approximately same number of samples  Good data scaling  Managing categorical attributes can be tricky 12
  • 13. Binning Methods for Data Smoothing Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34  13
  • 16. Data Integration     Data integration: Combines data from multiple sources into a coherent store Schema integration: e.g., A.cust-id ≡ B.cust-# Integrate metadata from different sources Entity identification problem: Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton Detecting and resolving data value conflicts For the same real world entity, attribute values from different sources are different Possible reasons: different representations, different scales, e.g., metric vs. British units 16
  • 17. Data Transformation  Smoothing: remove noise from data  Aggregation: summarization, data cube construction  Generalization: concept hierarchy climbing  Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling  Attribute/feature construction New attributes constructed from the given ones 17
  • 18. Data Reduction    Why data reduction?  A database/data warehouse may store terabytes of data  Complex data analysis/mining may take a very long time to run on the complete data set Data reduction  Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results Data reduction strategies  Data cube aggregation:  Dimensionality reduction — e.g., remove unimportant attributes  Data Compression  Numerosity reduction — e.g., fit data into models  Discretization and concept hierarchy generation 18
  • 19. Summary  Data preparation or preprocessing is a big issue for both data warehousing and data mining  Descriptive data summarization is need for quality data preprocessing  Data preparation includes Data cleaning and data integration Data reduction and feature selection  A lot a methods have been developed but data preprocessing still an active area of research 19