SlideShare uma empresa Scribd logo
1 de 13
WHY DATA PREPROCESSING?
Data in the real world is dirty

 incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate
data
 e.g., occupation=“”

 noisy: containing errors or outliers
 e.g., Salary=“-10”

 inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
MAJOR TASKS IN DATA PREPROCESSING

Data cleaning

 Fill in missing values, smooth noisy data, identify or remove
outliers and noisy data, and resolve inconsistencies

Data integration

 Integration of multiple databases, or files

Data transformation

 Normalization and aggregation

Data reduction

 Obtains reduced representation in volume but produces the same
or similar analytical results

Data discretization (for numerical data)
DATA CLEANING
Importance

 “Data cleaning is the number one problem in data
warehousing”

Data cleaning tasks – this routine attempts to
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
MISSING DATA
Data is not always available
 E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data

Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 not register history or changes of the data
DATA INTEGRATION
Data integration:
 combines data from multiple sources(data cubes, multiple db or
flat files)
Issues during data integration
 Schema integration
 integrate metadata (about the data) from different sources
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id B.cust-#(same entity?)
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different, e.g., different scales, metric vs. British
units
 Removing duplicates and redundant data
 An attribute can be derived from another table (annual revenue)
 Inconsistencies in attribute naming
DATA TRANSFORMATION
Smoothing: remove noise from data
(binning, clustering, regression)
Normalization: scaled to fall within a small, specified range
such as –1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
 New attributes constructed / added from the given ones
Aggregation: summarization or aggregation operations
apply to data
Generalization: concept hierarchy climbing
 Low level/ primitive/raw data are replace by higher level
concepts
DATA REDUCTION STRATEGIES
Data is too big to work with – may takes
time, impractical or infeasible analysis
Data reduction techniques
 Obtain a reduced representation of the data set that
is much smaller in volume but yet produce the same
(or almost the same) analytical results

Data reduction strategies
 Data cube aggregation – apply aggregation
operations (data cube)
CLUSTERING
Partition data set into clusters, and one can store cluster representation
only
Can be very effective if data is clustered but not if data is “smeared”/

spread
There are many choices of clustering definitions and clustering algorithms.
We will discuss them later.
SAMPLING
Data reduction technique
 A large data set to be represented by much smaller
random sample or subset.

4 types
 Simple random sampling without replacement
(SRSWOR).
 Simple random sampling with replacement
(SRSWR).
 Develop adaptive sampling methods such as cluster
sample and stratified sample
DISCRETIZATION AND CONCEPT HIERARCHY
Discretization
 reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values

Concept hierarchies
 reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age)
by higher level concepts (such as young, middle-aged, or
senior)
SOME TECHNIQUES

-Binning methods – equal-width, equal-frequency
-Histogram
- Entropy-based methods
SUMMARY
Data preparation is a big issue for data
mining

Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization

Many methods have been proposed but
still an active area of research

Mais conteúdo relacionado

Mais procurados

03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
purnimatm
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
kayathri02
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
bhagathk
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

Mais procurados (14)

Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing in Data Mining
Data preprocessing  in Data MiningData preprocessing  in Data Mining
Data preprocessing in Data Mining
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Destaque (16)

Photos for Prohibition
Photos for ProhibitionPhotos for Prohibition
Photos for Prohibition
 
Herramientas 2
Herramientas 2Herramientas 2
Herramientas 2
 
Patrick2
Patrick2Patrick2
Patrick2
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
Music video analysis
Music video analysisMusic video analysis
Music video analysis
 
Chapter 6(1)
Chapter 6(1)Chapter 6(1)
Chapter 6(1)
 
Chapter 4 (1)
Chapter 4 (1)Chapter 4 (1)
Chapter 4 (1)
 
Gun control
Gun controlGun control
Gun control
 
Chapter 1(1)
Chapter 1(1)Chapter 1(1)
Chapter 1(1)
 
Morae - Türkçe
Morae - TürkçeMorae - Türkçe
Morae - Türkçe
 
Fys video analysis (5)
Fys video analysis (5)Fys video analysis (5)
Fys video analysis (5)
 
El comercio exterior_en_mexico
El comercio exterior_en_mexicoEl comercio exterior_en_mexico
El comercio exterior_en_mexico
 
Amin aminoaxit
Amin aminoaxitAmin aminoaxit
Amin aminoaxit
 
Chapter 2(1)
Chapter 2(1)Chapter 2(1)
Chapter 2(1)
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Chuyên đề.este
Chuyên đề.esteChuyên đề.este
Chuyên đề.este
 

Semelhante a Data preprocessing (20)

Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Preprocess
PreprocessPreprocess
Preprocess
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdf
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preparation
Data preparationData preparation
Data preparation
 

Último

Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
fonyou31
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Último (20)

Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

Data preprocessing

  • 1.
  • 2. WHY DATA PREPROCESSING? Data in the real world is dirty  incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data  e.g., occupation=“”  noisy: containing errors or outliers  e.g., Salary=“-10”  inconsistent: containing discrepancies in codes or names  e.g., Age=“42” Birthday=“03/07/1997”  e.g., Was rating “1,2,3”, now rating “A, B, C”  e.g., discrepancy between duplicate records
  • 3. MAJOR TASKS IN DATA PREPROCESSING Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies Data integration  Integration of multiple databases, or files Data transformation  Normalization and aggregation Data reduction  Obtains reduced representation in volume but produces the same or similar analytical results Data discretization (for numerical data)
  • 4. DATA CLEANING Importance  “Data cleaning is the number one problem in data warehousing” Data cleaning tasks – this routine attempts to  Fill in missing values  Identify outliers and smooth out noisy data  Correct inconsistent data  Resolve redundancy caused by data integration
  • 5. MISSING DATA Data is not always available  E.g., many tuples have no recorded values for several attributes, such as customer income in sales data Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data
  • 6. DATA INTEGRATION Data integration:  combines data from multiple sources(data cubes, multiple db or flat files) Issues during data integration  Schema integration  integrate metadata (about the data) from different sources  Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-#(same entity?)  Detecting and resolving data value conflicts  for the same real world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units  Removing duplicates and redundant data  An attribute can be derived from another table (annual revenue)  Inconsistencies in attribute naming
  • 7. DATA TRANSFORMATION Smoothing: remove noise from data (binning, clustering, regression) Normalization: scaled to fall within a small, specified range such as –1.0 to 1.0 or 0.0 to 1.0 Attribute/feature construction  New attributes constructed / added from the given ones Aggregation: summarization or aggregation operations apply to data Generalization: concept hierarchy climbing  Low level/ primitive/raw data are replace by higher level concepts
  • 8. DATA REDUCTION STRATEGIES Data is too big to work with – may takes time, impractical or infeasible analysis Data reduction techniques  Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results Data reduction strategies  Data cube aggregation – apply aggregation operations (data cube)
  • 9. CLUSTERING Partition data set into clusters, and one can store cluster representation only Can be very effective if data is clustered but not if data is “smeared”/ spread There are many choices of clustering definitions and clustering algorithms. We will discuss them later.
  • 10. SAMPLING Data reduction technique  A large data set to be represented by much smaller random sample or subset. 4 types  Simple random sampling without replacement (SRSWOR).  Simple random sampling with replacement (SRSWR).  Develop adaptive sampling methods such as cluster sample and stratified sample
  • 11. DISCRETIZATION AND CONCEPT HIERARCHY Discretization  reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values Concept hierarchies  reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)
  • 12. SOME TECHNIQUES -Binning methods – equal-width, equal-frequency -Histogram - Entropy-based methods
  • 13. SUMMARY Data preparation is a big issue for data mining Data preparation includes  Data cleaning and data integration  Data reduction and feature selection  Discretization Many methods have been proposed but still an active area of research