SlideShare uma empresa Scribd logo
1 de 15
Data Preprocessing
By
S.Dinesh Babu
II MCA
Definition
 Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
 Data in the real world is dirty
Measures for data quality: A multidimensional
view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not,
dangling, …
◦ Timeliness: timely update?
◦ Believability: how trustable the data are
correct?
◦ Interpretability: how easily the data can be
understood?
Major Tasks in Data
Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data
Discretization
Data Cleaning: Incomplete
 Data is not always available
 Ex:Age:” ”;
 Missing data may be due to
◦ equipment malfunction
◦ inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data may not be considered important at the
time of entry
Noisy Data
 Unstructured Data.
 Increases the amount of storage space .
Causes:
Hardware Failure
Programming Errors
Data Cleaning as a Process
 Missing values, noise, and inconsistencies contribute to
inaccurate data.
 The first step in data cleaning as a process is
discrepancy detection.
 Discrepancies can be caused by several factors.
 Poorly designed data entry forms
 human error in data entry
The data should also be examined regarding:
o Unique rule:
Each attribute value must be different from all other attribute
value.
o Consecutive rule
No missing values between lowest and highest values of the
attribute.
o Null rule
Specifies the use of blanks, question marks, special
characters.
Data Integration
 The merging of data from multiple data stores.
 It can help reduce, avoid redundancies and
inconsistencies.
 It improve the accuracy and speed of the subsequent
data mining process.
Data Reduction
 To obtain a reduced representation of the data set that is
much smaller in volume.
Strategies for data reduction include the following:
 Data cube aggregation, where aggregation operations
are applied to the data in the construction of a data cube.
 Attribute subset selection, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be
detected and removed.
 Dimensionality reduction, where encoding mechanisms are
used to reduce the data set size.
 Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations such as
 Parametric models
 Nonparametric methods such as clustering, sampling,
and the use of histograms.
Data Transformation
 In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
Data transformation can involve the following:
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small,
specified range
 min-max normalization
Data Discretization
 Discretization: Divide the range of a continuous attribute
into intervals
◦ Interval labels can then be used to replace actual data
values
◦ Reduce data size by Discretization
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an
attribute
◦ Prepare for further analysis, e.g., classification
 Three types of attributes
◦ Nominal—values from an unordered set, e.g., color,
profession
◦ Ordinal—values from an ordered set, e.g., military or
academic rank
◦ Numeric—real numbers, e.g., integer or real numbers
Thank You

Mais conteúdo relacionado

Mais procurados

Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingAmuthamca
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Data pre processing
Data pre processingData pre processing
Data pre processingjunnubabu
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unitbhagathk
 
Data pre processing
Data pre processingData pre processing
Data pre processingpommurajopt
 
Data preprocessing in Data Mining
Data preprocessing  in Data MiningData preprocessing  in Data Mining
Data preprocessing in Data MiningSamad Baseer Khan
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1meenas06
 
Preprocessing
PreprocessingPreprocessing
Preprocessingmmuthuraj
 

Mais procurados (18)

Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 
4 preprocess
4 preprocess4 preprocess
4 preprocess
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data preprocessing in Data Mining
Data preprocessing  in Data MiningData preprocessing  in Data Mining
Data preprocessing in Data Mining
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 

Destaque

Models of Escalation and De-escalation in Cyber Conflict
Models of Escalation and De-escalation in Cyber ConflictModels of Escalation and De-escalation in Cyber Conflict
Models of Escalation and De-escalation in Cyber ConflictZsolt Nemeth
 
Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..killerkarthic
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingvenkadesh236
 
Spss basic1
Spss basic1Spss basic1
Spss basic1UPM
 
Introduction to spss – part 1
Introduction to spss – part 1Introduction to spss – part 1
Introduction to spss – part 1Dr. Vignes Gopal
 
Descriptives & Graphing
Descriptives & GraphingDescriptives & Graphing
Descriptives & GraphingJames Neill
 
BID CE workshop 1 session 08 - Biodiversity Data Cleaning
BID CE workshop 1   session 08 - Biodiversity Data CleaningBID CE workshop 1   session 08 - Biodiversity Data Cleaning
BID CE workshop 1 session 08 - Biodiversity Data CleaningAlberto González-Talaván
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineBertram Ludäscher
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingSlideshare
 
Correlation in simple terms
Correlation in simple termsCorrelation in simple terms
Correlation in simple termsstats2analytics
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stackjeykottalam
 
Pa 298 measures of correlation
Pa 298 measures of correlationPa 298 measures of correlation
Pa 298 measures of correlationMaria Theresa
 
Correlation in physical science
Correlation in physical science Correlation in physical science
Correlation in physical science teenathankachen1993
 
Correlation VS Causation
Correlation VS CausationCorrelation VS Causation
Correlation VS CausationColleen Carmean
 
Costaatt spss presentation
Costaatt spss presentationCostaatt spss presentation
Costaatt spss presentationkesterdavid
 

Destaque (20)

Cr day 07
Cr day 07Cr day 07
Cr day 07
 
Models of Escalation and De-escalation in Cyber Conflict
Models of Escalation and De-escalation in Cyber ConflictModels of Escalation and De-escalation in Cyber Conflict
Models of Escalation and De-escalation in Cyber Conflict
 
Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..Overview of DATA PREPROCESS..
Overview of DATA PREPROCESS..
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Spss basic1
Spss basic1Spss basic1
Spss basic1
 
Preprocess
PreprocessPreprocess
Preprocess
 
Introduction to spss – part 1
Introduction to spss – part 1Introduction to spss – part 1
Introduction to spss – part 1
 
Descriptive statistics ii
Descriptive statistics iiDescriptive statistics ii
Descriptive statistics ii
 
Descriptives & Graphing
Descriptives & GraphingDescriptives & Graphing
Descriptives & Graphing
 
BID CE workshop 1 session 08 - Biodiversity Data Cleaning
BID CE workshop 1   session 08 - Biodiversity Data CleaningBID CE workshop 1   session 08 - Biodiversity Data Cleaning
BID CE workshop 1 session 08 - Biodiversity Data Cleaning
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefine
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Correlation in simple terms
Correlation in simple termsCorrelation in simple terms
Correlation in simple terms
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stack
 
Pa 298 measures of correlation
Pa 298 measures of correlationPa 298 measures of correlation
Pa 298 measures of correlation
 
Correlation
CorrelationCorrelation
Correlation
 
Correlation in physical science
Correlation in physical science Correlation in physical science
Correlation in physical science
 
Correlation VS Causation
Correlation VS CausationCorrelation VS Causation
Correlation VS Causation
 
Costaatt spss presentation
Costaatt spss presentationCostaatt spss presentation
Costaatt spss presentation
 
Basic One-Way ANOVA
Basic One-Way ANOVABasic One-Way ANOVA
Basic One-Way ANOVA
 

Semelhante a Data preprocessing

Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17AnwarrChaudary
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Editing, cleaning and coding of data in Business research methodology
Editing, cleaning and coding of data in Business research methodology Editing, cleaning and coding of data in Business research methodology
Editing, cleaning and coding of data in Business research methodology VaishaghMp
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.DurgaDeviP2
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptbelay41
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processingDatamining Tools
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data PreparationUmair Shafique
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ngsaranya12345
 
DATA CLEANING.pdf
DATA CLEANING.pdfDATA CLEANING.pdf
DATA CLEANING.pdfRumanaAykiz
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptRevathy V R
 

Semelhante a Data preprocessing (20)

Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Editing, cleaning and coding of data in Business research methodology
Editing, cleaning and coding of data in Business research methodology Editing, cleaning and coding of data in Business research methodology
Editing, cleaning and coding of data in Business research methodology
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.
 
Unit2
Unit2Unit2
Unit2
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
 
Assignmentdatamining
AssignmentdataminingAssignmentdatamining
Assignmentdatamining
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processing
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Pre processing
Pre processingPre processing
Pre processing
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
1234
12341234
1234
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
DATA CLEANING.pdf
DATA CLEANING.pdfDATA CLEANING.pdf
DATA CLEANING.pdf
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 

Último

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 

Último (20)

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 

Data preprocessing

  • 2. Definition  Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.  Data in the real world is dirty
  • 3. Measures for data quality: A multidimensional view ◦ Accuracy: correct or wrong, accurate or not ◦ Completeness: not recorded, unavailable, … ◦ Consistency: some modified but some not, dangling, … ◦ Timeliness: timely update? ◦ Believability: how trustable the data are correct? ◦ Interpretability: how easily the data can be understood?
  • 4. Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization
  • 5. Data Cleaning: Incomplete  Data is not always available  Ex:Age:” ”;  Missing data may be due to ◦ equipment malfunction ◦ inconsistent with other recorded data and thus deleted ◦ data not entered due to misunderstanding ◦ certain data may not be considered important at the time of entry
  • 6. Noisy Data  Unstructured Data.  Increases the amount of storage space . Causes: Hardware Failure Programming Errors
  • 7. Data Cleaning as a Process  Missing values, noise, and inconsistencies contribute to inaccurate data.  The first step in data cleaning as a process is discrepancy detection.  Discrepancies can be caused by several factors.  Poorly designed data entry forms  human error in data entry
  • 8. The data should also be examined regarding: o Unique rule: Each attribute value must be different from all other attribute value. o Consecutive rule No missing values between lowest and highest values of the attribute. o Null rule Specifies the use of blanks, question marks, special characters.
  • 9. Data Integration  The merging of data from multiple data stores.  It can help reduce, avoid redundancies and inconsistencies.  It improve the accuracy and speed of the subsequent data mining process.
  • 10. Data Reduction  To obtain a reduced representation of the data set that is much smaller in volume. Strategies for data reduction include the following:  Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.  Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
  • 11.  Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.  Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as  Parametric models  Nonparametric methods such as clustering, sampling, and the use of histograms.
  • 12. Data Transformation  In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:  Smoothing: remove noise from data  Aggregation: summarization, data cube construction  Generalization: concept hierarchy climbing  Normalization: scaled to fall within a small, specified range  min-max normalization
  • 13. Data Discretization  Discretization: Divide the range of a continuous attribute into intervals ◦ Interval labels can then be used to replace actual data values ◦ Reduce data size by Discretization ◦ Split (top-down) vs. merge (bottom-up) ◦ Discretization can be performed recursively on an attribute ◦ Prepare for further analysis, e.g., classification
  • 14.  Three types of attributes ◦ Nominal—values from an unordered set, e.g., color, profession ◦ Ordinal—values from an ordered set, e.g., military or academic rank ◦ Numeric—real numbers, e.g., integer or real numbers