SlideShare a Scribd company logo
1 of 14
Data
Preprocessing
HELLO!
I am Iffat Firozy
I am here because I love to teach.
2
1.
Introduction
“Data preprocessing is a data
mining technique which is used to
transform the raw data in a useful
and efficient format.”
4
WHY???
5
We preprocess the data for
better -
6
MAJOR TASK IN DATA
PREPROCESSING
7
8
DATA PREPROCESSING
DATA CLEANING
MISSING VALUES
NOISY DATA
DATA
INTEGRATION
DETECTION AND
RESOLUTION
REDUNDANT
ATTRIBUTES
SCEMA
INTREGATION
DATA
TRANSFORMATION
NORMALIZATION
ATTRIBUTE
SELECTION
DISCRITIZATION
DATA
REDUCTION
NUMEROCITY
REDUCTION
DATA
COMPRESSION
DIMONTIONALITY
REDUCTION
Data Cleaning
Missing values
This situation arises when some data is missing in the data. It can be handled in various ways.
Some of them are:
❑ Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values
are missing within a tuple.
❑ Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by
attribute mean or the most probable value.
9
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.
Continue…
Noisy Data
Noisy data is a meaningless data that can’t be interpreted by machines. It can be
generated due to faulty data collection, data entry errors etc. It can be handled in following
ways :
 Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
Each segmented is handled separately. One can replace all data in a segment by its
mean or boundary values can be used to complete the task.
❑ Regression:
Here data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
❑ Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected or it
will fall outside the clusters.
10
Data Integration
Data analysis may require a
combination of data from multiple
sources into a coherent data store.
Schema integration: CID = C_number =
Cust-id = cust#
Data value conflicts (different
representations or scales, etc.)
 Synchronization (especially important in
Web usage mining)
 Redundant attributes (redundant if it can
be derived from other attributes)
11
DATA TRANSFORMATION
12
This step is taken in order to transform the data in appropriate forms
suitable for mining process. This involves following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval
levels or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from level to higher level in hierarchy.
For Example-The attribute “city” can be converted to “country”.
DATA REDUCTION
13
Since data mining is a technique that is used to handle huge amount of data.
While working with huge volume of data, analysis became harder in such
cases. In order to get rid of this, we uses data reduction technique. It aims to
increase the storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
Numerosity Reduction:
This enable to store the model of data instead of whole data, for example:
Regression Models.
Dimensionality Reduction:
This reduce the size of data by encoding mechanisms. It can be lossy or
lossless. If after reconstruction from compressed data, original data can be
retrieved, such reduction are called lossless reduction else it is called lossy
reduction. The two effective methods of dimensionality reduction are: Wavelet
transforms and PCA (Principal Component Analysis).
14
THANKS!
Any questions?
You can find me at:
❑ ifirozy@gmail.com

More Related Content

What's hot

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
kayathri02
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ksamyMCA
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

What's hot (20)

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
data generalization and summarization
data generalization and summarization data generalization and summarization
data generalization and summarization
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?How different between Big Data, Business Intelligence and Analytics ?
How different between Big Data, Business Intelligence and Analytics ?
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSE
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Advance Database Management Systems -Object Oriented Principles In Database
Advance Database Management Systems -Object Oriented Principles In DatabaseAdvance Database Management Systems -Object Oriented Principles In Database
Advance Database Management Systems -Object Oriented Principles In Database
 
Chapter 12 outlier
Chapter 12 outlierChapter 12 outlier
Chapter 12 outlier
 

Similar to Data Preprocessing || Data Mining

Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
meenas06
 
Data pre processing
Data pre processingData pre processing
Data pre processing
pommurajopt
 

Similar to Data Preprocessing || Data Mining (20)

Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
1234
12341234
1234
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Data processing
Data processingData processing
Data processing
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
 
DataPreprocessing.pptx
DataPreprocessing.pptxDataPreprocessing.pptx
DataPreprocessing.pptx
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Preprocess
PreprocessPreprocess
Preprocess
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 

More from Iffat Firozy

More from Iffat Firozy (9)

Association Rule Mining || Data Mining
Association Rule Mining || Data MiningAssociation Rule Mining || Data Mining
Association Rule Mining || Data Mining
 
K-means Clustering || Data Mining
K-means Clustering || Data MiningK-means Clustering || Data Mining
K-means Clustering || Data Mining
 
Decision Tree || Data Mining ..
Decision Tree || Data Mining ..Decision Tree || Data Mining ..
Decision Tree || Data Mining ..
 
Data mining || Decision tree..
Data mining || Decision tree..Data mining || Decision tree..
Data mining || Decision tree..
 
Data Mining || Decision Tree
Data Mining || Decision TreeData Mining || Decision Tree
Data Mining || Decision Tree
 
Hidden Markov Model
Hidden Markov ModelHidden Markov Model
Hidden Markov Model
 
Internet of things (Iot)
Internet of things (Iot)Internet of things (Iot)
Internet of things (Iot)
 
Hospital Introducer & Direction Giving Robot.
Hospital Introducer & Direction Giving Robot.Hospital Introducer & Direction Giving Robot.
Hospital Introducer & Direction Giving Robot.
 
How to calculate SGPA & CGPA
How to calculate SGPA & CGPAHow to calculate SGPA & CGPA
How to calculate SGPA & CGPA
 

Recently uploaded

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Philosophy of china and it's charactistics
Philosophy of china and it's charactisticsPhilosophy of china and it's charactistics
Philosophy of china and it's charactistics
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 

Data Preprocessing || Data Mining

  • 2. HELLO! I am Iffat Firozy I am here because I love to teach. 2
  • 4. “Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient format.” 4
  • 6. We preprocess the data for better - 6
  • 7. MAJOR TASK IN DATA PREPROCESSING 7
  • 8. 8 DATA PREPROCESSING DATA CLEANING MISSING VALUES NOISY DATA DATA INTEGRATION DETECTION AND RESOLUTION REDUNDANT ATTRIBUTES SCEMA INTREGATION DATA TRANSFORMATION NORMALIZATION ATTRIBUTE SELECTION DISCRITIZATION DATA REDUCTION NUMEROCITY REDUCTION DATA COMPRESSION DIMONTIONALITY REDUCTION
  • 9. Data Cleaning Missing values This situation arises when some data is missing in the data. It can be handled in various ways. Some of them are: ❑ Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple. ❑ Fill the Missing values: There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or the most probable value. 9 The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling of missing data, noisy data etc.
  • 10. Continue… Noisy Data Noisy data is a meaningless data that can’t be interpreted by machines. It can be generated due to faulty data collection, data entry errors etc. It can be handled in following ways :  Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segmented is handled separately. One can replace all data in a segment by its mean or boundary values can be used to complete the task. ❑ Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). ❑ Clustering: This approach groups the similar data in a cluster. The outliers may be undetected or it will fall outside the clusters. 10
  • 11. Data Integration Data analysis may require a combination of data from multiple sources into a coherent data store. Schema integration: CID = C_number = Cust-id = cust# Data value conflicts (different representations or scales, etc.)  Synchronization (especially important in Web usage mining)  Redundant attributes (redundant if it can be derived from other attributes) 11
  • 12. DATA TRANSFORMATION 12 This step is taken in order to transform the data in appropriate forms suitable for mining process. This involves following ways: Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0) Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process. Discretization: This is done to replace the raw values of numeric attribute by interval levels or conceptual levels. Concept Hierarchy Generation: Here attributes are converted from level to higher level in hierarchy. For Example-The attribute “city” can be converted to “country”.
  • 13. DATA REDUCTION 13 Since data mining is a technique that is used to handle huge amount of data. While working with huge volume of data, analysis became harder in such cases. In order to get rid of this, we uses data reduction technique. It aims to increase the storage efficiency and reduce data storage and analysis costs. The various steps to data reduction are: Numerosity Reduction: This enable to store the model of data instead of whole data, for example: Regression Models. Dimensionality Reduction: This reduce the size of data by encoding mechanisms. It can be lossy or lossless. If after reconstruction from compressed data, original data can be retrieved, such reduction are called lossless reduction else it is called lossy reduction. The two effective methods of dimensionality reduction are: Wavelet transforms and PCA (Principal Component Analysis).
  • 14. 14 THANKS! Any questions? You can find me at: ❑ ifirozy@gmail.com