SlideShare uma empresa Scribd logo
1 de 13
PROGRAMMING CRASH COURSE
KEDS BIODESIGNS
WEEK : II
CLASS : 5
SESSION : MACHINE LEARNING DATA
MACHINE LEARNING - PREPARING DATA
Introduction
 Machine Learning algorithms are completely dependent on data because it is the most crucial aspect
that makes model training possible. On the other hand, if we won’t be able to make sense out of that
data, before feeding it to ML algorithms, a machine will be useless. In simple words, we always need to
feed right data i.e. the data in correct scale, format and containing meaningful features, for the problem
we want machine to solve.
 This makes data preparation the most important step in ML process. Data preparation may be defined as
the procedure that makes our dataset more appropriate for ML process.
PRE-PROCESSING
Why Data Pre-processing?
After selecting the raw data for ML training, the most important task is data pre-processing. In broad sense, data
preprocessing will convert the selected data into a form we can work with or can feed to ML algorithms. We
always need to preprocess our data so that it can be as per the expectation of machine learning algorithm.
Data Pre-processing Techniques
 We have the following data preprocessing techniques that can be applied on data set to produce data for ML
algorithms
Scaling
 Most probably our dataset comprises of the attributes with varying scale, but we cannot provide such data to
ML algorithm hence it requires rescaling. Data rescaling makes sure that attributes are at same scale. Generally,
attributes are rescaled into the range of 0 and 1. ML algorithms like gradient descent and k-Nearest Neighbors
requires scaled data. We can rescale the data with the help of MinMaxScaler class of scikit-learn Python library.
EXAMPLE
 In this example we will rescale the
data of Pima Indians Diabetes
dataset which we used earlier.
First, the CSV data will be loaded
(as done in the previous
chapters) and then with the help
of MinMaxScaler class, it will be
rescaled in the range of 0 and
1.The first few lines of the
following script are same as we
have written in previous chapters
while loading CSV data.
NORMALIZATION
 Another useful data preprocessing technique is Normalization. This is used to rescale each row of data
to have a length of 1. It is mainly useful in Sparse dataset where we have lots of zeros. We can rescale
the data with the help of Normalizer class of scikit-learn Python library.
 Types of Normalization
 In machine learning, there are two types of normalization preprocessing techniques as follows −
 L1 Normalization
 L2 Normalization
BINARIZATION
 As the name suggests, this is the technique with the help of which we can make our data binary. We can
use a binary threshold for making our data binary. The values above that threshold value will be
converted to 1 and below that threshold will be converted to 0.
 For example, if we choose threshold value = 0.5, then the dataset value above it will become 1 and
below this will become 0. That is why we can call it binarizing the data or thresholding the data. This
technique is useful when we have probabilities in our dataset and want to convert them into crisp values.
 We can binarize the data with the help of Binarizer class of scikit-learn Python library
BINARIZATION - EXAMPLE
 In this example, we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the
CSV data will be loaded and then with the help of Binarizer class it will be converted into binary values
i.e. 0 and 1 depending upon the threshold value. We are taking 0.5 as threshold value.
 The first few lines of following script are same as we have written in previous chapters while loading CSV
data.
RESULT
STANDARDIZATION
 Another useful data preprocessing technique which is basically used to transform the data attributes
with a Gaussian distribution. It differs the mean and SD (Standard Deviation) to a standard Gaussian
distribution with a mean of 0 and a SD of 1. This technique is useful in ML algorithms like linear
regression, logistic regression that assumes a Gaussian distribution in input dataset and produce better
results with rescaled data. We can standardize the data (mean = 0 and SD =1) with the help
of StandardScaler class of scikit-learn Python library.
EXAMPLE
 In this example, we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First,
the CSV data will be loaded and then with the help of StandardScaler class it will be converted into
Gaussian Distribution with mean = 0 and SD = 1.
 The first few lines of following script are same as we have written in previous chapters while loading
CSV data.
DATA LABELING
 We discussed the importance of good fata for ML algorithms as well as some techniques to pre-process
the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also
very important to send the data to ML algorithms having proper labeling. For example, in case of
classification problems, lot of labels in the form of words, numbers etc. are there on the data.
What is Label Encoding?
Most of the sklearn functions expect that the data with number labels rather than word labels. Hence, we
need to convert such labels into number labels. This process is called label encoding. We can perform label
encoding of data with the help of LabelEncoder() function of scikit-learn Python library.
EXAMPLE
RESULT

Mais conteúdo relacionado

Semelhante a Machine learning session 5

machine learning workflow with data input.pptx
machine learning workflow with data input.pptxmachine learning workflow with data input.pptx
machine learning workflow with data input.pptxjasontseng19
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptxLuminous8
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic rankingFELIX75
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016 Mahesh Dananjaya
 
Machine learning session 3
Machine learning session 3Machine learning session 3
Machine learning session 3NirsandhG
 
Machine learning session 4
Machine learning session 4Machine learning session 4
Machine learning session 4NirsandhG
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxKnoldus Inc.
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By SparkKnoldus Inc.
 
1414Database DesignDatabase design is the process o.docx
 1414Database DesignDatabase design is the process o.docx 1414Database DesignDatabase design is the process o.docx
1414Database DesignDatabase design is the process o.docxjoyjonna282
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaSubhasish Guha
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfAmmarAhmedSiddiqui2
 
Analysis using r
Analysis using rAnalysis using r
Analysis using rPriya Mohan
 
python's data classes.pdf
python's data classes.pdfpython's data classes.pdf
python's data classes.pdfAkash NR
 
Data Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics MethodologyData Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics MethodologyRupak Roy
 
housing price prediction ppt in artificial
housing price prediction ppt in artificialhousing price prediction ppt in artificial
housing price prediction ppt in artificialKrishPatel802536
 

Semelhante a Machine learning session 5 (20)

Data processing
Data processingData processing
Data processing
 
machine learning workflow with data input.pptx
machine learning workflow with data input.pptxmachine learning workflow with data input.pptx
machine learning workflow with data input.pptx
 
Predictive modeling
Predictive modelingPredictive modeling
Predictive modeling
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
 
PythonML.pptx
PythonML.pptxPythonML.pptx
PythonML.pptx
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic ranking
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016
 
Machine learning session 3
Machine learning session 3Machine learning session 3
Machine learning session 3
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 
Machine learning session 4
Machine learning session 4Machine learning session 4
Machine learning session 4
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By Spark
 
1414Database DesignDatabase design is the process o.docx
 1414Database DesignDatabase design is the process o.docx 1414Database DesignDatabase design is the process o.docx
1414Database DesignDatabase design is the process o.docx
 
Python and data analytics
Python and data analyticsPython and data analytics
Python and data analytics
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
 
Analysis using r
Analysis using rAnalysis using r
Analysis using r
 
python's data classes.pdf
python's data classes.pdfpython's data classes.pdf
python's data classes.pdf
 
Data Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics MethodologyData Preparation with the help of Analytics Methodology
Data Preparation with the help of Analytics Methodology
 
housing price prediction ppt in artificial
housing price prediction ppt in artificialhousing price prediction ppt in artificial
housing price prediction ppt in artificial
 

Mais de NirsandhG

Machine learning session 10
Machine learning session 10Machine learning session 10
Machine learning session 10NirsandhG
 
Machine learning session 9
Machine learning session 9Machine learning session 9
Machine learning session 9NirsandhG
 
Machine learning session 8
Machine learning session 8Machine learning session 8
Machine learning session 8NirsandhG
 
Machine learning session 7
Machine learning session 7Machine learning session 7
Machine learning session 7NirsandhG
 
Machine learning session 1
Machine learning session 1Machine learning session 1
Machine learning session 1NirsandhG
 
Augmented reality session 5
Augmented reality session 5Augmented reality session 5
Augmented reality session 5NirsandhG
 
Augmented reality session 4
Augmented reality session 4Augmented reality session 4
Augmented reality session 4NirsandhG
 
Augmented reality session 3
Augmented reality session 3Augmented reality session 3
Augmented reality session 3NirsandhG
 
Augmented reality session 2
Augmented reality session 2Augmented reality session 2
Augmented reality session 2NirsandhG
 

Mais de NirsandhG (9)

Machine learning session 10
Machine learning session 10Machine learning session 10
Machine learning session 10
 
Machine learning session 9
Machine learning session 9Machine learning session 9
Machine learning session 9
 
Machine learning session 8
Machine learning session 8Machine learning session 8
Machine learning session 8
 
Machine learning session 7
Machine learning session 7Machine learning session 7
Machine learning session 7
 
Machine learning session 1
Machine learning session 1Machine learning session 1
Machine learning session 1
 
Augmented reality session 5
Augmented reality session 5Augmented reality session 5
Augmented reality session 5
 
Augmented reality session 4
Augmented reality session 4Augmented reality session 4
Augmented reality session 4
 
Augmented reality session 3
Augmented reality session 3Augmented reality session 3
Augmented reality session 3
 
Augmented reality session 2
Augmented reality session 2Augmented reality session 2
Augmented reality session 2
 

Último

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 

Último (20)

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 

Machine learning session 5

  • 1. PROGRAMMING CRASH COURSE KEDS BIODESIGNS WEEK : II CLASS : 5 SESSION : MACHINE LEARNING DATA
  • 2. MACHINE LEARNING - PREPARING DATA Introduction  Machine Learning algorithms are completely dependent on data because it is the most crucial aspect that makes model training possible. On the other hand, if we won’t be able to make sense out of that data, before feeding it to ML algorithms, a machine will be useless. In simple words, we always need to feed right data i.e. the data in correct scale, format and containing meaningful features, for the problem we want machine to solve.  This makes data preparation the most important step in ML process. Data preparation may be defined as the procedure that makes our dataset more appropriate for ML process.
  • 3. PRE-PROCESSING Why Data Pre-processing? After selecting the raw data for ML training, the most important task is data pre-processing. In broad sense, data preprocessing will convert the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it can be as per the expectation of machine learning algorithm. Data Pre-processing Techniques  We have the following data preprocessing techniques that can be applied on data set to produce data for ML algorithms Scaling  Most probably our dataset comprises of the attributes with varying scale, but we cannot provide such data to ML algorithm hence it requires rescaling. Data rescaling makes sure that attributes are at same scale. Generally, attributes are rescaled into the range of 0 and 1. ML algorithms like gradient descent and k-Nearest Neighbors requires scaled data. We can rescale the data with the help of MinMaxScaler class of scikit-learn Python library.
  • 4. EXAMPLE  In this example we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in the previous chapters) and then with the help of MinMaxScaler class, it will be rescaled in the range of 0 and 1.The first few lines of the following script are same as we have written in previous chapters while loading CSV data.
  • 5. NORMALIZATION  Another useful data preprocessing technique is Normalization. This is used to rescale each row of data to have a length of 1. It is mainly useful in Sparse dataset where we have lots of zeros. We can rescale the data with the help of Normalizer class of scikit-learn Python library.  Types of Normalization  In machine learning, there are two types of normalization preprocessing techniques as follows −  L1 Normalization  L2 Normalization
  • 6. BINARIZATION  As the name suggests, this is the technique with the help of which we can make our data binary. We can use a binary threshold for making our data binary. The values above that threshold value will be converted to 1 and below that threshold will be converted to 0.  For example, if we choose threshold value = 0.5, then the dataset value above it will become 1 and below this will become 0. That is why we can call it binarizing the data or thresholding the data. This technique is useful when we have probabilities in our dataset and want to convert them into crisp values.  We can binarize the data with the help of Binarizer class of scikit-learn Python library
  • 7. BINARIZATION - EXAMPLE  In this example, we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then with the help of Binarizer class it will be converted into binary values i.e. 0 and 1 depending upon the threshold value. We are taking 0.5 as threshold value.  The first few lines of following script are same as we have written in previous chapters while loading CSV data.
  • 9. STANDARDIZATION  Another useful data preprocessing technique which is basically used to transform the data attributes with a Gaussian distribution. It differs the mean and SD (Standard Deviation) to a standard Gaussian distribution with a mean of 0 and a SD of 1. This technique is useful in ML algorithms like linear regression, logistic regression that assumes a Gaussian distribution in input dataset and produce better results with rescaled data. We can standardize the data (mean = 0 and SD =1) with the help of StandardScaler class of scikit-learn Python library.
  • 10. EXAMPLE  In this example, we will rescale the data of Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then with the help of StandardScaler class it will be converted into Gaussian Distribution with mean = 0 and SD = 1.  The first few lines of following script are same as we have written in previous chapters while loading CSV data.
  • 11. DATA LABELING  We discussed the importance of good fata for ML algorithms as well as some techniques to pre-process the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also very important to send the data to ML algorithms having proper labeling. For example, in case of classification problems, lot of labels in the form of words, numbers etc. are there on the data. What is Label Encoding? Most of the sklearn functions expect that the data with number labels rather than word labels. Hence, we need to convert such labels into number labels. This process is called label encoding. We can perform label encoding of data with the help of LabelEncoder() function of scikit-learn Python library.