5. Data Cleaning Tasks
•Fill in missing values
•Identify outliers and smooth out noisy data
•Correct inconsistent data
6. Missing data
•Data is not always available
•Equipment malfunction
•Inconsistent with other recorded data and thus deleted
•Data not entered
•Certain data may not be considered important at the time of entry
How to handle missing data?
•Manual entry
•Attribute mean (see the sketch below)
•Standardization
•Normalization
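As a quick illustration of attribute-mean filling, here is a minimal pandas sketch; the column name 'age' and the example values are assumptions, not data from the slides.

import pandas as pd
import numpy as np

# Hypothetical column with one missing value
df = pd.DataFrame({'age': [25, 30, np.nan, 40]})

# Fill the missing value with the attribute (column) mean
df['age'] = df['age'].fillna(df['age'].mean())
print(df)  # the NaN is replaced by about 31.67, the mean of 25, 30, 40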
7. Noisy data
•Noise: random error or variance in a measured variable
•Faulty data collection instruments
•Data entry problems
•Data transmission problems
•Technology limitations
•Inconsistency in naming conventions
•Duplicate records
Methods
•Binning method
•Clustering
•Regression
•Combined computer and human inspection
8. Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size
Width of each interval: W = (B - A)/N, where A is the lowest and B is the highest value
Limitations: outliers may dominate the result; skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing roughly the same number of samples
Gives good data scaling
Example (sorted data): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Equal-depth with N = 3: Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34
Equal-width with N = 3: width = (Max - Min)/3 = (34 - 4)/3 = 10
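A minimal sketch of both partitioning schemes with pandas (pd.cut for equal-width, pd.qcut for equal-depth); the variable names are assumptions.

import pandas as pd

data = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals, each of width (34 - 4)/3 = 10
equal_width = pd.cut(data, bins=3)

# Equal-depth: 3 intervals, each holding roughly the same number of samples
equal_depth = pd.qcut(data, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())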
9. Data Conditioning
Data conditioning includes cleaning data, normalizing datasets, and
performing transformations
Often viewed as a preprocessing step prior to data analysis, it might be
performed by the data owner, the IT department, a DBA, etc.
Best to have data scientists involved
Data science teams prefer too much data to too little
What are the data sources? Target fields?
How clean is the data? How consistent are the contents and files?
Missing or inconsistent values?
Assess the consistency of the data types – numeric, alphanumeric?
Review the contents to ensure the data makes sense
Look for evidence of systematic error
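A short pandas sketch of these assessment checks; the file name 'dataset.csv' is an assumption.

import pandas as pd

df = pd.read_csv('dataset.csv')  # hypothetical input file

# Missing or inconsistent values?
print(df.isna().sum())           # missing values per column
print(df.duplicated().sum())     # duplicate records

# Assess the consistency of the data types – numeric, alphanumeric?
print(df.dtypes)

# Review the contents to ensure the data makes sense
print(df.describe(include='all'))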
10. Survey and Visualize
Leverage data visualization tools to gain an overview of the data
Shneiderman’s mantra: “Overview first, zoom and filter, then details-on-demand”
This enables the user to find areas of interest, zoom and filter to find more
detailed information about a particular area, then find the detailed data in
that area
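A minimal sketch of the “overview first” step, assuming a DataFrame df with numeric columns; the file name and the 'salary' column are assumptions.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dataset.csv')  # hypothetical input file

# Overview first: distribution of every numeric column
df.hist(figsize=(10, 6))
plt.tight_layout()
plt.show()

# Zoom and filter: narrow in on an area of interest
subset = df[df['salary'] > 50000]
print(subset.describe())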
11. Review data to ensure calculations are consistent
Does the data distribution stay consistent?
Assess the granularity of the data, the range of values, and the level of
aggregation of the data
Does the data represent the population of interest? Check time-related variables
– daily, weekly, monthly? Is this good enough?
Is the data standardized/normalized? Scales consistent?
For geospatial datasets, are state/country abbreviations consistent?
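A short sketch of these consistency checks with pandas; the column names 'date' and 'state' are assumptions.

import pandas as pd

df = pd.read_csv('dataset.csv')  # hypothetical input file

# Range of values and level of aggregation
print(df.describe())

# Time-related variables – daily, weekly, monthly?
df['date'] = pd.to_datetime(df['date'])
print(df.groupby(df['date'].dt.to_period('M')).size())  # record counts per month

# Are state abbreviations consistent?
print(df['state'].value_counts())  # e.g. 'CA' vs 'Calif.' would both show up here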
TOOLS
Hadoop can perform parallel ingest and analysis
Alpine Miner provides a graphical user interface for creating analytic workflows
OpenRefine (formerly Google Refine) is a free, open source tool for working with
messy data
Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing and
transformation
12. Data Integration
Data integration is the process of combining data from multiple sources into a
single dataset. It is one of the main components of data management. Several
problems must be considered during data integration.
Schema integration: Integrating metadata (a set of data that describes other
data) from different sources.
Entity identification problem: Identifying the same entities across multiple
databases. For example, the system or the user should know that student_id in
one database and student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: Attribute values taken from
different databases may differ when merged. For example, the date format may
differ, such as “MM/DD/YYYY” versus “DD/MM/YYYY”.
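A minimal sketch of resolving a date-format conflict while merging two sources; the DataFrames, column names, and formats below are assumptions.

import pandas as pd

# Source A stores dates as MM/DD/YYYY, source B as DD/MM/YYYY (hypothetical)
a = pd.DataFrame({'student_id': [1, 2], 'enrolled': ['01/31/2024', '02/15/2024']})
b = pd.DataFrame({'student_id': [1, 2], 'graduated': ['30/06/2027', '15/07/2027']})

# Normalize both sources to a common datetime representation before merging
a['enrolled'] = pd.to_datetime(a['enrolled'], format='%m/%d/%Y')
b['graduated'] = pd.to_datetime(b['graduated'], format='%d/%m/%Y')

merged = a.merge(b, on='student_id')
print(merged)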
13. Data Reduction
Data reduction decreases the volume of the data, which makes analysis easier
while producing the same or almost the same results. It also helps to reduce
storage space. Common data reduction techniques are dimensionality reduction,
numerosity reduction, and data compression.
Dimensionality reduction: This process is necessary for real-world applications
because data sizes are large. It reduces the number of random variables or
attributes so that the dimensionality of the data set is lowered, combining and
merging attributes without losing their original characteristics. This also
reduces storage space and computation time. When data is highly dimensional,
the problem known as the “curse of dimensionality” occurs.
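A minimal sketch of dimensionality reduction with principal component analysis in scikit-learn; the toy data and the choice of 2 components are assumptions.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples with 10 attributes
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 10))

# Combine the 10 attributes into 2 principal components
pca = PCA(n_components=2)
x_reduced = pca.fit_transform(x)

print(x_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component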
14. Data Reduction cntd
Numerosity reduction: In this method, the data is replaced by a smaller
representation, reducing the volume. No data is lost in this reduction.
Data compression: Encoding the data in a compressed (reduced) form. Compression
can be lossless or lossy. When there is no loss of information during
compression, it is called lossless compression, whereas lossy compression
removes some information, limited to what is considered unnecessary.
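An illustrative sketch of numerosity reduction by random sampling and of lossless compression using Python's standard library; the data and the 1% sample size are assumptions.

import gzip
import numpy as np
import pandas as pd

# Hypothetical dataset with one million rows
df = pd.DataFrame({'value': np.arange(1_000_000)})

# Numerosity reduction: keep a 1% random sample as a smaller representation
sample = df.sample(frac=0.01, random_state=0)
print(len(sample))  # 10000 rows

# Lossless data compression: the original bytes can be fully recovered
raw = df.to_csv(index=False).encode()
compressed = gzip.compress(raw)
print(len(raw), len(compressed))            # the compressed form is much smaller
assert gzip.decompress(compressed) == raw   # no information is lost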
16. Case study: Data processing with ML in Python
Getting the dataset
Importing libraries
Importing datasets
Finding Missing Data
Encoding Categorical Data
Splitting dataset into training and test set
Feature scaling
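The initial import-and-load code is not shown in this excerpt; a minimal sketch consistent with the later steps is given below. The file name 'Data.csv' and its Country/Age/Salary/Purchased column layout are assumptions.

# Importing libraries
import numpy as np
import pandas as pd

# Importing the dataset (hypothetical file and column layout)
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values  # independent variables: Country, Age, Salary
y = dataset.iloc[:, -1].values   # dependent variable: Purchased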
18. Code cntd.
#handling missing data (replacing missing data with the mean value)
#note: sklearn.preprocessing.Imputer has been replaced by SimpleImputer in newer scikit-learn
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#Fitting imputer object to the independent variables x (the two numeric columns)
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
#for Country variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
19. Code cntd
#Encoding for dummy variables
#note: OneHotEncoder's categorical_features argument was removed; ColumnTransformer now selects the column
from sklearn.compose import ColumnTransformer
onehot_transformer = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = onehot_transformer.fit_transform(x)
#encoding for purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
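#Splitting the dataset into training and test sets
#(this step is listed among the case-study steps but not shown in the excerpt;
# the 80/20 split and random_state below are assumptions)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)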
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)  #fit the scaler on the training set and transform it
x_test = st_x.transform(x_test)        #reuse the same scaling parameters on the test set