Data cleaning and screening

Mohamed, Hassan Mohamed Hussein
Business administration department
Faculty of Commerce
Cairo University
Egypt
2016
Data screening and cleaning

Agenda
• Importance
• Data screening steps
• Data cleaning
• Missing data
• Normality
• Linearity
• Outliers
• Multicollinearity
• Homoscedasticity
Hassan Mohamed Cairo University- Statistical
Package, 2016

Importance
Where in your research process should you clean your data?
• Data cleaning and screening is the step that directly
follows data entry; you must not start your analysis
until you have done it.
• Why data screening matters:
  • It is very easy to make mistakes when entering data.
  • Some errors can mess up your analysis.
  • So it is important to spend time checking for mistakes
at the start rather than trying to repair the damage
later. Ask another person to check your data.

Data screening steps
1) Check for abnormal data (values outside the valid
range) in the frequencies table.
2) Go back to the original questionnaire and
correct them.
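
The two screening steps can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure itself: the `responses` data, the case IDs, and the valid range of 1 to 5 are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical answers to one 5-point item, keyed by case ID (assumed data).
responses = {1: 4, 2: 5, 3: 55, 4: 2, 5: 0, 6: 3, 7: 4}
VALID_RANGE = range(1, 6)  # assumed valid codes: 1 to 5

# Step 1: build a frequency table of every value that was entered.
frequencies = Counter(responses.values())
for value in sorted(frequencies):
    flag = "" if value in VALID_RANGE else "  <- out of range"
    print(f"{value:>3}: {frequencies[value]}{flag}")

# Step 2: list the case IDs to look up in the original questionnaires.
suspect_ids = sorted(cid for cid, v in responses.items() if v not in VALID_RANGE)
print("Check questionnaires for IDs:", suspect_ids)
```

Any value flagged in the table points to a data-entry slip that should be corrected from the paper questionnaire rather than guessed.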

Data cleaning
• Data cleaning includes:
  • Missing data
  • Normality
  • Linearity
  • Outliers
  • Multicollinearity
  • Homoscedasticity

Missing data
- If the missing data come from data entry:
• You can detect them from the frequencies table of the
variable (number of missing values).
• Then sort your data in ascending or descending order.
• This gives you the IDs of the cases with missing values.
• Go back to the questionnaires and try to fill them in.
• Run your descriptive analysis again.
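
As a rough sketch of the same idea outside SPSS, the snippet below lists the IDs of cases with a missing entry. It filters directly instead of sorting, and the `cases` data and the use of `None` as the missing marker are assumptions for illustration.

```python
# Hypothetical data set: one dict per case; None marks a missing entry.
cases = [
    {"id": 101, "age": 34, "income": 5200},
    {"id": 102, "age": None, "income": 4100},
    {"id": 103, "age": 29, "income": None},
    {"id": 104, "age": 41, "income": 6100},
]

def missing_ids(rows, variable):
    """Return the IDs of cases whose value for `variable` is missing."""
    return [row["id"] for row in rows if row[variable] is None]

for variable in ("age", "income"):
    print(variable, "missing for IDs:", missing_ids(cases, variable))
```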

Missing data (cont.)
- If the missing data come from respondent errors:
• The respondent was ambiguous.
• The respondent forgot to answer the question.
• And the missing data are more than 10% of the total
values of the variable that has missing data, then
do not treat the missing data.
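
The 10% cut-off can be checked per variable with a short calculation. The sketch below is illustrative only; the `satisfaction` scores and the helper name `missing_share` are assumptions, and `None` stands in for an unanswered item.

```python
def missing_share(values):
    """Fraction of entries in `values` that are missing (None)."""
    return sum(v is None for v in values) / len(values)

# Hypothetical variable with 20 cases, 3 of them unanswered.
satisfaction = [4, 5, None, 3, 4, 2, None, 5, 4, 3,
                4, 4, 5, None, 2, 3, 4, 5, 3, 4]

share = missing_share(satisfaction)
THRESHOLD = 0.10  # the 10% cut-off stated on the slide
decision = "leave untreated" if share > THRESHOLD else "eligible for treatment"
print(f"{share:.0%} missing -> {decision}")
```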

• If the missing values are less than 10%, you can deal
with them:
1. Substitute the neutral value (Malhotra, 2010).
2. Substitute an imputed value (Hair et al., 2010):
• Imputation using only valid data: exclude cases
listwise
  • Complete data (least preferable when less than 10% of
the data are missing).
  • All available data.

• Imputation using known replacement values:
  • Case substitution.
  • Hot and cold deck imputation (most similar case, or
best known value).
• Imputation by calculating replacement values: replace
with…
  • Mean substitution
  • Regression imputation (prediction equation from the
valid data)
• This option should never be used, as it can severely
distort the results of your analysis.

Or
• Exclude cases pairwise (recommended)
  • Excludes a case only if it is missing the data
required for the specific analysis. The case is still
included in any other analysis. (Pallant, 2011)
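
The difference between listwise and pairwise exclusion is easiest to see on a tiny data set. The following is a minimal sketch under assumed data (`cases`, with `None` for a missing answer) and a hypothetical helper `complete_on`; it only shows which cases each approach keeps.

```python
# Hypothetical cases; None marks a missing answer.
cases = [
    {"id": 1, "x": 2.0, "y": 3.0, "z": 1.0},
    {"id": 2, "x": 4.0, "y": None, "z": 2.0},
    {"id": 3, "x": 6.0, "y": 7.0, "z": None},
    {"id": 4, "x": 8.0, "y": 9.0, "z": 4.0},
]

def complete_on(rows, variables):
    """Return the cases that have a value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# Listwise: a case must be complete on all variables in the study.
listwise_ids = [r["id"] for r in complete_on(cases, ("x", "y", "z"))]

# Pairwise: a case only needs the variables used in the analysis at hand.
pairwise_xy_ids = [r["id"] for r in complete_on(cases, ("x", "y"))]
pairwise_xz_ids = [r["id"] for r in complete_on(cases, ("x", "z"))]

print("listwise:", listwise_ids)
print("pairwise x-y:", pairwise_xy_ids)
print("pairwise x-z:", pairwise_xz_ids)
```

Pairwise exclusion keeps more cases per analysis, at the cost of each analysis being based on a slightly different set of cases.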

Normality
• The shape of the data distribution for an individual
metric variable.
• "Normal" describes a symmetrical, bell-shaped curve,
which has the greatest frequency of scores in the
middle with smaller frequencies towards the extremes.
• It is a must for any parametric analysis.
• Departures from normality can be negligible if the sample
size is more than 50 respondents.

Normality (cont.)
• Normality measures:
• Kurtosis:
  • Peakedness (leptokurtic) or flatness (platykurtic) of
the distribution compared to the normal distribution.
  • In a normal distribution the kurtosis value is zero
(values up to ±10 are allowed).
• Skewness:
  • The balance of the distribution.
  • Positive skew (scores piled up on the left, tail to the
right) or negative skew (scores piled up on the right,
tail to the left).
  • In a normal distribution the skewness value is zero
(values up to ±3 are allowed).
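
To make the two measures concrete, here is a minimal sketch that computes them from their moment definitions. It is an illustration under stated assumptions: the `scores` are invented, and these simple population formulas can differ slightly from the bias-corrected figures a statistics package reports.

```python
import statistics

def skewness(values):
    """Moment-based skewness: zero for a symmetric distribution."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores zero."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

# Hypothetical scores with a long tail to the right (positive skew).
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]

print("skewness:", round(skewness(scores), 2))
print("excess kurtosis:", round(excess_kurtosis(scores), 2))
```

A positive skewness value confirms the tail on the right; the results can then be compared with the ±3 and ±10 limits given above.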

Normality (cont.)
• Compare the 5% trimmed mean with the mean; a large
difference shows that extreme scores are influencing the
mean.
• Kolmogorov-Smirnov and Shapiro-Wilk significance values
above 0.05 indicate normality, but these tests are very
sensitive when the sample size is more than 200.
• The histogram should form a bell shape.
Transformation can fix a non-normal
distribution.
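
The comparison between the mean and the 5% trimmed mean can be sketched as follows. The `incomes` figures are invented for illustration, and this simple version drops whole cases from each tail, which may not match exactly how a statistics package handles fractional cases.

```python
import statistics

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the cases from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # whole cases to drop per tail
    return statistics.fmean(ordered[k:len(ordered) - k])

# Hypothetical 20 incomes with one extreme value.
incomes = [30, 32, 33, 35, 35, 36, 37, 38, 38, 39,
           40, 40, 41, 42, 43, 44, 45, 46, 48, 400]

print("mean:", round(statistics.fmean(incomes), 2))
print("5% trimmed mean:", round(trimmed_mean(incomes), 2))
```

When the two figures are far apart, as they are here, the extreme scores deserve a closer look.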

Linearity
• It is an assumption of multivariate techniques based on
correlational measures of association, including multiple
regression (Hair et al., 2010).
• The relationship between the two variables should be
linear. This means that when you look at a scatterplot
of scores you should see a straight line (roughly), not
a curve (curvilinear) (Pallant, 2011).
• Transformation can overcome the curvilinearity issue
(Hair et al., 2010).

Linearity (cont.)
• So, you should not transform your data to correct a
non-normal distribution if your sample is larger than 50.
• But you should transform the data to correct
curvilinearity.
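
A small example may help show what a transformation does to a curvilinear relationship. In this sketch the data are invented so that y grows exponentially with x, and a log transformation is assumed as the remedy; other curve shapes call for other transformations.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical curvilinear data: y grows exponentially with x.
x = list(range(1, 11))
y = [math.exp(0.5 * v) for v in x]

r_raw = pearson_r(x, y)                         # straight line fits poorly
r_log = pearson_r(x, [math.log(v) for v in y])  # after log-transforming y

print("r before transformation:", round(r_raw, 3))
print("r after transformation:", round(r_log, 3))
```

After the transformation the points fall on a straight line, so the linear correlation describes the relationship much better.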

Outliers
• These are case scores that are extreme and therefore
have a much higher impact on the outcome of any
statistical analysis.
• An outlier is not necessarily an error in your data, but
it can make your data unrepresentative of the population
(for example, income).
• Outliers can be detected using box plots.
• Outliers come from (Hair et al., 2010; Tabachnick &
Fidell, 1996):
  • A mistake in data entry (a 6 was entered as
66, etc.).
  • The missing-values code was not specified, so missing
values are read as case entries (99 in SPSS).
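
The box-plot check mentioned on this slide can be approximated numerically. The sketch below assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the `scores` are invented, and the quartile method used by Python's `statistics.quantiles` may differ slightly from the hinges a statistics package draws.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical item scores; 66 could be a 6 keyed in twice.
scores = [4, 5, 5, 6, 6, 6, 7, 7, 8, 66]

print("flagged as outliers:", boxplot_outliers(scores))
```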

Outliers (cont.)
• Outliers also arise when the case is not part of the
population from which you intended to sample:
  • Extraordinary event (remove it).
  • Extraordinary observation (decide depending on your
valid cases; leans towards elimination).
  • Neutral value for all variables (leans towards retention).

Outliers (cont.)
• The outlier is part of the population you wanted, but in the
distribution it is seen as an extreme case.
• In this case you have three choices:
1) Delete the extreme cases.
2) Change the scores of the outliers so that they are still
extreme but fit within a normal distribution (for example,
make each one a unit larger or smaller than the last case
that fits in the distribution).
3) If the outliers seem to be part of an overall non-normal
distribution, then a transformation can be done, but first
check for normality.
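
The second choice can be sketched as a small function. This is a hypothetical illustration: the helper name `pull_in_outliers`, the `scores`, and the bounds of 5 and 30 (assumed to come from a box plot) are not from the source.

```python
def pull_in_outliers(values, low, high, unit=1):
    """Replace scores outside [low, high] with a value one unit beyond
    the most extreme score that still lies inside the range."""
    inside = [v for v in values if low <= v <= high]
    floor, ceiling = min(inside) - unit, max(inside) + unit
    return [ceiling if v > high else floor if v < low else v for v in values]

# Hypothetical scores; the bounds are assumed to come from a box plot.
scores = [12, 14, 15, 15, 16, 18, 19, 45]
adjusted = pull_in_outliers(scores, low=5, high=30)
print(adjusted)
```

The adjusted case remains the highest score in the set, but it no longer dominates the mean and standard deviation.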

Outliers (cont.)
• Outliers should be retained to ensure generalizability
to the population, unless they are not representative of
the population.
• So, again, you should not transform your data to correct
a non-normal distribution if your sample is larger than 50.
• But you should transform the data to deal with outliers.

Thank You
