3. Importance
Where should you clean your data in your
research process?
Data cleaning and screening is the step that directly
follows data entry, and you must not start your analysis
until it is done.
Data screening importance:
It is very easy to make mistakes when entering data.
Some errors can mess up your analysis.
So, it is important to spend time checking for
mistakes initially, rather than trying to repair the
damage later. Ask another person to check your data.
Hassan Mohamed Cairo University- Statistical
Package, 2016
4. Data screening steps
1) Check for abnormal data (values out of
range) in the frequencies table.
2) Go back to the original questionnaire and
correct them.
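
A minimal sketch of step 1 in Python, assuming a hypothetical 5-point item coded 1 to 5 and made-up responses; the names `responses` and `VALID_RANGE` are illustrative and not from the slides. It builds a frequency table and lists the case IDs whose values fall out of range, which are the ones to look up in the original questionnaire.

```python
from collections import Counter

# Hypothetical responses to a 5-point item, keyed by case ID (made-up data).
responses = {1: 4, 2: 5, 3: 55, 4: 2, 5: 3, 6: 0, 7: 4}
VALID_RANGE = range(1, 6)  # assumed valid codes: 1 to 5

# Frequency table: how often each entered value occurs.
frequencies = Counter(responses.values())

# Case IDs whose value is out of range; check these against the questionnaire.
suspect_ids = sorted(cid for cid, v in responses.items() if v not in VALID_RANGE)

for value in sorted(frequencies):
    print(value, frequencies[value])
print("Check cases:", suspect_ids)
```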
5. Data cleaning
Data cleaning includes:
Missing data
Normality
Linearity
Outliers
Multicollinearity
Homoscedasticity
6. Missing data
- If missing data come from data entry:
You can detect them from the frequencies of the variable
(missing #).
Then sort your data in ascending or descending order.
This gives you the IDs of the cases with missing values.
Go back to the source and try to fill them in.
Run your descriptive analysis again.
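
The same lookup can be sketched in Python, assuming a hypothetical variable `age` stored by case ID with `None` marking an empty entry (both the data and the names are made up for illustration).

```python
# Hypothetical data: case ID -> value of one variable; None marks a missing entry.
age = {101: 34, 102: None, 103: 27, 104: 41, 105: None}

# Sorted IDs of the cases to look up in the original questionnaires.
missing_ids = sorted(cid for cid, v in age.items() if v is None)

print("Missing count:", len(missing_ids))
print("Case IDs to look up:", missing_ids)
```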
7. Missing data (cont.)
- If the missing data come from respondent errors:
The respondent's answer was ambiguous.
The respondent forgot to answer the question.
• And the missing data are more than 10% of the total
values of the variable that has missing data, then
do not treat the missing data.
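
A rough sketch of the 10% check, assuming two hypothetical variables with made-up values and `None` for a missing answer. The slides do not say what to do at exactly 10%, so the comparison below treats only shares above 10% as "do not treat"; that boundary choice is an assumption.

```python
def missing_share(values):
    """Return the fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

THRESHOLD = 0.10  # the 10% cut-off mentioned in the slides

# Hypothetical variables with made-up values.
data = {
    "income": [3200, None, 4100, None, 2900, 3600, None, 3100, 3300, 3000],
    "age": [25, 31, 40, 28, 35, 52, 47, 33, 29, None,
            38, 44, 26, 30, 36, 41, 27, 39, 45, 32],
}

for name, values in data.items():
    share = missing_share(values)
    decision = "do not treat" if share > THRESHOLD else "candidate for treatment"
    print(f"{name}: {share:.0%} missing -> {decision}")
```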
8. Missing data (cont.)
• If the missing values are less than 10%, you can
deal with them in one of these ways:
1. Substitute the neutral value. (Malhotra, 2010)
2. Substitute an imputed value: (Hair et al., 2010)
Imputation using only valid data: Exclude cases
listwise
Complete data. (Least preferable under 10% of
missing data)
All available data.
9. Missing data (cont.)
Imputation using known replacement values:
Case substitution.
Hot and Cold Deck imputation (most similar case, or
best known value)
Imputation by calculating replacement values: Replace
with……
Mean substitution
Regression imputation (prediction equation of the
valid data)
This option should never be used, as it can severely
distort the results of your analysis.
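
One way a replacement value can distort results is illustrated below for mean substitution, using made-up numbers: the mean stays the same, but the variance shrinks because every imputed case sits exactly at the centre. This is only a sketch of the general risk, not a reproduction of any particular package option.

```python
from statistics import mean, variance

observed = [2, 4, 6, 8]  # valid values (made-up)
n_missing = 2            # two cases with no answer

# Mean substitution: every missing value is replaced by the observed mean.
imputed = observed + [mean(observed)] * n_missing

print(mean(observed), mean(imputed))          # the mean is unchanged
print(variance(observed), variance(imputed))  # the variance shrinks
```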
10. Missing data (cont.)
Or
Exclude cases pairwise (recommended)
Excludes a case only if it is missing the
data required for the specific analysis. The case is still
included in any other analysis for which it has the data. (Pallant, 2011)
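
The difference between listwise and pairwise exclusion can be sketched with four hypothetical cases measured on three made-up variables (x, y, z); `None` marks a missing value and the helper `complete` is illustrative only.

```python
# Hypothetical cases: (id, x, y, z); None marks a missing value.
cases = [
    (1, 10, 20, 30),
    (2, 12, None, 33),
    (3, 9, 18, None),
    (4, 11, 21, 31),
]

def complete(case, columns):
    """True if the case has a value in every requested column index."""
    return all(case[c] is not None for c in columns)

# Listwise: a case must be complete on all variables to be used anywhere.
listwise_ids = [c[0] for c in cases if complete(c, (1, 2, 3))]

# Pairwise: a case is dropped only from analyses that need its missing variable.
pairwise_xy = [c[0] for c in cases if complete(c, (1, 2))]
pairwise_xz = [c[0] for c in cases if complete(c, (1, 3))]

print("listwise:", listwise_ids)
print("pairwise x-y:", pairwise_xy)
print("pairwise x-z:", pairwise_xz)
```

Pairwise exclusion keeps three cases for each two-variable analysis here, while listwise exclusion keeps only two cases overall.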
11. Normality
Normality is the shape of the data distribution for an
individual metric variable.
The term describes a symmetrical, bell-shaped curve,
which has the greatest frequency of scores in the
middle, with smaller frequencies towards the extremes.
It is required for any parametric analysis.
Departures from normality can be treated as negligible if
the sample size is more than 50 respondents.
12. Normality (Cont.)
Normality measures:
Kurtosis:
Peakedness (leptokurtic) or flatness (platykurtic) of
the distribution compared to the normal distribution.
In a normal distribution the kurtosis value is zero
(values within ±10 are allowed).
Skewness:
The balance of the distribution.
Positive skew (right-skewed, long tail to the right) or
negative skew (left-skewed, long tail to the left).
In a normal distribution the skewness value is zero
(values within ±3 are allowed).
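
A minimal sketch of both measures, computed from central moments on made-up data. The function name `shape_measures` is hypothetical, and statistical packages typically report bias-corrected sample versions, so their values may differ slightly from these simple moment-based ones.

```python
def shape_measures(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    # Excess kurtosis subtracts 3 so that a normal distribution scores zero.
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]           # balanced around its mean
long_right_tail = [1, 1, 2, 2, 3, 20]  # a few large values stretch the right tail

skew_sym, kurt_sym = shape_measures(symmetric)
skew_right, kurt_right = shape_measures(long_right_tail)

print("symmetric:", skew_sym, kurt_sym)
print("long right tail:", skew_right, kurt_right)
```

The symmetric sample has zero skewness, while the sample with a long right tail has positive skewness.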
13. Normality (Cont.)
Compare the 5% trimmed mean with the mean; values that are
close together suggest that extreme scores have little influence.
Kolmogorov-Smirnov and Shapiro-Wilk significance values of more
than 0.05 indicate normality. But these tests are very sensitive
when the sample size is more than 200.
The histogram should form a bell shape.
Transformation can fix a non-normal
distribution.
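
The comparison of the 5% trimmed mean with the ordinary mean can be sketched as follows, using 20 made-up scores with one extreme value. The helper `trimmed_mean` is hypothetical and drops whole cases only, which is a simplification; a statistical package may handle fractional cases differently.

```python
def trimmed_mean(values, percent=5):
    """Mean after dropping `percent`% of cases from each end of the sorted data."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # whole cases to drop per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = list(range(1, 20)) + [100]  # 20 made-up scores, one extreme

plain = sum(scores) / len(scores)
trimmed = trimmed_mean(scores)

print("mean:", plain)
print("5% trimmed mean:", trimmed)
```

A large gap between the two values, as in this example, suggests that extreme scores are pulling the mean.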
14. Linearity
Linearity is an assumption of multivariate techniques based on
correlational measures of association, including multiple regression.
(Hair et al., 2010)
The relationship between the two variables should be
linear. This means that when you look at a scatterplot
of scores you should see roughly a straight line, not
a curve (curvilinear). (Pallant, 2011)
Transformation can overcome the curvilinearity issue.
(Hair et al., 2010)
15. Linearity (cont.)
So, you should not transform your data to correct a non-normal
distribution if your sample is more than 50.
But you should transform the data to correct
curvilinearity.
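
As an illustration of how a transformation can straighten a curved relationship, the sketch below uses made-up data in which y doubles with every step of x, and compares the Pearson correlation before and after a log transformation of y. The log transformation is just one assumed choice; the right transformation depends on the shape of the curve.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 ** v for v in x]  # made-up curved (exponential) relationship

r_raw = pearson_r(x, y)                          # linear fit to a curve
r_log = pearson_r(x, [math.log(v) for v in y])   # after log-transforming y

print("r before transformation:", round(r_raw, 3))
print("r after transformation:", round(r_log, 3))
```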
16. Outliers
Outliers are case scores that are extreme and therefore
have a much higher impact on the outcome of any
statistical analysis.
An outlier is not necessarily an error in your data, but it can make
your data unrepresentative of its population (e.g., income).
Outliers can be detected using box plots.
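
A box plot conventionally flags points that lie more than 1.5 box lengths (interquartile ranges) beyond the edges of the box. The sketch below applies that rule to made-up income values; the quartile method is the default of Python's `statistics.quantiles`, which may differ slightly from the one used by a given statistical package.

```python
from statistics import quantiles

incomes = [20, 22, 23, 25, 26, 28, 30, 31, 33, 250]  # made-up values

# Quartiles define the box; the fences sit 1.5 box lengths beyond its edges.
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in incomes if v < low or v > high]
print("fences:", low, high)
print("outliers:", outliers)
```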
17. Outliers (cont.)
Outliers come from: (Hair et al., 2010; Tabachnick &
Fidell, 1996)
There was a mistake in data entry (a 6 was entered as
66, etc.).
The missing values code was not specified and missing
values are being read as case entries (99 in SPSS).
The outlier is not part of the population from which you
intended to sample:
Extraordinary event (remove it).
Extraordinary observation (decide depending on
your valid cases; leans toward elimination).
Neutral value for all variables (leans toward retention).
18. Outliers (cont.)
The outlier is part of the population you wanted, but in the
distribution it is seen as an extreme case.
In this case you have three choices:
1) Delete the extreme cases.
2) Change the outliers' scores so that they are still extreme
but fit within a normal distribution (for example, make
each one a unit larger or smaller than the last case that fits in the
distribution).
3) If the outliers seem to be part of an overall non-normal
distribution, then a transformation can be done, but first
check for normality.
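
Choice 2 can be sketched as follows, under two assumptions that are not in the slides: "fits in the distribution" is taken to mean inside the box-plot fences (1.5 interquartile ranges beyond the quartiles), and "a unit" is taken to be 1. The function `pull_in_outliers` and the scores are hypothetical.

```python
from statistics import quantiles

def pull_in_outliers(values, unit=1):
    """Replace each outlier with a score one unit beyond the last case that fits."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed definition of "fits"
    inside = [v for v in values if low <= v <= high]
    floor, ceiling = min(inside) - unit, max(inside) + unit
    return [floor if v < low else ceiling if v > high else v for v in values]

scores = [20, 22, 23, 25, 26, 28, 30, 31, 33, 250]  # made-up values
adjusted = pull_in_outliers(scores)

print(adjusted)
```

The adjusted case is still the highest score, but it no longer dominates the distribution.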
19. Outliers (cont.)
Outliers should be retained to ensure
generalizability to the population, unless they are not
representative of the population.
So, again, you should not transform your data to correct a
non-normal distribution if your sample is more than 50.
But you should transform the data to reduce the impact of outliers.