- 1. Data screening and cleaning. Mohamed, Hassan Mohamed Hussein. Business Administration Department, Faculty of Commerce, Cairo University, Egypt, 2016.
- 2. Agenda. Importance; data screening steps; data cleaning (missing data, normality, linearity, outliers, multicollinearity, homoscedasticity). Hassan Mohamed Cairo University- Statistical Package, 2016
- 3. Importance. Where in the research process should you clean your data? Data cleaning and screening is the step that directly follows data entry, and you must not start your analysis before doing it. Why data screening matters: it is very easy to make mistakes when entering data, and some errors can mess up your analysis. It is therefore important to spend time checking for mistakes at the start rather than trying to repair the damage later. Ask another person to check your data as well.
- 4. Data screening steps. 1) Identify abnormal (out-of-range) values from the frequency tables. 2) Go back to the original questionnaire and correct them.
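
The two screening steps can be sketched outside SPSS as well. The example below is only an illustration: the respondent IDs, the responses, and the assumed legal range of 1 to 5 are all hypothetical.

```python
from collections import Counter

# Hypothetical responses to a 5-point item, keyed by respondent ID.
responses = {1: 4, 2: 5, 3: 55, 4: 2, 5: 0, 6: 3, 7: 4}
VALID_RANGE = range(1, 6)  # assumed legal codes: 1..5

# Step 1: frequency table of every value that was entered.
frequencies = Counter(responses.values())
for value, count in sorted(frequencies.items()):
    flag = "" if value in VALID_RANGE else "  <-- out of range"
    print(f"{value:>3}: {count}{flag}")

# Step 2: list the respondent IDs to look up in the original questionnaires.
out_of_range_ids = sorted(
    rid for rid, value in responses.items() if value not in VALID_RANGE
)
print("Check questionnaires for IDs:", out_of_range_ids)
```

The code only locates the suspicious cases; the correction itself still has to come from the original questionnaire.
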
- 5. Data cleaning. Data cleaning covers: missing data, normality, linearity, outliers, multicollinearity, and homoscedasticity.
- 6. Missing data. If the missing data come from data entry: detect them from the frequencies of the variable (the count of missing values), then sort the data in ascending or descending order to get the IDs of the cases with missing values. Go back to the source and fill in the values, then run the descriptive analysis again.
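
As a rough sketch of the same idea, the function below counts the missing entries for a variable and returns the IDs to look up. The data set, the variable names, and the helper `missing_ids` are hypothetical, and `None` stands in for a blank cell.

```python
# Hypothetical data set: one dict per respondent; None marks a blank entry.
rows = [
    {"id": 101, "age": 34, "income": 5200},
    {"id": 102, "age": None, "income": 4100},
    {"id": 103, "age": 29, "income": None},
    {"id": 104, "age": 41, "income": 6100},
]


def missing_ids(rows, variable):
    """Return the IDs of cases whose value for `variable` is missing."""
    return [row["id"] for row in rows if row[variable] is None]


for variable in ("age", "income"):
    ids = missing_ids(rows, variable)
    print(f"{variable}: {len(ids)} missing, IDs {ids}")
```
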
- 7. Missing data (cont.). If the missing data come from respondent errors (the respondent was ambiguous or forgot to answer the question) and they make up more than 10% of the values of the variable concerned, do not attempt to treat the missing data.
- 8. Missing data (cont.). If the missing values are less than 10%, you can deal with them in one of two ways: 1. Substitute the neutral value (Malhotra, 2010). 2. Substitute an imputed value (Hair et al., 2010). Imputation using only valid data: complete data, i.e. exclude cases listwise (least preferable when under 10% of the data are missing), or all available data.
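
The 10% rule of thumb can be checked per variable with a few lines of code. This is a minimal sketch with made-up data; the slides do not say what to do at exactly 10%, so treating that case as "do not impute" is an assumption of this example.

```python
THRESHOLD = 0.10  # the 10% cut-off stated in the slides


def missing_share(values):
    """Fraction of entries in `values` that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Hypothetical variables with 20 cases each.
data = {
    "satisfaction": [4, 5, None, 3, 4, 4, 5, 2, 3, 4,
                     5, 4, 3, 4, 5, 4, 3, 2, 4, 5],
    "income": [None, 5, None, 3, None, 4, 5, 2, 3, 4,
               5, 4, None, 4, 5, 4, 3, 2, 4, 5],
}

decisions = {}
for name, values in data.items():
    share = missing_share(values)
    # Assumption: exactly 10% is handled like "more than 10%".
    decisions[name] = "may impute" if share < THRESHOLD else "do not impute"
    print(f"{name}: {share:.0%} missing -> {decisions[name]}")
```
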
- 9. Missing data (cont.). Imputation using known replacement values: case substitution; hot and cold deck imputation (the most similar case, or the best known value). Imputation by calculating replacement values: mean substitution; regression imputation (a prediction equation built from the valid data). The "Replace with mean" option should never be used, as it can severely distort the results of your analysis.
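
One way to see the distortion, assuming the warning refers to mean substitution, is to compare the spread of a variable before and after every missing case is given the mean. The scores below are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical scores; None marks a missing answer.
scores = [2, 9, None, 4, 10, None, 1, 8, None, 3]

valid = [s for s in scores if s is not None]
valid_mean = mean(valid)

# Mean substitution: every missing case receives the same value.
imputed = [valid_mean if s is None else s for s in scores]

print(f"SD of valid cases only  : {stdev(valid):.2f}")
print(f"SD after mean imputation: {stdev(imputed):.2f}")
```

The mean is unchanged, but the standard deviation shrinks because the imputed cases add no variation. That artificial loss of variance is what weakens correlations and other variance-based statistics.
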
- 10. Missing data (cont.). Alternatively, exclude cases pairwise (recommended): a case is excluded only if it is missing the data required for the specific analysis, and it is still included in any other analysis (Pallant, 2011).
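
The difference between listwise and pairwise exclusion shows up in the number of cases each analysis keeps. The sketch below uses hypothetical cases and two illustrative helpers, `listwise` and `pairwise`, which are not SPSS functions.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"id": 1, "age": 25, "income": 3000, "score": 7},
    {"id": 2, "age": None, "income": 4200, "score": 6},
    {"id": 3, "age": 31, "income": None, "score": 8},
    {"id": 4, "age": 40, "income": 5100, "score": None},
    {"id": 5, "age": 36, "income": 4700, "score": 9},
]
VARIABLES = ("age", "income", "score")


def listwise(cases, variables):
    """Keep only cases that are complete on every variable in the data set."""
    return [c for c in cases if all(c[v] is not None for v in variables)]


def pairwise(cases, needed):
    """Keep cases that are complete on the variables one analysis needs."""
    return [c for c in cases if all(c[v] is not None for v in needed)]


print("listwise n:", len(listwise(cases, VARIABLES)))
print("pairwise n, age x income:", len(pairwise(cases, ("age", "income"))))
print("pairwise n, income x score:", len(pairwise(cases, ("income", "score"))))
```

Listwise exclusion leaves two cases for every analysis, while pairwise exclusion keeps three for each pair of variables, at the cost of a different sample for each analysis.
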
- 11. Normality. Normality refers to the shape of the data distribution for an individual metric variable. A normal distribution is a symmetrical, bell-shaped curve with the greatest frequency of scores in the middle and smaller frequencies toward the extremes. It is required for any parametric analysis. Departures from normality can be negligible when the sample has more than 50 respondents.
- 12. Normality (cont.). Normality measures. Kurtosis: the peakedness (leptokurtic) or flatness (platykurtic) of the distribution compared with the normal distribution. In a normal distribution the kurtosis value is zero (values within ±10 are tolerated). Skewness: the balance of the distribution, either positive (right-skewed) or negative (left-skewed). In a normal distribution the skewness value is zero (values within ±3 are tolerated).
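
Both measures can be computed from the central moments of the data. The sketch below uses the simple population (moment) formulas; SPSS reports sample-adjusted versions, so its values will differ slightly, especially in small samples. The two data sets are hypothetical.

```python
from statistics import fmean


def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5, 6, 7]
long_right_tail = [1, 1, 2, 2, 3, 4, 20]

for name, values in (("symmetric", symmetric), ("right tail", long_right_tail)):
    print(f"{name}: skewness={skewness(values):.2f}, "
          f"kurtosis={excess_kurtosis(values):.2f}")
```

A long tail toward high values gives positive skewness; mirroring the data flips the sign.
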
- 13. Normality (cont.). Compare the 5% trimmed mean with the mean; values that are close together suggest that extreme scores have little influence. Kolmogorov-Smirnov and Shapiro-Wilk significance values above 0.05 indicate normality, but these tests are very sensitive when the sample size exceeds 200. The histogram should form a bell shape. A transformation can fix a non-normal distribution.
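
The 5% trimmed mean can be approximated by dropping the lowest and highest 5% of cases before averaging. This is a simplified sketch with hypothetical scores: it rounds the number of dropped cases down, which may not match how SPSS handles fractional counts.

```python
from statistics import fmean


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the cases from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # cases to drop per tail, rounded down
    return fmean(ordered[k:len(ordered) - k])


# Hypothetical scores: 19 ordinary values (10..28) plus one extreme case.
scores = list(range(10, 29)) + [200]

print(f"mean           : {fmean(scores):.2f}")
print(f"5% trimmed mean: {trimmed_mean(scores):.2f}")
```

A large gap between the two figures, as here, signals that extreme scores are pulling the mean and deserve a closer look.
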
- 14. Linearity. Linearity is an assumption of multivariate techniques based on correlational measures of association, including multiple regression (Hair et al., 2010). The relationship between the two variables should be linear: a scatterplot of the scores should show a roughly straight line, not a curve (Pallant, 2011). A transformation can overcome curvilinearity (Hair et al., 2010).
- 15. Linearity (cont.). So, if your sample has more than 50 respondents, you should not transform your data merely to correct a non-normal distribution, but you should transform them to correct curvilinearity.
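
To illustrate how a transformation can straighten a curved relationship, the sketch below compares the Pearson correlation before and after a log transformation. The exponential data and the helper `pearson_r` are assumptions made for this example; the right transformation for real data depends on the shape of the curve.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = list(range(1, 11))
y = [math.exp(0.8 * v) for v in x]  # hypothetical curvilinear relationship

r_raw = pearson_r(x, y)
r_log = pearson_r(x, [math.log(v) for v in y])  # log transformation of y

print(f"r before transformation: {r_raw:.3f}")
print(f"r after log(y)         : {r_log:.3f}")
```
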
- 16. Outliers. Outliers are cases with extreme scores, which therefore have a much higher impact on the outcome of any statistical analysis. An outlier is not necessarily an error in your data, but it can make the data unrepresentative of the population (e.g. income). Outliers can be detected using box plots. Sources of outliers (Hair et al., 2010; Tabachnick & Fidell, 1996): a mistake in data entry (a 6 entered as 66, etc.); a missing-value code that was not specified, so that missing values are read as real entries (e.g. 99 in SPSS).
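
A box plot flags points that lie beyond its whiskers. A common convention, assumed here, places the fences 1.5 interquartile ranges beyond the quartiles; the income figures are hypothetical, and the quartile method of Python's `statistics.quantiles` may differ slightly from the hinges SPSS uses.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return values beyond k * IQR from the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical monthly incomes with one suspicious entry.
incomes = [2100, 2400, 2500, 2700, 2800, 3000, 3100, 3300, 3500, 66000]

outliers = boxplot_outliers(incomes)
print("Flagged as outliers:", outliers)
```
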
- 17. Outliers (cont.). A further source of outliers (Hair et al., 2010; Tabachnick & Fidell, 1996): the outlier is not part of the population from which you intended to sample. An extraordinary event: remove it. An extraordinary observation: decide on the basis of your valid cases (lean toward eliminating). Neutral values on all variables: lean toward retaining.
- 18. Outliers (cont.). The outlier is part of the population you intended to sample, but within the distribution it appears as an extreme case. In this case you have three choices: 1) delete the extreme cases; 2) change the outliers' scores so that they remain extreme but fit within a normal distribution (for example, make each one unit larger or smaller than the last case that fits in the distribution); 3) if the outliers seem to be part of an overall non-normal distribution, apply a transformation, but check for normality first.
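
The second choice can be sketched as follows. The function `pull_in_outliers`, the scores, and the bounds of 5 and 25 are hypothetical; in practice the bounds would come from whatever rule was used to identify the outliers.

```python
def pull_in_outliers(values, low, high, unit=1):
    """Replace scores outside [low, high] with a value one `unit` beyond
    the most extreme score that lies inside the bounds."""
    inside = [v for v in values if low <= v <= high]
    top, bottom = max(inside) + unit, min(inside) - unit
    return [top if v > high else bottom if v < low else v for v in values]


# Hypothetical scores; 48 is the extreme case.
scores = [12, 15, 14, 13, 16, 15, 48]

adjusted = pull_in_outliers(scores, low=5, high=25)
print(adjusted)
```

The adjusted case is still the highest score in the set, so its rank is preserved while its pull on the mean and variance is reduced.
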
- 19. Outliers (cont.). Outliers should be retained to ensure generalizability to the population, unless they are not representative of the population. So, again, if your sample has more than 50 respondents you should not transform your data merely to correct a non-normal distribution, but you should transform them to reduce the influence of outliers.
- 20. Thank You