Lectures in Business Statistics II at Aarhus University BSS, year 2014.

- 1. BUSINESS STATISTICS II PART II: Lectures Weeks 11 – 19 Antonio Rivero Ostoic School of Business and Social Sciences March – May AARHUS UNIVERSITY
- 2. BUSINESS STATISTICS II Lecture – Week 11 Antonio Rivero Ostoic School of Business and Social Sciences March AARHUS UNIVERSITY
- 3. Today’s Outline Simple regression analysis Estimation in a simple regression model (we now use SPSS) 2 / 28
- 4. Introduction Galton (Darwin’s half-cousin) found in his observations that: – For short fathers, on average the son will be taller than his father – For tall fathers, on average the son will be shorter than his father He characterized these results with the notion of “regression to the mean” Pearson and Lee took Galton’s law about the relationship between the heights of children and parents and came up with the regression line: son’s height = 33.73 + .516 × father’s height ª This equation shows that for each additional inch of father’s height, the son’s height increases on average by .516 inch 3 / 28
- 5. Regression Analysis Regression analysis is used to predict one variable on the basis of other variables ª i.e. for forecasting It relies on a model that describes the relationship between the variable to be estimated and the variables that influence it – The response variable is called the dependent variable, y – The explanatory variables are called independent variables, x1, x2, . . . , xk Correlation analysis serves to determine whether or not a relationship exists between variables Does regression imply causation? 4 / 28
- 6. Model A model comprises mathematical equations that accurately describe the nature of the relationship between the DV and the IVs Example of a deterministic model: F = P(1 + i)^n where F = future value of an investment P = present value i = interest rate per period n = number of periods ª In this case we determine F exactly from the values on the equation’s right-hand side 5 / 28
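The deterministic model on this slide can be sketched in a few lines of Python (not part of the original SPSS-based course; the function name and example values are illustrative choices):

```python
# Deterministic model from the slide: F = P(1 + i)^n.
# Given P, i and n, F is determined exactly -- no error term involved.

def future_value(present_value: float, interest_rate: float, periods: int) -> float:
    """Future value of an investment compounded once per period."""
    return present_value * (1 + interest_rate) ** periods

# e.g. 1000 invested at 5% per period for 10 periods
fv = future_value(1000, 0.05, 10)
print(round(fv, 2))  # 1628.89
```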
- 7. Probabilistic model However, deterministic models can sometimes be unrealistic, since other variables that are unmeasurable or unknown can influence the dependent variable Such variables represent uncertainty in real life and should be included in the model In this case we rather use a probabilistic model in order to incorporate such randomness A probabilistic model then incorporates an unknown quantity called the error variable ª it accounts for all measurable and immeasurable variables that are not part of the model 6 / 28
- 8. Simple linear regression model i.e. First Order model y = β0 + β1x + ε where y = dependent variable x = independent variable β = coefficients β0 = y-intercept β1 = slope of the line (rise/run, or ∆y/∆x) ε = error variable ª The coefficients are population parameters, which need to be estimated ª The assumption is that the errors are normally distributed 7 / 28
- 9. Expected values and variance for y The expected value of y is a linear function of x, and y differs from its expected value by a random amount ª linear regression is a probabilistic model For x∗ = a particular value of x: E(y | x∗) = µy|x∗ (mean) V(y | x∗) = σ²y|x∗ (variance) 8 / 28
- 10. Estimating the Coefficients We estimate the coefficients as we estimated population parameters That is, draw a random sample from the population and calculate sample statistics But here the coefficients are part of a straight line, and we need to estimate the line that best represents the sample data points Least squares line ŷ = b0 + b1x where b0 = y-intercept, b1 = slope, and ŷ is the fitted value of y 9 / 28
- 11. Least squares method cf. chap. 4 in Keller The least squares method is an objective procedure to obtain a straight line where the sum of squared deviations between the points and the line, Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)², is minimized The least squares line coefficients: b1 = sxy / sx² and b0 = ȳ − b1x̄ 10 / 28
- 12. Least squares line coefficients For b1 and b0: sxy = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / (n − 1) sx² = Σᵢ₌₁ⁿ (xi − x̄)² / (n − 1) x̄ = Σᵢ₌₁ⁿ xi / n ȳ = Σᵢ₌₁ⁿ yi / n 11 / 28
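These formulas can be sketched in Python (the course uses SPSS; this is only an illustration, using the Example 16.1 data as recovered from the residual values printed later in these slides):

```python
# Least squares coefficients b1 = s_xy / s_x^2 and b0 = ybar - b1 * xbar,
# computed on the Example 16.1 data (years of experience vs. annual bonus).

x = [1, 2, 3, 4, 5, 6]          # years of experience
y = [6, 1, 9, 5, 17, 12]        # annual bonus

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)  # sample covariance
s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)                       # sample variance of x

b1 = s_xy / s_x2
b0 = ybar - b1 * xbar
print(round(b0, 3), round(b1, 3))  # 0.933 2.114
```

The output matches the Coefficients table of the SPSS report shown on a later slide.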
- 13. Least squares line coefficients This actually means that the values of ŷ on average come closest to the observed values of y There are shortcut formulas for b1 (check sample variance p. 110, and sample covariance p. 127) b0 and b1 are unbiased estimators of β0 and β1 12 / 28
- 14. EXAMPLE 16.1 Annual Bonus and Years of Experience Determine the straight-line relationship between annual bonus and years of experience 13 / 28
- 15. Working with SPSS In SPSS we distinguish two main working windows: 1) Data Editor, where the raw data and variables are displayed 2) Statistics Viewer, where scripts and reports are provided Both windows have: MENU SUBMENU ... COMMAND Each command corresponds to a function that bears one or several ARGUMENTS 14 / 28
- 16. Working with SPSS Command-line like It is also possible to work directly with the functions Example of the script for a regression: REGRESSION /DEPENDENT dependent-variable /ENTER list-of-independents SPSS distinguishes between COMMANDS, FILES, VARIABLES, and TRANSFORMATION EXPRESSIONS 15 / 28
- 17. Data Editor in SPSS Analyze Regression Linear 16 / 28
- 18. Report in SPSS GET FILE='C:auspssxm16-01.sav'. DATASET NAME DataSet1 WINDOW=FRONT. REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT Bonus /METHOD=ENTER Years. The Notes table records: Output Created 06-MAR-2014 12:45:25; Active Dataset DataSet1; Filter, Weight, Split File: none; N of Rows in Working Data File: 6; Missing Value Handling: user-defined missing values are treated as missing, and statistics are based on cases with no missing values for any variable used; Syntax: the REGRESSION command shown above. 17 / 28
- 19. Regression Report in SPSS
Variables Entered/Removed: Model 1; Variables Entered: Years; Method: Enter; Dependent Variable: Bonus (all requested variables entered)
Model Summary: Model 1; R = .701; R Square = .491; Adjusted R Square = .364; Std. Error of the Estimate = 4.503; Predictors: (Constant), Years
ANOVA: Regression SS = 78.229, df = 1, Mean Square = 78.229, F = 3.858, Sig. = .121; Residual SS = 81.105, df = 4, Mean Square = 20.276; Total SS = 159.333, df = 5
Coefficients: (Constant) B = .933, Std. Error = 4.192, t = .223, Sig. = .835; Years B = 2.114, Std. Error = 1.076, Beta = .701, t = 1.964, Sig. = .121 18 / 28
- 20. Regression Plot from SPSS Graphs Legacy Dialogs Scatter/Dot... Simple Scatter [Scatter plot of Bonus (0–20) against Years (1–6) with fitted line y = 0.93 + 2.11x; R² Linear = 0.491] 19 / 28
- 21. Calculation of Residuals The deviations of the actual data points from the line are the residuals, which represent observations of ei = yi − ŷi In this case the sum of squares for error (SSE) represents the minimized sum of squared deviations ª basis for other statistics to assess how well the linear model fits the data The standard error of the estimate is the square root of SSE divided by its degrees of freedom, n − 2 ª Remember that in SPSS the value of SSE is given in the Anova table of the regression report 20 / 28
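The residuals and SSE can be sketched as follows (a Python illustration, not part of the SPSS workflow; the coefficients are the Example 16.1 least squares estimates):

```python
# Residuals e_i = y_i - yhat_i and SSE for the fitted line yhat = 0.933 + 2.114 x.

x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
b0, b1 = 0.9333333333333333, 2.1142857142857143   # Example 16.1 fit

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)               # minimized sum of squared deviations

print([round(e, 4) for e in residuals])  # [2.9524, -4.1619, 1.7238, -4.3905, 5.4952, -1.619]
print(round(sse, 3))                     # 81.105
```

Both outputs match the slides: the six residuals printed on the residual plot slide, and the Residual SS in the SPSS Anova table.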
- 22. Annual bonus and years of experience [Scatter plot of bonus (0–15) against years (0–7)] 21 / 28
- 23. Annual bonus and years of experience: Residuals [Scatter plot with the residuals labelled: 2.9524, −4.1619, 1.7238, −4.3905, 5.4952, −1.619] 22 / 28
- 24. Regression examples Finance/economy: – The enterprise equity value and total sales – Number of VP executives and total assets – Quantity of new houses and number of jobs created in a city – Amount of bananas harvested and the density of banana trees per km² Social/health: – Number of violent crimes and the poverty rate – Amount of infectious diseases and population growth – Amount of diseases from chronic illnesses and urbanization level – Number of kids raised and the number of spouses 24 / 28
- 25. Regression examples Miscellaneous: – IQ score development and the average global temperature per year – If a horse can run X mph, how fast will its offspring run? – Number of cigarettes smoked and number of chats had with people – Number of cigarettes smoked and time spent at the hospital ª (more politically correct!) That is, questions like: – For any set of values of an independent variable, what is my predicted value of the dependent variable? – If an independent variable raises its value by 1 unit, how does the dependent variable respond? 25 / 28
- 26. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] We do examples with random numbers... 26 / 28
- 27. Generating Random Numbers in SPSS Variable View: Create two variables for integers Data View: Choose the number of observations in each variable Transform Compute Variable Arguments: Variable names in Target Variable, and Random Numbers in Function group Choose a uniform random variable and establish the range of the observation values 27 / 28
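Outside SPSS, the same do-it-yourself setup can be mimicked with Python's random module (a sketch; the ranges, seed, and sample size are arbitrary choices, not from the slides):

```python
# Mimicking SPSS's RV.UNIFORM-style data generation: two variables
# of uniform random integers for a do-it-yourself regression exercise.

import random

random.seed(42)                                   # fixed seed so the run is reproducible
n = 20                                            # number of observations
iv = [random.randint(1, 10) for _ in range(n)]    # independent variable
dv = [random.randint(0, 50) for _ in range(n)]    # dependent variable
print(len(iv), len(dv))  # 20 20
```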
- 28. Summary Simple linear regression analysis is for the relationship between two interval variables The assumption is that the variables are linearly connected The intercept and the slope of the regression line are the coefﬁcients to be estimated The least squares method produces estimates of these population parameters 28 / 28
- 29. BUSINESS STATISTICS II Lecture – Week 12 Antonio Rivero Ostoic School of Business and Social Sciences March AARHUS UNIVERSITY
- 30. Today’s Outline Review simple linear regression analysis Error variable in regression Model Assessment – standard error of estimate – testing the slope – coefﬁcient of determination – other measures 2 / 26
- 31. Review Simple Linear Regression Analysis Simple regression analysis serves to predict the value of a variable from the value of another variable A linear regression model describes the variability of the data around the regression line The observations on a dependent variable y are a linear function of the observations on an independent variable x The population parameters are expressed in two coefficients, the y-intercept and the slope of the line, which need to be estimated, plus a stochastic part ª y-intercept: the value of y when x equals 0 ª slope: the change in y for a one-unit increase in x 3 / 26
- 32. The Error Variable Remember that in probabilistic models we need to account for unknown and unmeasurable variables that represent noise or error The error variable is critical in estimating the regression coefficients – to establish whether there is a relationship between the dependent and independent variables via an inferential method – to estimate and predict through a regression equation The errors are independent of each other, and the error variable is normally distributed with mean 0 and standard deviation σε ª This is expressed as ε ∼ N(0, σε) 4 / 26
- 33. Expected values of y The dependent variable can be considered a random variable, normally distributed with E(y) = β0 + β1x (mean) σ(y) = σε (standard deviation) Thus the mean of y depends on the value of the independent variable, whereas its standard deviation does not ª the shape of the distribution remains the same, but E(y) changes according to x 5 / 26
- 34. Experimental data and Observations We have typically been working with examples based on observations However it is also possible to perform a controlled trial where we generate experimental data Regression analysis works with both types of data, since the main goal is to determine how the IV is related to the DV For observations both variables are random, and their joint probability is characterized by the bivariate normal distribution ª here the z dimension is the joint density function of the two variables These normality conditions are assumptions for the estimations in a simple linear regression model 6 / 26
- 35. Assessing the Model We use the least squares method to produce the best straight line But a straight line may not be the best representation of the data We need to assess how well the linear model ﬁts the data Methods to assess the model: – standard error of estimate – the t-test of the slope – the coefﬁcient of determination all based on the SSE 7 / 26
- 36. Standard error of estimate Recall the error variable assumptions: ε ∼ N(0, σε) The model is considered poor if σε is large, and it is considered perfect when the value is 0 Unfortunately we do not know this parameter, and we need to estimate σε from the sample data The estimation is based on the sum of squares for error (SSE) ª which is the minimized sum of squared deviations between the points and the regression line SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (n − 1)(sy² − sxy²/sx²) 8 / 26
- 37. Standard error of estimate The standard error of estimate is the approximation of the conditional standard deviation of the dependent variable ª that is, the square root of the residual sum of squares divided by the number of degrees of freedom: sε = √(SSE / (n − 2)) This is the square root of sε², which in fact is the MSE ª the df is actually number of cases − number of unknown parameters IN THE SPSS REPORT: The value of sε is given in the Model Summary table for a linear regression analysis 9 / 26
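The shortcut SSE formula and the standard error of estimate can be sketched on the Example 16.1 data (a Python illustration of the slide formulas, not the SPSS computation itself):

```python
# Standard error of estimate s_eps = sqrt(SSE / (n - 2)), using the shortcut
# SSE = (n - 1) * (s_y^2 - s_xy^2 / s_x^2) from the slide.

from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)                     # sample variance of x
s_y2 = sum((b - ybar) ** 2 for b in y) / (n - 1)                     # sample variance of y
s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)  # sample covariance

sse = (n - 1) * (s_y2 - s_xy ** 2 / s_x2)     # shortcut formula
s_eps = sqrt(sse / (n - 2))                   # standard error of estimate
print(round(sse, 3), round(s_eps, 3))         # 81.105 4.503
```

The 4.503 agrees with the Std. Error of the Estimate in the SPSS Model Summary table.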
- 38. Testing the slope In this case we test whether or not the dependent variable is linearly related to the independent variable ª if it is not, then no matter what value x has, we would obtain the same value for ŷ In other words, the slope of the line, represented by β1, equals zero, and this corresponds to a horizontal line in the plot 10 / 26
- 39. Testing the slope: Uniform distribution with β1 = 0 [Scatter plot of y against x showing a patternless cloud of points] 11 / 26
- 40. Testing the slope If our null hypothesis is that there is no linear relationship between the dependent and independent variables, then we specify H0 : β1 = 0 H1 : β1 ≠ 0 (two-tail test) If we do not reject H0, either we committed a Type II error (wrongly accepting the null hypothesis), or there is not much of a ‘linear’ relationship between the independent variable and the dependent variable However, the relationship can be quadratic, which corresponds to a polynomial regression ª In case we want to check for a positive (β1 > 0) or a negative (β1 < 0) linear relationship between the IV and DV, we perform a one-tail test 12 / 26
- 41. Quadratic relationship with β1 = 0 [Plot of a U-shaped relation between x and y with a horizontal fitted line] a quadratic model: y = β0 + β1x + β2x² + ε 13 / 26
- 42. Estimator and sampling distribution For drawing inferences, b1 is an unbiased estimator of β1: E(b1) = β1 with an estimated SE sb1 = sε / √((n − 1)sx²) that is based on the sample variance of x 14 / 26
- 43. Estimator and sampling distribution If ε ∼ N(0, σε) with values independent of each other, then we use the Student t sampling distribution Test statistic for β1: t = (b1 − β1) / sb1 Thus the t-statistic values are ratios of the coefficients to their SEs IN THE SPSS REPORT: The t-statistic values are given in the Coefficients table of the linear regression analysis 15 / 26
- 44. Estimator and sampling distribution Confidence interval estimator of β1: b1 ± tα/2 sb1 Test statistics and confidence interval estimators are for a Student t distribution with ν = n − 2 IN SPSS: Confidence intervals are under line Properties in the graph Chart Editor 16 / 26
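The slope test and its confidence interval can be sketched on the Example 16.1 data (a Python illustration; the critical value 2.776 is the standard t table entry for α/2 = .025 with df = 4):

```python
# t-test of the slope, H0: beta1 = 0, with t = (b1 - 0) / s_b1
# and s_b1 = s_eps / sqrt((n - 1) * s_x^2), plus the CI b1 +/- t * s_b1.

from math import sqrt

x = [1, 2, 3, 4, 5, 6]          # years of experience (Example 16.1)
y = [6, 1, 9, 5, 17, 12]        # annual bonus
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
b1 = s_xy / s_x2
b0 = ybar - b1 * xbar

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s_eps = sqrt(sse / (n - 2))                  # standard error of estimate
s_b1 = s_eps / sqrt((n - 1) * s_x2)          # standard error of the slope

t = b1 / s_b1                                # test statistic under H0: beta1 = 0
t_crit = 2.776                               # t table value, alpha/2 = .025, df = 4
ci_low, ci_high = b1 - t_crit * s_b1, b1 + t_crit * s_b1

print(round(s_b1, 3), round(t, 3))           # 1.076 1.964
print((round(ci_low, 2), round(ci_high, 2))) # (-0.87, 5.1)
```

The t = 1.964 with Sig. = .121 in the SPSS Coefficients table means H0 is not rejected at the 5% level, which is consistent with the interval covering zero.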
- 45. Coefficient of Determination To measure the strength of the linear relationship we use the coefficient of determination, R² ª useful to compare different models R² = sxy² / (sx² sy²) This is equal to R² = 1 − SSE / Σ(yi − ȳ)² 17 / 26
- 46. Partitioning deviations in Example 16.1, i = 5 [Scatter plot of bonus against years with the regression line] 18 / 26
- 47. Partitioning deviations in Example 16.1, i = 5 [Same plot annotated with x̄ = 3.5, ȳ = 8.33, and the point xi = 5, yi = 17, ŷi = 11.504] 19 / 26
- 48. Partitioning deviations in Example 16.1, i = 5 [Same plot showing the deviations yi − ŷi, ŷi − ȳ, yi − ȳ, and xi − x̄] 20 / 26
- 49. Partitioning deviations in Example 16.1, i = 2 [Plot annotated with the point xi = 2, yi = 1, ŷi = 5.162] 21 / 26
- 50. Partitioning the deviations (yi − ȳ) = (ŷi − ȳ) + (yi − ŷi) The difference between yi and ȳ is a measure of the variation in the dependent variable, and it equals: a) the difference between ŷi and ȳ, which is accounted for by the difference between xi and x̄ ª the variation in the DV explained by the changes in the IV b) and the difference between yi and ŷi, which represents unexplained variation in y If we square all parts of the equation and sum over all sample points, we end up with a statistic for the variation in y: total SS = explained SS + residual SS ª i.e. the sum of squares for regression (SSR) and the sum of squares for error (SSE) 22 / 26
- 51. Coefficient of Determination R² = 1 − SSE / Σ(yi − ȳ)² = (Σ(yi − ȳ)² − SSE) / Σ(yi − ȳ)² = (SS(Total) − SSE) / SS(Total) This is the proportion of variation explained by the regression model, which is the proportion of variation in y explained by x IN THE SPSS REPORT: R² is given in the Model Summary table of the regression analysis 23 / 26
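The partition SS(Total) = SSR + SSE and the resulting R² can be verified on the Example 16.1 numbers (a Python sketch; the printed values agree with the SPSS Anova and Model Summary tables):

```python
# Verify the sum-of-squares partition and R^2 = (SS(Total) - SSE) / SS(Total).

x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)
b0, b1 = 0.9333333333333333, 2.1142857142857143   # Example 16.1 fit
ybar = sum(y) / n

yhat = [b0 + b1 * xi for xi in x]
ss_total = sum((yi - ybar) ** 2 for yi in y)               # total variation in y
ssr = sum((yh - ybar) ** 2 for yh in yhat)                 # explained (regression) SS
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))       # unexplained (error) SS

print(round(ss_total, 3), round(ssr, 3), round(sse, 3))  # 159.333 78.229 81.105
print(round(1 - sse / ss_total, 3))                      # 0.491
```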
- 52. Other measures to assess the model Correlation coefficient r = sxy / (sx sy) We use a t-test for H0 : ρ = 0: t = r √((n − 2) / (1 − r²)) which is t distributed with ν = n − 2, provided the variables are bivariate normally distributed Calculate r in SPSS Analyze Correlate Bivariate (select variables and choose Pearson) 24 / 26
- 53. Other measures to assess the model F-test F = MSR / MSE for MSR = SSR/1 and MSE = SSE/(n − 2) This statistic tests H0 : β1 = 0 IN THE SPSS REPORT: • The F-statistic value is given in the Anova table • The value of r is in the Model Summary table, whereas the t statistics are given in the Coefficients table in the regression analysis 25 / 26
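In simple regression the F-test and the t-test of the slope are equivalent, since F = t². A quick check on the Example 16.1 numbers (a Python sketch using the rounded SS values from the SPSS Anova table):

```python
# F = MSR / MSE, with MSR = SSR / 1 and MSE = SSE / (n - 2);
# for simple regression this equals t^2 for the slope test.

ssr, sse, n = 78.2286, 81.1048, 6      # Example 16.1 sums of squares (rounded)
msr = ssr / 1
mse = sse / (n - 2)
f = msr / mse

print(round(f, 3))           # 3.858
print(round(1.964 ** 2, 3))  # 3.857  (t^2, up to rounding of t)
```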
- 54. Summary The error variable corresponds to the probabilistic part of the regression model ª independent values that are normally distributed with mean 0 and standard deviation σε The standard error of estimate serves to evaluate the regression model by assessing the conditional standard deviation of the dependent variable By testing the slope we can check whether or not there is a linear relationship between the independent and the dependent variables The coefficient of determination measures the strength of the linear relationship in the regression model 26 / 26
- 55. BUSINESS STATISTICS II Lecture – Week 13 Antonio Rivero Ostoic School of Business and Social Sciences March AARHUS UNIVERSITY
- 56. Today’s Outline The equation of the regression model Regression diagnostics 2 / 31
- 57. Regression Equation The regression equation represents the model, where the dependent variable is the response to an independent explanatory variable ª the model stands for the entire population After assessing the model, our next task is to estimate and predict the values of the dependent variable In this case we differentiate the average response of the dependent variable from the prediction of the dependent variable for a new observation of the independent variable 3 / 31
- 58. Estimating a mean value and predicting an individual value If a linear model such as y = β0 + β1x is considered satisfactory for the data, then ˆy = b0 + b1x will represent the sample equation for the estimation of the model ª (Here we predict the error term to be 0) 4 / 31
- 59. Estimating a mean value and predicting an individual value For x∗ representing a speciﬁc value of the independent variable: ˆy = b0 + b1x∗ – is the point prediction of an individual value of the dependent variable when the value of the independent variable is x∗ – is the point estimate of the mean value of the dependent variable when the value of the independent variable is x∗ 5 / 31
- 60. Interval estimators A small p-value for H0 : β1 = 0 suggests a nonzero slope in the regression line However, for a better judgment we need to see how closely the predicted value matches the true value of y There are two interval estimators: a) Prediction interval that predicts y for a given value of x b) Conﬁdence interval estimator that estimates the mean of y for a given value of x 6 / 31
- 61. Prediction interval individual intervals ª Used if we want to predict a one-time occurrence for a particular value of y when x has a given value For ŷ = b0 + b1xg the prediction interval is ŷ ± tα/2,n−2 sε √(1 + 1/n + (xg − x̄)² / ((n − 1)sx²)) where xg is the given value of the independent variable Another way to express this interval is x∗ → ŷ∗, which implies that for x∗, a new value of x (or a tested value of x), the prediction interval for ŷ∗ is ŷ∗ ± tα/2,n−2 √MSE √(1 + 1/n + (x∗ − x̄)²/sxx) 7 / 31
- 62. Confidence interval estimator the average prediction interval For E(y) = β0 + β1x (i.e. for the mean of the dependent variable) the confidence interval estimator is ŷ ± tα/2,n−2 sε √(1/n + (xg − x̄)² / ((n − 1)sx²)) That is, for x∗ → ŷ∗, the mean prediction interval for ŷ∗ is ŷ∗ ± tα/2,n−2 √MSE √(1/n + (x∗ − x̄)²/sxx) ª where MSE equals σ̂ε², whereas sxx is the unnormalized form of V(X) 8 / 31
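The two intervals differ only by the extra 1 under the square root, so the individual (prediction) interval is always wider than the mean (confidence) interval. A Python sketch on the Example 16.1 numbers; xg = 3 is an arbitrary illustrative choice and 2.776 the t table value for α/2 = .025, df = 4:

```python
# Half-widths of the prediction interval (individual) and the
# confidence interval estimator (mean) at a given x value xg.

from math import sqrt

n, xbar, s_x2 = 6, 3.5, 3.5                       # Example 16.1 summary statistics
b0, b1 = 0.9333333333333333, 2.1142857142857143   # fitted coefficients
s_eps = 4.503                                     # standard error of estimate (rounded)
t_crit = 2.776                                    # t table value, alpha/2 = .025, df = 4

xg = 3                                            # arbitrary x value for illustration
yhat = b0 + b1 * xg

common = 1 / n + (xg - xbar) ** 2 / ((n - 1) * s_x2)
half_pred = t_crit * s_eps * sqrt(1 + common)     # individual (prediction) interval
half_mean = t_crit * s_eps * sqrt(common)         # mean (confidence) interval

print(round(yhat, 3))         # 7.276
print(half_pred > half_mean)  # True: the individual interval is always wider
```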
- 63. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Data generation in SPSS • Choose your DV and IV, and number of observations. Then generate uniform random numbers: Transform Compute Variable... • Variable names in Target Variable , and Random Numbers in Function group • Select Rv.Uniform in Functions and Special Variables , and then establish the range of the observation values in Numeric Expression 9 / 31
- 64. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Confidence intervals of the regression model in SPSS • We perform the linear regression analysis Analyze Regression Linear • Individual confidence intervals are given in this command, where under the button Save we select in Prediction Intervals – the Individual option for the Prediction Interval – the Mean option for the Confidence Interval Estimator Both at the usual 95% value 10 / 31
- 65. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Confidence intervals of the regression model in SPSS (2) • Since we have chosen Save , the confidence interval values are saved in the Data Editor ª here LMCI [UMCI] and LICI [UICI] stand respectively for Lower [Upper] Mean and Individual Confidence Interval The Variable View in the Data Editor gives the labels of the new variables 11 / 31
- 66. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Visualizing confidence intervals in SPSS • The visualization of both types of confidence intervals is possible after we have plotted the variables Graphs Legacy Dialogs Scatter/Dot... Simple Scatter • From Elements Fit Line at Total of the graph Chart Editor, we look in the tab Fit Line (Properties) for the options Mean and Individual in the Confidence Intervals section for the two CI estimators 12 / 31
- 67. Confidence bands from SPSS Example 16.2 in Keller [Scatter plot of Price (13.5–16.5) against Odometer (10.0–50.0) with fitted line y = 17.25 − 0.07x, R² Linear = 0.648, and both confidence bands] 13 / 31
- 68. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Predict new observations in SPSS • To forecast new observations, first we need to put the value in the independent variable of the Data Editor • Then we choose a linear regression analysis Analyze Regression Linear • And, after we press the Save button, we select the Unstandardized option in Predicted Values 14 / 31
- 69. Regression Diagnostics Here we are concerned with evaluating the prediction model, which includes some error or noise ei = yi − ŷi thus the residual equals each observation minus its estimated value Recall that in regression analysis there are some assumptions made for the error variable ª the errors are independent of each other and normally distributed, and hence with a constant variance 15 / 31
- 70. Regression Diagnostics A regression diagnostic checks two things: a) whether or not the conditions for the error are fulfilled b) the unusual observations (those that fall far from the regression line), to determine whether or not these values result from a fault in the sampling ª we look at several diagnostic methods for unwanted conditions 16 / 31
- 71. Residual analysis Residual analysis focuses on the differences between the observations and the predictions made by the linear model Residual Analysis in SPSS Residual analysis is based on standardized and unstandardized residuals • After choosing linear regression analysis Analyze Regression Linear • When we press the Save button, we select the Standardized and Unstandardized options in Residuals ª Recall that these values are recorded in the Data View of the Data Editor 17 / 31
- 72. Nonnormality The nonnormality check of the error variable is made by visualizing the distribution of the residuals ª we use the histogram for this Nonnormality in SPSS The histogram of residuals is obtained from Graphs Legacy Dialogs Histogram... • And we choose RES (which corresponds to the unstandardized residuals) for the Variable option 18 / 31
- 73. Nonnormality Nonnormality in SPSS (2) It is also possible to obtain the distribution shape in the histogram • In the Chart Editor we go to Elements Show Distribution and choose Normal 19 / 31
- 74. Heteroscedasticity Heteroscedasticity (or heteroskedasticity) is the term used when the assumption of equal variance of the error variable is violated ª homoscedasticity has the opposite implication, meaning ‘homogeneity of variance’ To test for heterogeneity of variance in the error variable we can plot the residuals against the predicted values of the DV ª then we look at the spread of the points; if the variation in ei = yi − ŷi increases as ŷi increases, the errors are called heteroscedastic This type of graph is sometimes called the ei − ŷi plot 20 / 31
- 75. Heteroscedasticity Heteroscedasticity in SPSS The heteroskedasticity condition is evaluated by the ei − ŷi plot Graphs Legacy Dialogs Scatter/Dot... Simple Scatter • Choosing RES (the unstandardized residuals) for the Y-axis, and PRE (the predicted values) for the X-axis • For the mean line of the residuals in the plot we go to the Chart Editor (by double-clicking the graph in the report) and to Options Y Axis Reference Line • Select the Mean option in the Reference Line tab of Properties 21 / 31
- 76. Nonindependence of the Error variable The nonindependence of the errors means that the residuals are autocorrelated, i.e. correlated over time To detect autocorrelation we can plot the residuals over time and look for alternating or incrementing patterns ª If no clear pattern appears in the plot, then there is an indication that the residuals are independent of each other Alternatively, to detect lack of independence between errors without a time plot, we can perform the Durbin-Watson test ª where the null hypothesis is that no correlation exists, whereas the alternative hypothesis is that a correlation exists; i.e. H0 : ρ = 0, and H1 : ρ ≠ 0 ª we look at this test in multiple regression analysis... 22 / 31
- 77. Nonindependence of the Error variable Nonindependence of the error variable in SPSS We now create a time variable in the EXAMPLE-DO-IT-YOUR-SELF, and then index the observations with a vector sequence Transform Compute Variable... • Index (time) variable in Target Variable , and the Miscellaneous option in Function group • Select $Casenum in Functions and Special Variables 23 / 31
- 78. Nonindependence of the Error variable Nonindependence of the error variable in SPSS (2) After obtaining the unstandardized residuals, we plot these values... Graphs Legacy Dialogs Line... Simple • We select the Mean of the unstandardized residuals in the Line Represents option, and the time variable in Category Axis If we go to the Chart Editor we obtain the expected mean in Options Y Axis Reference Line 24 / 31
- 79. Outliers Outliers are unusual (small or large) observations in the sample, which lie far away from the regression line These points may suggest: an error in the sampling, a recording mistake, or an unusual observation ª we should disregard the observation in case of one of the first two possibilities To detect outliers: – we use scatter diagrams of the IV and DV with the regression line – we check the standardized residuals, where absolute values larger than 2 may suggest an outlier 25 / 31
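The standardized-residual rule can be sketched in Python (an illustration on the Example 16.1 data; standardizing by the standard error of estimate approximates SPSS's saved ZRE variable, and |value| > 2 is the rule of thumb from the slide):

```python
# Flag potential outliers via standardized residuals e_i / s_eps.

from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)
b0, b1 = 0.9333333333333333, 2.1142857142857143   # Example 16.1 fit

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_eps = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

standardized = [e / s_eps for e in residuals]
outliers = [i for i, z in enumerate(standardized) if abs(z) > 2]
print(outliers)  # [] -- no standardized residual exceeds 2 in this small sample
```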
- 80. Outliers Detection of outliers in SPSS First we get the standardized residuals when choosing linear regression analysis Analyze Regression Linear Under the button Save we select Standardized in Residuals Then we obtain the absolute values of this variable • ZRE_1 in Target Variable , and choose Arithmetic in Function group • Select Abs in Functions and Special Variables and put this variable code in the parentheses 26 / 31
- 81. Influential Observations We use scatter diagrams of the IV and DV with the regression line as well to evaluate the impact of influential observations ª we produce two plots, one with and another without the supposed influential obs. Optionally, to detect influential observations we can use different measures as well: Leverage describes the influence each observed value has on the fitted value for that observation ª where the Mahalanobis distance is a measure of the leverage of the observation Cook’s D (distance) detects dominant observations, either outliers or observations with high leverage ª an Influence plot is made of the Studentized Residuals (ei/SE) against the leverages of the observations (called ‘hat’ values) 27 / 31
- 82. Cook’s Distance Example 16.2 in Keller [Index plot of Cook’s distance by observation number; observations 19, 74, and 86 stand out] 28 / 31
- 83. Influence plot (example 16.2 in Keller) Areas of the circles are proportional to Cook’s distances [Plot of Studentized Residuals against Hat-Values; observations 8 and 19 are labelled] 29 / 31
- 84. Other aspects in Regression Diagnostics • In the validation of linear model assumptions, we can also evaluate the skewness, kurtosis in the distribution shape of the residuals... • The prediction capability of the model can be assessed by looking at the predicted SSE as well (in multiple regression we also look at the collinearity among IVs) 30 / 31
- 85. Summary For a given explanatory variable, we differentiate the individual value of the response variable from its mean value The prediction interval provides individual predictions of the DV, and the confidence interval estimator approximates the mean of the response variable Regression diagnostics concerns evaluating the prediction model and the assumptions of the error variable We look at the dominant points inducing the regression line for assessing the prediction model, whereas much of the diagnostics concentrates on the characteristics of the residuals 31 / 31
- 86. BUSINESS STATISTICS II Lecture – Week 14 Antonio Rivero Ostoic School of Business and Social Sciences 1st April 2014 AARHUS UNIVERSITY
- 87. Today’s Outline Scaling and transformations Standard error of estimates and standardized values Step-by-step example with simple linear regression analysis 2 / 24
- 88. Scaling and transformations Sometimes data transformation is needed in order to obtain e.g. a normal distribution Transformations are mathematical adjustments applied to scores in an attempt to make the distribution of the outcomes fit requirements Scaling (and re-scaling) is a linear transformation based on proportions where the scores are enlarged or reduced 3 / 24
- 89. Data transformation In a simple linear regression analysis we can transform both the explanatory and the response variables For example, in linear regression we may need to transform the data: – when the residuals have a skewed distribution or show heteroscedasticity – to linearize the relationship between the IV and the DV – but also when theory suggests a transformed expression – or to simplify the model in a multiple regression setting 4 / 24
- 90. Scaling and transformations Examples of transformations of the variable x are: – Square root: √ x – Reciprocal: 1/x – Natural log: ln(x) or log(x) – Log 10: log10(x) In linear regression we use least squares ﬁtting ª this transformation allows the residuals to be treated as a continuous differentiable quantity 5 / 24
- 91. Logarithmic transformations linear regression analysis
Model | Transformation | Regression equation
Linear | none | y = β0 + β1x
Linear-log | x′ = log(x) | y = β0 + β1 log(x)
Log-linear | y′ = log(y) | log(y) = β0 + β1x
Log-log | x′ = log(x), y′ = log(y) | log(y) = β0 + β1 log(x)
ª log are natural logarithms with base e ≈ 2.72 ª The term ‘level’ is also used instead of ‘linear’ in logarithmic transformations 6 / 24
- 92. Logarithmic transformations linear regression analysis
Model | Interpretation
Linear | a one unit increase in x leads to a β1 increase/decrease in y
Linear-log | a one percent increase in x leads to a β1/100 increase/decrease in y
Log-linear | a one unit increase in x leads to a β1 × 100% increase/decrease in y
Log-log | a one percent increase in x leads to a β1% increase/decrease in y
ª In econometrics, log-log relationships are referred to as “elastic” and the coefficient of log(x) as the elasticity 7 / 24
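As a quick numeric sketch of the log-log interpretation (hypothetical coefficients, not estimates from the lecture data): with β1 = 0.5, a one percent increase in x should raise ŷ by roughly 0.5%.

```python
import math

# Hypothetical log-log model: log(y) = b0 + b1*log(x); b1 plays the role of the elasticity
b0, b1 = 1.0, 0.5   # assumed values for illustration

def predict(x):
    # back-transform the fitted log(y) to the original scale of y
    return math.exp(b0 + b1 * math.log(x))

# a ~1% increase in x (from 100 to 101) changes y-hat by about b1 percent
pct_change_y = (predict(101) / predict(100) - 1) * 100
```

The change is approximately, not exactly, β1 percent because the percentage interpretation is a first-order approximation.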
- 93. Standard Error of Estimates SE = square root of the ratio between the sum of squared differences of the criterion’s predicted and observed values and the df The sum of squared differences between the criterion’s predicted and observed values corresponds to the Residual SS (SSE in Anova) ª it represents the unexplained variation in the model (or model deviance) The df equals the number of cases − number of predictors in the model − 1 ª in a simple linear regression model there is only one predictor, and df equals n − 2 Thus most of the calculation for the SE of estimates amounts to obtaining the Residual SS 8 / 24
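The SE = √(SSE/df) recipe can be sketched by hand on a toy dataset (made-up numbers, not the lecture’s SPSS file):

```python
import math

# Toy data (hypothetical)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least-squares slope and intercept for the simple model y = b0 + b1*x
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

# Residual SS (SSE) and the standard error of estimates
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))   # df = n - 2 in simple regression
```

This is the same quantity SPSS reports in the Model Summary, here computed from first principles.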
- 94. SE and Residual SS SSE in SPSS After having the data, to obtain the SSE we need ﬁrst the predicted values of our model Analyze Regression Linear • And in Save choose the Unstandardized option in Predicted Values 9 / 24
- 95. SE and Residual SS SSE in SPSS (2) Then we calculate the residuals (yi − ŷi) by hand in a new variable created in the Variable View. We name this variable RESID • Then we go to Transform Compute Variable... and place RESID in Target Variable , and make the subtraction with the expression: DV − PRE_1 10 / 24
- 96. SE and Residual SS SSE in SPSS (3) The next step is to obtain the square of the residuals, and we use the recently created variable (named RESID) for this. Thus the transformation of the residual values to their squares is obtained after we place RESID in Target Variable and type in the Numeric Expression field the square of the values: RESID ∗∗ 2 11 / 24
- 97. SE and Residual SS SSE in SPSS (4) The sum of squares of the residuals, which is the numerator of the SE, is obtained when we sum the values of this last variable Analyze Reports Report Summaries in Columns... and choose RESID for the Data Columns and select Display grand total in Options . The Residual SS or SSE is given in the Report of the Statistics Viewer as Grand Total. ª in SPSS the SE of estimates is given in Model Summary, and the SSE and df values are in the ANOVA table 12 / 24
- 98. Standardized values Standardized values have been transformed into a customary scale Standardized Coefﬁcient In linear regression the standardized coefﬁcient is the product of the regression coefﬁcient and the proportion of the standard deviations of the DV and the IV That is Beta (in SPSS) equals B ∗ (s(x)/s(y)) The standardized coefﬁcient represents the change in the mean of the dependent variable, in y standard deviations, for a one standard deviation increase in the independent variable 13 / 24
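The Beta = B · (s(x)/s(y)) relation can be checked by hand; in simple regression this standardized coefficient also coincides with the Pearson correlation r (toy data, not the lecture’s):

```python
import math
import statistics

# Toy data (hypothetical)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = statistics.mean(x), statistics.mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)

b_coef = sxy / sxx                                          # unstandardized B
beta = b_coef * statistics.stdev(x) / statistics.stdev(y)   # standardized Beta

# for one predictor, Beta equals the correlation coefficient
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)
```

That equivalence is why, with a single IV, SPSS’s Beta column simply reproduces r.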
- 99. Standardized values Standardized Residuals In SPSS we have various types of residuals: – RES_1 stands for unstandardized residuals – SRE_1 stands for Studentized residuals – ZRE_1 stands for standardized residuals And Keller (p. 653) tells us about the standardization of variables in general and of the residuals in particular ª subtract the mean and divide by the standard deviation 14 / 24
- 100. Standardized residuals We get the Excel output table with the standardized residuals for Example 16.2 (Keller, p. 653) Now let us look at the SPSS results for this data... Hmmm...? 15 / 24
- 101. Standardized residuals The term ‘standardized residual’ is not a standardized term In Keller “Standardized” residuals are residuals divided by the standard error of the estimate (residual) (cf. pp 653) However in SPSS these values (cf. Excel output pp 653) correspond to the “Studentized” residuals ª (even though the deﬁnition is for the Studentized deleted residuals) In SPSS a standardized residual is the residual divided by the standard deviation of data ª Studentized residuals (another form for standardization) have a constant variance, and combine the magnitude of the residual and the measure of inﬂuence 16 / 24
- 102. Standardized residuals speaking the same language Residuals (unstandardized) are the difference between observations and expected values: e = y − ŷ In the case of a regression model standardized residuals are normalized to a unit variance The standard deviation or the square root of the variance of the residuals corresponds to the square root of MSE (cf. lec. week 12) ª this is also known as the root-mean-square deviation Standardized residual = residual / √MSE 17 / 24
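The residual / √MSE standardization above can be sketched directly (hypothetical residuals, one predictor):

```python
import math

# Hypothetical residuals from a fitted simple regression (k = 1 predictor)
residuals = [0.5, -1.0, 0.25, 1.25, -1.0]
n, k = len(residuals), 1

mse = sum(e ** 2 for e in residuals) / (n - k - 1)   # SSE / (n - 2)
std_res = [e / math.sqrt(mse) for e in residuals]    # residual / sqrt(MSE)
```

Values beyond roughly ±2 on this scale are the ones residual analysis typically flags.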
- 103. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] Be aware that in this case the model is chosen in advance, and we adopt a linear relationship between two variables 18 / 24
- 104. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 1. Determine the response and the explanatory variables 2. Visualize the data through a scatter plot 3. Perform basic descriptive statistics 19 / 24
- 105. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 4. Estimate the coefﬁcients (intercept and slope) 5. Compute the ﬁtted values and the residuals 6. Obtain the sum of squares for errors (Residual SS) 20 / 24
- 106. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 7. Assess the model a) standard error of estimate b) test of the slope c) coefficient of determination 21 / 24
- 107. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 8. Perform the regression diagnostics a) confidence regions for individual prediction intervals b) confidence regions for the average prediction interval 9. Make a residual analysis a) nonnormality, heteroskedasticity, nonindependence of errors 22 / 24
- 108. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 10. Detect outliers and inﬂuential observations 11. Interpret the results 12. Draw the conclusions 23 / 24
- 109. BUSINESS STATISTICS II Lecture – Week 15 Antonio Rivero Ostoic School of Business and Social Sciences April AARHUS UNIVERSITYAU
- 110. Today’s Outline Multiple regression model • coefﬁcients • estimation • conditions • testing • diagnostics Working example (SE estimates, and ﬁtting the model with logarithmic transformations) 2 / 17
- 111. Multiple regression model While a simple regression analysis has a single independent variable, in a multiple regression analysis we have several explanatory variables for the response variable A multiple regression model is represented by the equation y = β0 + β1x1 + β2x2 + · · · + βkxk + ε where y is the dependent variable, x1, x2, . . . , xk are independent variables, and ε is the error variable ª note that independent variables may be the product of transformations of other variables (which are independent or not) In this case parameters β1, β2, . . . , βk are the regression coefficients, whereas β0 represents the intercept 3 / 17
- 112. Multiple regression model It is important to note that the introduced multiple regression equation represents in this case an additive model Thus the effect of each independent variable on the response is assumed to be the same for all values of the other predictors ª certainly we need to assess whether the additive assumption is realistic or not Q. Are we still considering a linear relationship in the multiple regression model? A. Yes, whenever the model is linear in the coefficients 4 / 17
- 113. Graphical representation Multiple regression models are graphically represented by a hyperplane with k dimensions for the IVs – for k = 2 the relationships between the IVs and the DV are represented by a regression plane within a 3D space – for k > 2 the model is represented by a regression or response surface, a hyperplane that we are not able to visualize 5 / 17
- 114. Interpreting Coefficients In the multiple regression model β0 stands for the intercept of the regression hyperplane, and represents the mean of y when the x’s equal 0 ª it only makes sense if the range of the data includes zero βi, i = 1, . . . , k represents the change in the DV when xi changes one unit while keeping the other IVs constant When it is possible, interpret the regression coefficients as the ceteris paribus effect of their variation on the dependent variable ª i.e. “other things being equal” interpretation 6 / 17
- 115. Estimation The estimation of the coefficients is given by the least squares equation ŷ = b0 + b1x1 + b2x2 + · · · + bkxk for k independent variables And the error variable is estimated by the residuals ei = yi − ŷi 7 / 17
- 116. Required conditions The required conditions on the error variable assumed in a simple linear regression model remain for multiple regression analysis ª that is, errors are independent, normally distributed with mean 0 and a constant σ The standard error of the estimate has fewer df (n − k − 1) than in the simple regression analysis ª we want SE close to zero 8 / 17
- 117. Testing the regression model We test the validity of the model with the following hypotheses H0 : β1 = β2 = · · · = βk = 0 H1 : βi ≠ 0 for at least one i ª The model is invalid in case we fail to reject the null hypothesis, whereas the model has some validity whenever the alternative hypothesis is accepted Since in multiple regression models we have several competing explanatory variables for a response variable, the assessment of the model is central in the analysis 9 / 17
- 118. Testing the regression model The test of significance of the model is based on the F statistic, which means that we focus on the variation of the outcomes The F-test is the ratio of the Mean Squares of Regression and Residual F = (SSR/k) / (SSE/(n − k − 1)) = MSR/MSE Recall that SSR represents the explained variation in the model, whereas SSE is the unexplained variation ª we want a high value for SSR and a low value for SSE, since this indicates that most of the variation in the response variable is explained by the model 10 / 17
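The F = MSR/MSE computation is just arithmetic once the ANOVA quantities are known; here is a sketch with assumed sums of squares (not from the lecture data):

```python
# Hypothetical ANOVA quantities for a model with k = 2 predictors and n = 20 cases
n, k = 20, 2
ssr, sse = 150.0, 50.0     # explained and unexplained variation (assumed)

msr = ssr / k              # mean square of regression
mse = sse / (n - k - 1)    # mean square of residuals
f_stat = msr / mse         # F = MSR / MSE
```

A large F (here well above any usual critical value F(α, 2, 17)) indicates that most of the variation in y is explained by the model.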
- 119. Testing the regression model For the F-test the rejection of H0 applies when F > Fα, k, n−k−1 ª hence for a given α level we infer that at least one regression coefficient differs from zero in case the F statistic value falls within the rejection region Another way to assess the model is through the coefficient of determination or R2, whose interpretation is similar to the simple regression analysis ª we want R2 close to one 11 / 17
- 120. Test of individual coefficients Based on the test of significance of the multiple regression model we can perform individual t tests for each regression coefficient H0 : βi = 0 H1 : βi ≠ 0 (two-tail test) The test statistic is t = (bi − βi) / sbi 12 / 17
- 121. Test of individual coefficients And the confidence intervals are bi ± tα/2, n−k−1 · sbi for i = 1, . . . , k We reject the null hypothesis iff |t| > tα/2, n−k−1 (for a two-tailed test) 13 / 17
- 122. Adjusted R-squared When we add explanatory variables to the multiple regression model we cannot decrease the value of the coefficient of determination ª but it is possible to get a very high R2 even when the true model is not linear Thus the adjusted R-squared is often used to summarize the multiple fit as it takes into account the number of variables in the model ª it is the coefficient of determination adjusted for df Adjusted R2 = 1 − MSE / MS Total where MSE = SSE/(n − k − 1), and MS Total is the sample variance of y Adjusted R2 ≤ R2 14 / 17
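The adjusted R² formula can be checked with assumed sums of squares (hypothetical values, chosen only to show that the adjustment pulls R² down):

```python
# Hypothetical sums of squares for a model with k = 3 predictors and n = 30 cases
n, k = 30, 3
sse, ss_total = 40.0, 100.0

r2 = 1 - sse / ss_total            # ordinary coefficient of determination
mse = sse / (n - k - 1)            # SSE / (n - k - 1)
ms_total = ss_total / (n - 1)      # sample variance of y
adj_r2 = 1 - mse / ms_total        # adjusted R^2
```

Because MSE divides by n − k − 1 while MS Total divides by n − 1, the adjusted value is always at most R², and the gap grows with k.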
- 123. Regression diagnostics: multicollinearity In addition to nonnormality and heteroskedasticity, the regression diagnostics for a multiple model also check for multicollinearity Multicollinearity occurs when two or more independent variables are highly correlated with one another ª hence it is very difficult to separate their particular effects and influences on y It causes inflated standard errors for estimates of regression parameters and very large regression coefficients Some consequences of this inflation are: – a large variability of the samples, which means that the sample coefficients may be far from the population parameters, and hence have wide confidence intervals – small t statistics that suggest no linear relationship between the involved variables and the response variable, an inference that may be wrong 15 / 17
- 124. Multicollinearity Multicollinearity can be avoided if one anticipates the problem from theory or past experience ª multiple correlation scores can serve as a guide Beware that two independent variables can be highly correlated with each other (or with another predictor) but uncorrelated with the dependent variable ª they may be non-redundant suppressor variables A stepwise regression (backward and forward) can serve to minimize multicollinearity in the modelling ª these methods are based on improving the model’s fit 16 / 17
- 125. Multiple regression analysis WORKING EXAMPLE [Prediction of avg. Household Size in Global Cities] Multiple regression analysis using globalcity-multiple.sav 17 / 17
- 126. BUSINESS STATISTICS II Lecture – Week 17 Antonio Rivero Ostoic School of Business and Social Sciences April AARHUS UNIVERSITYAU
- 127. Today’s Outline Model building in multiple linear regression – predictors Comparing regression models Stepwise regression Working example – model building – model comparison Further issues (...) 2 / 16
- 128. Model building in multiple linear regression The main goal in model building is to fit a model that explains the variation of the dependent variable with a small set of predictors ª i.e. a model that efficiently forecasts the response variable of interest When dealing with multiple independent variables, each subset of x’s represents a potential model of explanation ª for k predictors in the data set there are 2^k − 1 subsets of independent variables Thus we want to establish a linear equation that ‘best’ predicts the values of y by using more than one explanatory variable Recall that to obtain a good model we need an R2 score close to 1, a small value for SE, and a large F statistic (which implies a small SSE) 3 / 16
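The 2^k − 1 count of candidate models can be verified by enumerating every nonempty subset of predictors (here with k = 4 generic names):

```python
from itertools import combinations

# Enumerate every nonempty subset of k predictors: each one is a candidate model
k = 4
predictors = [f"x{i}" for i in range(1, k + 1)]
subsets = [s for r in range(1, k + 1) for s in combinations(predictors, r)]
n_models = len(subsets)   # should equal 2**k - 1
```

With k = 4 this already gives 15 candidate models, which is why automated procedures such as stepwise selection become attractive as k grows.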
- 129. Predictors There are two types of independent variables to consider, and they correspond to the numeric and the categorical variables – Factors characterize qualitative data – Covariates represent quantitative data Predictors = Factors + Covariates Sometimes an abstraction made on a numeric variable is called a factor that explains the theory in the regression model, and covariate is simply a control variable 4 / 16
- 130. Comparing Regression Models cf. F-general in Note 2 To test whether a model fits significantly better than a simpler model In this case a restricted or reduced model is nested within an unrestricted or complete model ª i.e. one model is contained in another model The test statistic can be based on the SSE or on the R2 values of both models Fchange = ((R2U − R2R) / df1) / ((1 − R2U) / df2) where df1 = q = kU − kR (i.e. the number of variable restrictions), and df2 = n − kU − 1 5 / 16
- 131. Comparing Regression Models F-general with sum of squares On the other hand, by considering the sum of squares of the residuals, the F statistic becomes Fchange = ((SSER − SSEU) / df1) / (SSEU / df2) with the same df’s as before, and we take the absolute value SPSS We need to combine in Analyze Regression Linear the two models with a different variable selection Method (Enter and Remove in Blocks 1 and 2), and check R squared change in Statistics... 6 / 16
- 132. Comparing Regression Models nested models SPSS The syntax procedure for comparing two nested models is:
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /METHOD=REMOVE x2.
7 / 16
- 133. Comparing Regression Models ...that for the data in Note 2 produces this outcome for both models:
Model Summary (R, R Square, Adj. R Square, Std. Error | R Square Change, F Change, df1, df2, Sig. F Change)
Model 1: ,55 ,304 ,297 67,45215 | ,304 48,426 3 333 ,000
Model 2: ,41 ,167 ,164 73,57910 | −,137 32,811 2 333 ,000
Predictors Model 1: (Constant), years potential experience, years of education, years with current employer
Predictors Model 2: (Constant), years of education
ª the Fchange for Model 2 is for kU = 3 and kR = 1; this statistic is also equivalent to the F score in the analysis of variance of both models 8 / 16
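The F change reported for Model 2 above can be reproduced from the two R² values (df1 = kU − kR = 2, df2 = n − kU − 1 = 333); the small discrepancy against the printed 32,811 comes from rounding of the displayed R² values:

```python
# Reproducing the F change from the Model Summary for the Note 2 data
r2_u, r2_r = 0.304, 0.167   # R^2 of the unrestricted and restricted model
df1, df2 = 2, 333           # kU - kR and n - kU - 1

f_change = ((r2_u - r2_r) / df1) / ((1 - r2_u) / df2)
```

Compared against F(.05, 2, 333) ≈ 3.0, the statistic falls deep in the rejection region, so the restricted model fits significantly worse.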
- 134. Stepwise regression Variable selection A sequential procedure to perform multiple regressions is found in the stepwise method It combines forward selection of predictors and backward elimination of the independent variables These are bottom-up and top-down processes based on F scores and predeﬁned p values ª defaults in SPSS are 5% for IN, and 10% for OUT 9 / 16
- 135. WORKING EXAMPLE [Average Household Size in Global Cities] Model Building (Data in globalcity-multiple.sav) 10 / 16
- 136. Avg. household size in global cities Model assessment
Model Summary (R, R Square, Adj. R Square, Std. Error | R Square Change, F Change, df1, df2, Sig. F Change)
Model 1: ,713 ,508 ,506 1,19059 | ,508 239,764 1 232 ,000
Model 2: ,760 ,578 ,574 1,10517 | ,070 38,248 1 231 ,000
Model 3: ,787 ,620 ,615 1,05170 | ,041 25,087 1 230 ,000
Model 4: ,798 ,637 ,631 1,02944 | ,018 11,053 1 229 ,001
Model 5: ,805 ,648 ,641 1,01542 | ,011 7,367 1 228 ,007
Predictors (added stepwise to the Constant): Household Connection to Water; + Average Income Q3 Person; + Overall Child Mortality; + Informal Employment; + Percent Woman Head of Households
11 / 16
- 137. WORKING EXAMPLE [Average Household Size in Global Cities] Comparing nested models 12 / 16
- 138. Avg. household size in global cities models 4 and 5 The F change in the two nested models is given in:
Model Summary (R, R Square, Adj. R Square, Std. Error | R Square Change, F Change, df1, df2, Sig. F Change)
Model 1: ,805 ,648 ,641 1,01542 | ,648 84,113 5 228 ,000
Model 2: ,798 ,637 ,631 1,02944 | −,011 7,367 1 228 ,007
Predictors Model 1: (Constant), Percent Woman Head of Households, Informal Employment, Average Income Q3 Person, Overall Child Mortality, Household Connection to Water
Predictors Model 2: (Constant), Informal Employment, Average Income Q3 Person, Overall Child Mortality, Household Connection to Water
13 / 16
- 139. Avg. household size in global cities the final model?
(Estimate, Std. Error, t value, Pr(>|t|))
(Intercept): 5.4130 0.3705 14.61 0.0000
x10: −0.0191 0.0031 −6.13 0.0000
x3: −0.0001 0.0000 −3.95 0.0001
x5: 0.0790 0.0157 5.04 0.0000
x9: 0.0131 0.0041 3.18 0.0017
x6: −0.0104 0.0038 −2.71 0.0072
And what about this other one..? y = x4 + x5 + x6 + x8 + x9 + x10 14 / 16
- 140. Further Issues multiple regression Comparison of separate models Regression diagnostics Collinearity tests Logarithmic transformations Interpretation of results 15 / 16
- 141. Summary Conclusions Find a parsimonious model that effectively explains y Model comparison combines evaluation of the ﬁts and the signiﬁcance of regression coefﬁcients ª available automated procedures To compare nested models we use the F statistics ª working example, and data in note 2 WORKING EXAMPLE: “It seems that the inclusion of the ratio of woman head of households improves the model, but does it contribute to explain the change in the average of the household size in the global cities?” 16 / 16
- 142. BUSINESS STATISTICS II Lecture – Week 18 Antonio Rivero Ostoic School of Business and Social Sciences April AARHUS UNIVERSITYAU
- 143. Today’s Outline Polynomial regression models Regression models with interaction Comparing models (note 3) Dummy variables 2 / 20
- 144. Polynomial regression Polynomial regression is a particular case of a regression model that produces a curvilinear relationship between response and predictor Recall that simple regression equations represent first-order models y = β0 + β1x + ε Here the order of the equation p equals 1 and the relation between the predictor and the response is depicted by a regression line ª the model has a ‘degree 1 polynomial’ We can have regression equations with several independent variables that are polynomial models while still having just one predictor variable Remember that when the parameters in the equation are linearly related, the polynomial regression model is considered linear 3 / 20
- 145. First order and polynomial regression models • First order model with two predictors: x1 and x2 y = β0 + β1x1 + β2x2 + ε • First order model with k predictors: x1, . . . , xk y = β0 + β1x1 + β2x2 + · · · + βkxk + ε • Polynomial model with one predictor variable x and order p y = β0 + β1x + β2x² + · · · + βpx^p + ε ª thus a predictor variable can have various orders or powers 4 / 20
- 146. Second-order models • A second-order (polynomial) model with a single predictor variable has p = 2 and the equation represents a quadratic response function depicted by a parabola ª a ‘degree 2 polynomial’ or quadratic polynomial y = β0 + β1x + β2x² + ε β1 controls the translation of the parabola, and β2 its curvature rate 5 / 20
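Evaluating a quadratic response function is a direct substitution; the sketch below uses assumed coefficients (with b2 < 0, so the parabola is concave) and locates its turning point:

```python
# Hypothetical second-order model: y-hat = b0 + b1*x + b2*x^2
b0, b1, b2 = 1.0, 2.0, -0.5    # assumed coefficients; b2 < 0 gives a concave shape

def predict(x):
    return b0 + b1 * x + b2 * x ** 2

vertex = -b1 / (2 * b2)        # x at which the parabola turns
```

The sign of b2 alone decides convex versus concave, which is what the plots on the next slide illustrate.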
- 147. Quadratic effect of the regression coefficient second-order model with β2x² [Plots: β2 = 1 yields a convex parabola; β2 = −1 a concave one] 6 / 20
- 148. Third-order models • A third-order (polynomial) model with a single predictor variable has p = 3 and the equation represents a cubic response function depicted as a sigmoid curve ª a ‘degree 3 polynomial’ y = β0 + β1x + β2x² + β3x³ + ε there are three regression coefficients that control for two curvatures 7 / 20
- 149. Cubic effect of the regression coefficients third-order model [Plots: the cubic response for β3 > 0 and for β3 < 0, with β1 and β2 held fixed] 8 / 20
- 150. Higher-order models and several predictor variables Models with order > 3 are seldom used in regression analysis ª typically because of overfitting in the model and the poor prediction power However, so far we have seen multiple regression equations involving several predictors that are related in an additive model ª that is, the effect of each IV was not influenced by the other variables As an illustration, consider a first-order model with two predictors (from the WORKING EXAMPLE) y = 5.47 − .03 x10 + .02 x9 (avg. household size as a function of access to water and informal employment) for x9 = 1 then ŷ = 5.49 − .03 x10 for x9 = 50 then ŷ = 6.47 − .03 x10 for x9 = 99 then ŷ = 7.45 − .03 x10 9 / 20
- 151. Additive model with 2 predictors [Plot: fitted line ŷ = 5.49 − 0.03 x10] 10 / 20
- 152. Additive model with 2 predictors [Plot: parallel fitted lines ŷ = 5.49 − 0.03 x10 and ŷ = 6.47 − 0.03 x10] 11 / 20
- 153. Additive model with 2 predictors [Plot: parallel fitted lines ŷ = 5.49 − 0.03 x10, ŷ = 6.47 − 0.03 x10, and ŷ = 7.45 − 0.03 x10] 12 / 20
- 154. Comparing models Note 3 Four models: (1) first order; (2) second order; (3) linear-log; (4) log-linear a) The t test is used to compare models (1) and (2) ª since (1) is the reduced version of (2) we can use the Fchange score for nested models, where t = √F b) Models (1) and (3) are not nested; we choose the one with the better fit c) Models (2) and (3) are not nested either, and we rely on the adjusted R2 since they have a different number of predictors (performances are almost identical here...) d) Comparing a log-linear model with an untransformed response requires another approach and is out of scope... 13 / 20
- 155. Regression models with interaction Many times the effect of a certain explanatory variable on the response is affected by the value of another predictor of the model In such cases there is an interaction between the two predictors, and the influence of these variables on y does not operate in a simple additive pattern A first order model with interaction: y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε where the effect of x1 on the response is influenced by x2 and vice-versa An interaction exists in the regression model when the effect of one predictor varies with the value of another predictor ª not easy to interpret 14 / 20
- 156. Example A model with the two predictors and interaction from the WORKING EXAMPLE y = 6.58 − .04 x10 + .00 x9 + .00 x10 x9 produces no interaction because in the model b3 equals zero ª this may be explained by the high correlation between y and x9 15 / 20
- 157. Estimating multiple regression with interaction An important concern with multiple regression is that lower order variables are highly correlated with their interactions Centering and standardization of the predictors correct this problem ª Centering implies re-scaling the predictors by subtracting the mean from each observation, and by dividing the centered scores by the standard deviation of the variable we standardize the predictors Model with interaction from the WORKING EXAMPLE with standardized values y = 1.11 − .50 x10 + .35 x9 + .16 x10 x9 for x9 = 1 then ŷ = 1.46 − .34 x10 for x9 = 2 then ŷ = 1.81 − .18 x10 which means that the fitted lines are not parallel as with the additive model 16 / 20
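The centering-then-scaling step described above can be sketched before forming the interaction term (toy predictor values, not the working example’s data):

```python
import statistics

# Toy predictors (hypothetical)
x1 = [2.0, 4.0, 6.0, 8.0]
x2 = [1.0, 3.0, 5.0, 7.0]

def standardize(v):
    # center (subtract the mean), then scale by the standard deviation
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(vi - m) / s for vi in v]

z1, z2 = standardize(x1), standardize(x2)
interaction = [a * b for a, b in zip(z1, z2)]   # product of standardized predictors
```

After standardizing, each predictor has mean 0 and standard deviation 1, which reduces the correlation between the main effects and their product term.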
- 158. Higher order models with interaction Higher order models with interaction produce quadratic, cubic (W, M or other shape) relationships between the response and each of the predictors Model with a quadratic relationship and interaction y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε will produce parabolas with crossing trajectories... 17 / 20
- 159. Regression with dummy variables Until now we have been doing regression analysis using interval scales of the data only However in many cases we may have qualitative data that are represented by a nominal scale, and treating this type of data as interval brings misleading results We can perform regression analysis by using dummy or indicator variables, which are artificial variables that encode whether or not an observation belongs to a certain group or category ª code 1 for belonging, and code 0 otherwise Indicator or dummy variables are just for classification purposes and their magnitude is not meaningful in this context 18 / 20
- 160. Regression with dummy variables For 3 categories we use 2 indicator variables:
Category 1: I1 = 1, I2 = 0
Category 2: I1 = 0, I2 = 1
Category 3: I1 = 0, I2 = 0
For 4 categories we use 3 indicator variables:
Category 1: I1 = 1, I2 = 0, I3 = 0
Category 2: I1 = 0, I2 = 1, I3 = 0
Category 3: I1 = 0, I2 = 0, I3 = 1
Category 4: I1 = 0, I2 = 0, I3 = 0
How many dummies are required for a variable having two categories? 19 / 20
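The m categories → m − 1 indicators coding above can be sketched as a small helper (the season labels are an illustrative example; the last category is the omitted baseline):

```python
# m categories -> m - 1 indicator variables; the last category is the baseline
def dummy_code(value, categories):
    return [1 if value == c else 0 for c in categories[:-1]]

seasons = ["winter", "spring", "summer", "fall"]   # "fall" is the omitted category
```

A baseline observation (here "fall") is coded as all zeros, which answers the slide’s question: a two-category variable needs just one dummy.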
- 161. Dummies with command-line We need to create a number of dummy variables according to the existing number of categories. Syntax in SPSS:
RECODE varlist_1 (oldvalue=newvalue) ... (oldvalue=newvalue)
  [INTO varlist_2]. [/varlist_n].
EXECUTE.
20 / 20
- 162. BUSINESS STATISTICS II Lecture – Week 19 Antonio Rivero Ostoic School of Business and Social Sciences May AARHUS UNIVERSITYAU
- 163. Today’s Outline Qualitative Variables Regression Models: Testing and Interpreting Results • indicators • multiple • interaction • (polynomial) • logarithmic transformations 2 / 24
- 164. Qualitative independent variables The effects of qualitative information on a response variable may be an important result, and we need ways to include this type of data in a regression model Qualitative information corresponds to a nominal scale that may require a pre-coding of the data into artificial variables known as dummies or indicator variables Recall that a nominal scale includes different categories or groups that serve to classify the observations, and qualitative predictors are factors A dichotomous factor has two categories (e.g. gender), whereas a polytomous factor has more categories (e.g. seasons) 3 / 24
- 165. Indicator variables (dummies) Indicator variables have only two values, typically 1 and 0, and for m categories in the variable we require m − 1 indicator variables ª this means that there is an omitted category in the representation to avoid redundancy Ii = 1 if the observation belongs to category ci, and 0 otherwise The omitted category represents the baseline or ‘reference’ category to which we compare the other groups ª the decision to choose the omitted category is arbitrary, and it leads to the same conclusions If we do not omit one category and include indicator variables for all categories in the regression model, then there is perfect multicollinearity among these independent variables ª a phenomenon known as the dummy variable trap 4 / 24
- 166. Dataset for Notes 3 and 4 training data337.sav Dependent variable: Wage, average hourly earnings (DKK) Independent variables: Educ, education (years) Tenur, current employment (years) Exper, potential experience (years) Female, gender (0: male, 1: female) (Male, gender (0: female, 1: male)) 5 / 24
- 167. Simple regression with an indicator variable (dichotomous factor) “The gender wage gap” Are women paid less than men according to the data? Wage = β0 + β1 Female + ε
(Estimate, Std. Error, t value, Pr(>|t|))
(Intercept): 161.9242 5.5013 29.43 0.0000
Female: −62.8700 8.1117 −7.75 0.0000
Women earn 62.87 DKK per hour less than men 6 / 24
- 168. Simple regression with an indicator variable II (dichotomous factor) For a variable Male = 1 − Female, and the model: Wage = β0 + β1 Male + ε we get the following results:
(Estimate, Std. Error, t value, Pr(>|t|))
(Intercept): 99.0542 5.9612 16.62 0.0000
Male: 62.8700 8.1117 7.75 0.0000
Likewise men earn 62.87 DKK per hour more than women 7 / 24
- 169. The dummy variable trap What about this model?: Wage = β0 + β1Female + β2Male + ε In this case there is a duplicated category and the independent variables are perfectly multicollinear Male is an exact linear function of Female and of the intercept ª Male = 1 − Female implies that Male + Female = 1 8 / 24
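The exact linear dependence behind the trap can be demonstrated on a few rows (hypothetical observations): every row of Female + Male reproduces the intercept column of ones.

```python
# Male = 1 - Female: the two indicators are perfectly (negatively) collinear
female = [0, 1, 1, 0, 1]          # hypothetical observations
male = [1 - f for f in female]

# Female + Male equals the constant column of ones, so together with the
# intercept the design matrix has linearly dependent columns
collinear = all(f + m == 1 for f, m in zip(female, male))
```

This is why least squares cannot produce unique coefficient estimates when both indicators and an intercept are included.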
- 170. Multiple Regression with a dichotomous indicator variable (factor and covariates) An additive dummy-regression model: Wage = β0 + β1Female + β2Educ + β3Tenure + ε • (We already know that the model fit or R2 never decreases when we add new independent variables to the model) • The model now assumes that – besides gender – there is an effect of education and tenure on the wage levels • Since the model is additive the predictors are independent of each other, and the regression equation fits identical slopes for all the categories in gender and for the other predictors as well ª which implies parallel regression lines in the scatterplot 9 / 24
- 171. Testing partial coefficients For the model:
Wage = β0 + β1 Female + β2 Educ + β3 Tenure + ε
Test the partial effect of gender: H0 : β1 = 0 vs. H1 : β1 ≠ 0
Test the partial effect of education: H0 : β2 = 0 vs. H1 : β2 ≠ 0
Test the partial effect of tenure: H0 : β3 = 0 vs. H1 : β3 ≠ 0 10 / 24
- 172. Testing partial coefficients The t-test statistic is the coefficient estimate minus its hypothesized value, divided by the standard error of the estimate:
ti = (bi − βi) / sbi

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -49.2529     20.5869    -2.39    0.0173
Female        -46.7547      7.1544    -6.54    0.0000
Education      13.9233      1.4564     9.56    0.0000
Tenure          3.2485      0.4729     6.87    0.0000
11 / 24
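The t values in the table above can be reproduced directly from the reported estimates and standard errors (a sketch; under H0 : βi = 0 the statistic reduces to ti = bi / sbi):

```python
# Reproduce the t values in the slide's table from the reported
# estimates and standard errors (t_i = b_i / s_bi under H0: beta_i = 0).
coefs = {  # name: (estimate, std. error), taken from the slide
    "Female": (-46.7547, 7.1544),
    "Education": (13.9233, 1.4564),
    "Tenure": (3.2485, 0.4729),
}
for name, (b, se) in coefs.items():
    print(name, round(b / se, 2))
# Female -6.54, Education 9.56, Tenure 6.87 -- matching the table
```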
- 173. Fitted values by gender: Additive model Wage = β0 + β1 Female + β2 Educ + β3 Tenure + ε [Figure: scatterplot of unstandardized predicted values against years of education, with a single fit line for the total sample; R² Linear = 0.435] 12 / 24
- 174. Fitted values by gender: Additive model Wage = β0 + β1 Female + β2 Educ + β3 Tenure + ε [Figure: scatterplot of unstandardized predicted values against years of education, with separate fit lines by gender; Male: R² Linear = 0.575, Female: R² Linear = 0.715] 13 / 24
- 175. Multiple regression with interaction: factor and covariate (indicator variable and continuous variable) • Additive models are often unrealistic, and theory may suggest different slopes for different categories • To capture such differences in slopes we assume statistical interaction among the independent variables
Wage = β0 + β1 Female + β2 Educ + β3 (Female × Educ) + ε

               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -18.1088     26.1498    -0.69    0.4891
Female         -23.9223     41.7171    -0.57    0.5667
Educ            13.7154      1.9550     7.02    0.0000
Female × Educ   -2.6485      3.1844    -0.83    0.4062

ª The effect of gender on wage is influenced by education and vice versa (not significant) 14 / 24
- 176. Fitted values by gender: Interaction model Wage = β0 + β1 Female + β2 Educ + β3 (Female × Educ) + ε [Figure: scatterplot of unstandardized predicted values against years of education, with separate fit lines by gender; Male: R² Linear = 1, Female: R² Linear = 1] 15 / 24
- 177. Testing interaction We can test for interaction in the model
Wage = β0 + β1 Female + β2 Educ + β3 (Female × Educ) + ε
The null hypothesis is that there is no interaction in the model, i.e.
H0 : β3 = 0    H1 : β3 ≠ 0
We now apply the general F (or incremental F) statistic...
Fchange = ((R²U − R²R) / df1) / ((1 − R²U) / df2)
where df1 = q = kU − kR (i.e. the number of variable restrictions), and df2 = n − kU − 1
ª In this case the complete or unrestricted model has the statistical interaction term whereas the reduced model does not have this term 16 / 24
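The incremental F statistic above is straightforward to compute; the sketch below uses hypothetical R² values and sample size for testing a single interaction term (q = 1), so the numbers are illustrative only:

```python
# Direct implementation of the incremental (F-change) statistic:
# F = ((R2_U - R2_R) / df1) / ((1 - R2_U) / df2)
def f_change(r2_u, r2_r, n, k_u, k_r):
    df1 = k_u - k_r          # number of restrictions q
    df2 = n - k_u - 1
    return ((r2_u - r2_r) / df1) / ((1 - r2_u) / df2)

# Hypothetical values: one interaction term (q = 1), n = 337,
# k_U = 3 predictors in the full model, k_R = 2 in the reduced one.
print(round(f_change(0.44, 0.43, n=337, k_u=3, k_r=2), 2))  # 5.95
```

The computed value would then be compared with the F critical value for (df1, df2) degrees of freedom.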
- 178. Testing interaction In an additive dummy-regression model it is possible to test for the effect of the categorical variable on the response controlling for a quantitative predictor, and vice versa (i.e. test for the effect of a covariate controlling for a factor) e.g. test gender on wage controlling for education, and test education controlling for gender In such cases the null hypothesis is that the coefficient of the variable to be tested equals zero 17 / 24
- 179. Multiple Regression with a polytomous indicator variable Data from Keller xm16-02.sav A polytomous indicator variable has more than two categories:
Price = β0 + β1 Odometer + β2 I1 + β3 I2 + ε
I1 = 1 if colour is white, 0 otherwise
I2 = 1 if colour is silver, 0 otherwise
• The reference category is 'all other colours', which is represented whenever I1 = I2 = 0 18 / 24
- 180. Multiple Regression with a polytomous indicator variable • In a multiple regression with a polytomous indicator variable we obtain coefficients for each group except for the reference category

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    16.8372      0.1971    85.42    0.0000
Odometer       -0.0591      0.0051   -11.67    0.0000
White           0.0911      0.0729     1.25    0.2143
Silver          0.3304      0.0816     4.05    0.0001

• The t-test is adequate for the covariate (i.e. odometer), but for colour we prefer to test the two indicator variables simultaneously, because the choice of the reference category is arbitrary ª the F test allows us to do this • Part of the interpretation of the results assumes that one or more indicator variables equal 0 19 / 24
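Using the coefficients in the table above, fitted prices across the colour groups can be compared at a fixed odometer reading; the odometer value below and the units are assumptions for illustration, not from the slide:

```python
# Fitted values from the polytomous-indicator model above:
# Price = b0 + b_odo * Odometer + b_white * I1 + b_silver * I2
b0, b_odo, b_white, b_silver = 16.8372, -0.0591, 0.0911, 0.3304
odo = 36.0  # hypothetical odometer reading (units as in Keller's data)

def price(white, silver):
    return b0 + b_odo * odo + b_white * white + b_silver * silver

print(round(price(0, 0), 4))  # baseline: all other colours
print(round(price(1, 0), 4))  # white car, same odometer
print(round(price(0, 1), 4))  # silver car, same odometer
```

Holding the odometer fixed, the white and silver fitted prices differ from the baseline by exactly their coefficients, 0.0911 and 0.3304.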
- 181. EXERCISE: MULTIPLE REGRESSION WITH A POLYTOMOUS INDICATOR VARIABLE [MBA data from Keller xm18-00.sav] 20 / 24
- 182. Interpreting Results Recall that the interpretation in regression analysis is on average, it considers the units of measure of the variables involved, and in additive models it holds constant the values of the other variables (including the error) In regression with indicator variables the coefficients of these variables represent a variation in the response with respect to the other groups in the model The statistical significance of the regression coefficients should accompany the interpretation of their effects on the response, not stand alone The conclusions should account for both the values of the regression coefficients and the statistical significance of these outcomes 21 / 24
- 183. Interpreting logarithmic transformations log is the natural logarithm, base e In Note 3, models (3) and (4) apply logarithmic transformations to variables, and we will see how to interpret the results of these models
Model (3), level-log: Wage = β0 + β1 Educ + β2 log(Tenure) + ε
In this level-log (linear-log) model, a one percent increase in years of experience (tenure) leads to a β2/100 unit change in wage:
unit ΔWage / % ΔTenure = b2 / 100
ª Since b2 = 31.32, then, holding education constant, a one percent increase in tenure is associated with a 0.3132 DKK increase in hourly wage on average 22 / 24
- 184. Interpreting logarithmic transformations Model (4), log-level: log(Wage) = β0 + β1 Educ + β2 Tenure + ε
This is a log-linear model where a one unit increase in a predictor leads to a bi × 100% change in wage:
% ΔWage / unit Δxi = bi × 100
• Holding education constant, a one year increase with the current employer is associated with a 2.5% increase in hourly wage on average • Holding tenure constant, one more year of education is associated with a 10.4% increase in hourly wage on average 23 / 24
- 185. Interpreting logarithmic transformations log-log models are interpreted as elasticities, i.e. the ratio of the percent change in one variable to the percent change in another variable:
% Δy / % Δxi = bi
• A one percent change in xi is associated with a bi% change in y (ceteris paribus) partial elasticity when we hold the other variables constant 24 / 24
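The three interpretation rules for logarithmic transformations can be collected in small helper functions (a sketch; the function names are hypothetical, and b denotes a fitted coefficient):

```python
# Interpretation rules for models with logged variables.
def level_log_effect(b, pct_change_x=1.0):
    """Level-log: unit change in y for a pct_change_x % change in x."""
    return b / 100 * pct_change_x

def log_level_effect(b, unit_change_x=1.0):
    """Log-level: % change in y for a unit_change_x change in x."""
    return b * 100 * unit_change_x

def log_log_effect(b, pct_change_x=1.0):
    """Log-log (elasticity): % change in y for a pct_change_x % change in x."""
    return b * pct_change_x

print(round(level_log_effect(31.32), 4))  # 0.3132 DKK per 1% more tenure (model 3)
print(round(log_level_effect(0.025), 1))  # 2.5% wage per extra year of tenure (model 4)
```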
- 186. BUSINESS STATISTICS II Lecture – Week 20 Antonio Rivero Ostoic School of Business and Social Sciences May AARHUS UNIVERSITYAU
- 187. Today’s Outline Exam 2013 • Comparing groups (Q1) • Regression analysis (Q3 and Q6) 2 / 27
- 188. Basic terminology data... quantitative = interval qualitative = categorical, nominal for regression... Dependent variable, y = response variable; prediction; predicted variable, ˆy Independent variable, x = explanatory variable; predictor; factor (qualitative), covariate (quantitative) 3 / 27
- 189. Exam 2013 The 2013 exam had 8 questions, some of which were based on a single data set The data set contained 13 labor market related variables (one of them transformed) among 762 observations from men and women ª however not all variables were needed to answer the questions After you read the instructions carefully, check the data with the software, and label the variables with the provided descriptions and units of measure (if specified) 4 / 27
- 190. Comparing groups Q1a) Do wages differ by gender? • Implied variables: Wage K, and gender B • Groups to compare: Wages for men and wages for women Plot data in SPSS ª Plot a histogram for wage grouped by gender (B) Graphs Legacy Plots Histogram where the variable is paneled by the two groups (optional normal curve)... 5 / 27
- 191. Comparing groups Q1a) Do wages differ by gender? We compare the means of these two groups through the t test However we first need to see whether these groups have equal variances or not ª to know whether to use the pooled or the unpooled version of the t test Thus we perform the F test for equality of variances first Obtain basic descriptive statistics in SPSS Analyze Reports Case Summaries... where the variable is paneled by the two groups... ª uncheck Display Cases and choose statistics 6 / 27
- 192. Review: F test and sample variance
H0 : σ1²/σ2² = 1    H1 : σ1²/σ2² ≠ 1
F = (s1²/σ1²) / (s2²/σ2²) = s1²/s2² under H0
for ν1 = n1 − 1 and ν2 = n2 − 1
where for 1, 2, ..., n observations the sample variance is
s² = Σi (xi − x̄)² / (n − 1) 7 / 27
- 193. Review: t test and sample mean independent samples and H0 : μ1 = μ2
pooled (when σ1² = σ2²), with ν = n1 + n2 − 2:
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √( sp² (1/n1 + 1/n2) )
where sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)
unpooled (when σ1² ≠ σ2²):
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √( s1²/n1 + s2²/n2 )
with ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
where for 1, 2, ..., n observations the sample mean is x̄ = Σi xi / n 8 / 27
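As a sketch, the F ratio and the pooled and unpooled t statistics above can be computed from summary statistics alone; the group means, variances, and sizes below are hypothetical, not the exam data:

```python
import math

# Two-sample machinery from summary statistics (means, variances, sizes).
def f_ratio(s2_1, s2_2):
    """F test statistic for equality of variances (under H0)."""
    return s2_1 / s2_2

def t_pooled(x1, x2, s2_1, s2_2, n1, n2):
    """Pooled t statistic for H0: mu1 = mu2 (assumes equal variances)."""
    s2_p = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
    return (x1 - x2) / math.sqrt(s2_p * (1 / n1 + 1 / n2))

def t_unpooled(x1, x2, s2_1, s2_2, n1, n2):
    """Unpooled t statistic with the Satterthwaite degrees of freedom."""
    se2 = s2_1 / n1 + s2_2 / n2
    t = (x1 - x2) / math.sqrt(se2)
    df = se2 ** 2 / ((s2_1 / n1) ** 2 / (n1 - 1) + (s2_2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics for two wage groups:
print(f_ratio(2500.0, 1600.0))  # 1.5625
t, df = t_unpooled(150.0, 120.0, 2500.0, 1600.0, 400, 362)
print(round(t, 2), round(df))
```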
- 194. F test to wages by gender After obtaining the F statistic, we check the critical values with the respective degrees of freedom and the standard alpha value ª use the Excel calculator and/or the table for the F-distribution In this case the F ratio falls within the critical region, which means that we reject H0 of equal variances, i.e. we reject that the variance ratio equals 1 ª the p-value indicates that the result is statistically significant Both outcomes suggest that there is evidence to infer that the variances differ We now know that we can proceed with the analysis applying the unpooled t test 9 / 27
- 195. t test to wages by gender Q1a) Do wages differ by gender? Although in this part the calculations are done by hand, you can compare your results with the output from SPSS t test in SPSS Analyze Compare Means Independent-Samples T Test... where the test variable K is paneled by the two groups in B ª We Define Groups... by entering the 0 and 1 that characterize the gender variable Confidence intervals are also given in the table of the t test for independent samples... 10 / 27
- 196. Comparing groups Q1b) Find a 95% confidence interval for tenure by gender • Implied variables: Tenure G, and gender B • Groups to compare: Tenure for men and for women In this case we use the pooled t test with the confidence interval estimator 11 / 27
- 197. Review: Confidence intervals for t test pooled Confidence interval estimator of μ1 − μ2 when σ1² = σ2²
(x̄1 − x̄2) ± tα/2 √( sp² (1/n1 + 1/n2) )
for ν = n1 + n2 − 2 12 / 27
- 198. Comparing groups Q1c) Find a 95% CI by gender with 15 years of education • Implied variables: Education I, and gender B • Groups to compare: Men and women, 15 yrs. of educ. In this case the difference is between population proportions 13 / 27
- 199. Review: Confidence Interval of p1 − p2
(p̂1 − p̂2) ± zα/2 √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
for unequal proportions, provided n1p̂1, n1(1 − p̂1), n2p̂2, and n2(1 − p̂2) ≥ 5
For the number of successes in the two populations, x1 and x2:
p̂1 = x1/n1 and p̂2 = x2/n2 14 / 27
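The interval estimator above, expressed as a function; the success counts and group sizes below are hypothetical, not the exam's values:

```python
import math

# 95% confidence interval for p1 - p2 from counts; z = 1.96 for 95%.
def ci_p1_minus_p2(x1, n1, x2, n2, z=1.96):
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - z * se, p1 - p2 + z * se

# Hypothetical counts: 120 of 400 men vs. 95 of 362 women with the
# schooling level of interest.
lo, hi = ci_p1_minus_p2(120, 400, 95, 362)
print(round(lo, 3), round(hi, 3))
```

Since this particular interval covers 0, these hypothetical proportions would not differ significantly at the 5% level.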
- 200. Test of proportion p1 − p2 As when we compared means, the calculations for proportions should be made manually. However to ﬁnd x1, x2, and n1, n2 you can use SPSS Proportion success in SPSS Analyze Descriptive Statistics Crosstabs... where I is contrasted by B - Alternatively, you can create new indicator variable, say PL15 = 1 iff I 3 Transform Recode into Different Variables , and in Old and New Values... recode to 1 the Range category 3 through 4, and 0 otherwise, after naming the new variable Then get a report Analyze Reports Case Summaries... where PL15 is the Variable that is grouped by B, and specifying Number of Cases and Sum in Statistics 15 / 27
- 201. Test of proportion p1 − p2 By combining the two categorical variables, we obtain the sample estimates for the proportions of men and of women ª and we are able to proceed with the arithmetic calculations For a 95% confidence interval, the multiplier zα/2 is 1.96 ª the score comes from the z table in Keller for 1 − (.05/2) 16 / 27
- 202. Comparing groups Summary comment Q1a. Men earn significantly more than women Q1b. With 95% confidence, the difference interval of 2.8 to 4 years indicates that men have more years of market experience than women Q1c. With 95% confidence, the difference interval of 10 to 24 percentage points indicates that men have a lower schooling level than women • Relate the implied variables of wage, tenure, and education level for both groups • Explain why the differences might occur in that way... ª eventually using other variables from the data 17 / 27
- 203. Regression analysis Q4a) Estimation and regression diagnostics for an additive log linear regression model Dependent variable: M, the natural logarithm of K, wage (hourly) Independent variables: B, gender (male = 1, female = 0) C, education (years) G, market experience or tenure (years) 18 / 27
- 204. Regression analysis Q4. where log = ln The regression equation represents a log-level model:
ln(Wage) = β0 + β1 Male + β2 Educ + β3 Tenure + ε

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     4.4353      0.0594    74.65    0.0000
Male            0.1268      0.0132     9.63    0.0000
Educ            0.0431      0.0031    13.79    0.0000
Tenure          0.0089      0.0019     4.69    0.0000
19 / 27
- 205. Model diagnostics Multiple regression After performing the linear regression analysis... Check the assumption ε | x ∼ N(0, σ²) and evaluate multicollinearity by: • looking at the correlation among the variables • viewing the histogram of the standardized residuals for the model • plotting the residuals against predicted values 20 / 27
- 206. Regression results Q4b) Interpretation of the estimation results The fitted model is
ŷ = 4.435 + .127 · B + .043 · C + .009 · G
This means that men earn 12.7% more than women, and that wages rise by 4.3% and by almost 1% for an extra year of education and market experience respectively ª interpretation as ceteris paribus or all things being equal We therefore interpret individual outcomes in the log-level model as the percentage change in y per unit change in xi 21 / 27
- 207. Regression results Q4c. However sometimes we need fitted values in the units of measure of the untransformed response, given a set of values of the IVs ª e.g., How much wage is expected for a man or a woman having 12 years of education and 15 years of experience? In this case we apply the exponential function to both sides of the regression equation
e^ln(ŷ) = e^(b0 + b1x1 + b2x2 + b3x3) where e ≈ 2.718282
For this model it means that we obtain the value of K rather than the value of ln(K) 22 / 27
- 208. Regression results Q4c. The ﬁtted value for a man with 12 years of education, and 15 years of market experience is ˆy = 4.435 + .127 · 1 + .043 · 12 + .009 · 15 = 5.213 and the expected return on wage is e5.213 = 183.64 hourly (in DKK) On the other hand, for a woman with similar level of education and experience the ﬁtted value is ˆy = 4.435 + .127 · 0 + .043 · 12 + .009 · 15 = 5.086 and the expected return on wage is e5.086 = 161.69 hourly (in DKK) 23 / 27
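The fitted values above can be checked in a few lines, using the rounded coefficients from the slide; note that exponentiating the rounded woman's fitted value gives 161.74 rather than the slide's 161.69, a small discrepancy that reflects the unrounded estimates:

```python
import math

# Fitted log-wage and back-transformed wage from the log-level model.
b0, b_male, b_educ, b_tenure = 4.435, 0.127, 0.043, 0.009

def fitted_log_wage(male, educ, tenure):
    return b0 + b_male * male + b_educ * educ + b_tenure * tenure

yhat_m = fitted_log_wage(1, 12, 15)   # man, 12 yrs educ, 15 yrs exper
yhat_f = fitted_log_wage(0, 12, 15)   # woman, same education/experience

print(round(yhat_m, 3), round(math.exp(yhat_m), 2))  # 5.213 183.64
print(round(yhat_f, 3), round(math.exp(yhat_f), 2))  # 5.086 161.74
```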
- 209. Prediction interval Manual regression analysis Q6) Construct a 95% prediction interval for y given x = 30 Where the fitted line for n = 100 and R² = .755 is
ŷ = 6.92 + .237x
Some descriptives for location and dispersion are: • x̄ = 13 and sx² = 121 • ȳ = 10 and sy² = 9 And the ANOVA table shows: • SSR = 672.61, df = 1 • SSE = 218.39, df = 98 24 / 27
- 210. Prediction interval Q6. The prediction interval for ŷ* | x* = 30
ŷ* ± tα/2, n−k−1 · √( MSE · (1 + 1/n + (x* − x̄)² / ((n − 1) · sx²)) )
where MSE = s² = SSE / (n − k − 1)
ª In this case, the point estimate ŷ* on the left of the prediction interval is computed from the regression coefficients with k = 1
ª the multiplier can be obtained from the MS Excel calculator for the t distribution 25 / 27
- 211. Prediction interval Q6. without ANOVA table It is also possible to calculate SSE from the sample variances and R²
SSE = (n − 1) ( sy² − sxy² / sx² )
where the squared covariance sxy² = R² · sx² · sy² (this follows from R² = sxy² / (sx² sy²))
Or alternatively: SSE = SSy · (1 − R²) where SSy = (n − 1) · sy²
Thus there are various possibilities for the calculations... 26 / 27
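Putting Q6 together numerically from the summary numbers in the slides; the t multiplier 1.984 for t.025,98 is taken from a t table here (an assumption, since the slides use the Excel calculator):

```python
import math

# 95% prediction interval for y at x* = 30, from the Q6 summary numbers.
n, k = 100, 1
b0, b1 = 6.92, 0.237
x_bar, s2_x = 13.0, 121.0
SSE = 218.39
x_star = 30.0
t_mult = 1.984  # t_{.025, 98}, taken from a t table (assumption)

mse = SSE / (n - k - 1)
y_hat = b0 + b1 * x_star
se_pred = math.sqrt(mse * (1 + 1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s2_x)))
print(round(y_hat, 2))  # point estimate 14.03
print(round(y_hat - t_mult * se_pred, 2), round(y_hat + t_mult * se_pred, 2))

# SSE can also be recovered without the ANOVA table, as on the slide:
s2_y, R2 = 9.0, 0.755
print(round((n - 1) * s2_y * (1 - R2), 2))  # SS_y (1 - R^2), close to 218.39
```

The two routes to SSE agree up to the rounding of R².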
- 212. Thank you Good luck!