This presentation simplifies regression analysis. Starting with simple linear regression and moving to multiple regression analysis, it covers the key statistics and the interpretation of various diagnostic plots. It also shows how to verify regression assumptions and some advanced concepts for choosing the best model, and SAS program code for two examples is included.
2. Index
• What is Regression Analysis?
• Simple Regression Theory
• Example 1: House Price Model
• Run Simple Regression Using SAS
• Steps & Assumptions of Regression
• Multiple Regression Analysis
• Significance Testing
• Coefficient of Determination
• Example 2: Credit Card Model
• Model Selection
• Verify Regression Assumptions
• Regression Diagnostics
• Run Multiple Regression Using SAS
3. WHAT IS REGRESSION ANALYSIS?
• When two or more things are related to each other and we want to quantify the relationship between them, regression analysis is the right technique
• It goes beyond correlation by creating a mathematical equation to estimate or predict values within the range framed by the data
• The regression procedure demands at least one dependent variable and one or more independent variables
• The dependent variable (also known as the outcome or response variable) is modeled in terms of the independent variables (also called explanatory or predictor variables)
• Associative relationships between these variables are analyzed by regression analysis
• It is commonly used in forecasting, time series modelling, financial analysis, and market research to find cause-and-effect relationships between variables
Scatter Diagram
4. SIMPLE REGRESSION THEORY
• Let’s begin with simple linear regression, which is easier to understand
• Remember the linear equation ‘y = mx + c’ from high school, which describes a straight line fitted to data
• In simple regression, this equation is modified to ‘y = β0 + β1x + ε’, where y is the dependent variable and x is the independent variable
• β0, like the y-intercept c, is the estimated value of y when x is zero; β1, like the slope m, is the estimated change in the average value of y resulting from a unit change in x; and ε is the error term
• The error term is needed because the regression model is based on a sample rather than the population (sample estimates usually differ from the true population values)
• That is why the Ordinary Least-Squares (OLS) procedure is used for selecting the model parameters (b0 and b1): it minimizes the sum of the squared differences between y and ŷ and determines the best-fitting line (see the closed forms below)
• The objective is always to minimize the error, which is the difference between the observed values and the predicted values generated by the model ‘ŷ = b0 + b1x’
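For reference, the OLS estimates have simple closed forms (a standard result, stated here in the deck’s own notation, with x̄ and ȳ the sample means):

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1 · x̄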
5. EXAMPLE 1: HOUSE PRICE MODEL
• A real estate company wants to examine the relationship between the selling price of a home (in $1000s) and its size (in square feet) for a specific region
• It selects a random sample of 10 houses
• The scatterplot of the data points shows a positive linear relationship
• The larger the size of the house, the higher the price of the house
6. STATISTICS: HOUSE PRICE MODEL

Dependent Variable (Y): House Price (in $1000s)
Independent Variable (X): Size (in square feet)
R-Square: 0.5808    Adj R-Sq: 0.5284
Dependent Mean: 286.5    Coeff Var: 14.42594
Root MSE: 41.33032    Parameters: 2    Observations: 10

Analysis of Variance (ANOVA)
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1    18935             18935          11.08      0.0104
Error              8    13666             1708.19565
Corrected Total    9    32601

Parameter Estimates
Variable     Label        DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    Intercept     1    98.24833              58.03348          1.69       0.1289
X            Size          1    0.10977               0.03297           3.33       0.0104
7. INTERPRETATION: HOUSE PRICE MODEL
• First, look at the ANOVA results: the p-value (Pr > F = 0.0104) is less than 0.05, meaning the null hypothesis is rejected
• Second, the R-Square value is 0.5808, which means that 58.08% of the variation in house prices is explained by house size
• The regression model makes sense only when it fits the data better than the baseline model, meaning the slope of the regression line is not equal to zero
• From the parameter estimates, the House Price Model is ŷ = 98.24833 + 0.10977x
• Since the prices are in thousands of dollars, for each additional square foot, the average value of the house increases by 0.10977 ($1000) = $109.77
• For example, the expected price of a 2,000 square foot house would be 98.24833 + 0.10977 × 2000 = 317.78833 ($1000s) = $317,788.33
• Estimation and prediction should happen only within the range of data that was used for the regression analysis; otherwise the results are doubtful
• The remaining statistics will be discussed under Multiple Regression Analysis
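One more consistency check using only the numbers from the tables above: in simple regression, the overall F-test and the slope t-test are equivalent, so F = t². Here 3.33² ≈ 11.09, matching the reported F value of 11.08 up to rounding, and both tests share the same p-value of 0.0104.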
8. RUN SIMPLE REGRESSION USING SAS
• Copy and paste the code below into the SAS program editor

DATA House;
   input Y X;
   label Y = 'House Price in $1000s';
   label X = 'Size in Square Feet';
   datalines;
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
;
ods graphics on;
title1 'Simple Regression Analysis';
title2 'House Price Model';
proc reg data=House plots(only)=fitplot;
   model Y = X;
run;
quit;   /* PROC REG is interactive; QUIT ends it */
ods graphics off;
title;
9. STEPS & ASSUMPTIONS OF REGRESSION
Step 1 Formulate the problem
Step 2 Define dependent & independent variables
Step 3 Build the general model
Step 4 Plot the scatter diagram
Step 5 Estimate the parameters
Step 6 Estimate the regression coefficient
Step 7 Test for significance
Step 8 Find the strength of the association
Step 9 Check the prediction accuracy
Step 10 Examine the residuals
Step 11 Cross-validate the model
• Linearity of the phenomenon measured, meaning the mean of the dependent variable is linearly related to the independent variable
• Errors are normally distributed with a mean of zero
• Errors have equal variances; in other words, the variance of the error term is constant (homoscedasticity)
• Errors are independent, meaning uncorrelated with one another
10. MULTIPLE REGRESSION ANALYSIS
• More powerful than simple regression, as it involves a single dependent variable and two or more independent variables
• The dependent variable should be interval-scaled, and the other variables should be metric or appropriately transformed
• It analyzes the impact of a set of independent variables on the dependent variable
• The equation for multiple regression is ‘y = β0 + β1x1 + β2x2 + … + βnxn + ε’, where y is the dependent variable and x1, x2, …, xn are the independent variables
• The predicted values are generated by the model ‘ŷ = b0 + b1x1 + b2x2 + … + bnxn’, where b0, b1, b2, …, bn are the estimators of β0, β1, β2, …, βn
• The model parameters are estimated using the Ordinary Least-Squares (OLS) procedure, which minimizes the sum of the squared differences between y and ŷ and determines the best fit
• Before performing multiple regression, it is always recommended to check the correlation among variables to avoid multicollinearity issues, as in the sketch below
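A minimal SAS sketch of that correlation check (it assumes the CreditCard data set created in the code on slide 19 of this deck):

PROC CORR DATA=CreditCard;   /* pairwise Pearson correlations */
   VAR X1 X2 X3;             /* screen the candidate predictors */
RUN;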
11. SIGNIFICANCE TESTING
• Significance tests provide justification for accepting or rejecting a given hypothesis
• In ANOVA, the null hypothesis is that all population means are equal, and the alternative hypothesis is that not all of the population means are equal. It is assumed that the populations are normal and have equal variances.
• To test the hypothesis, the F ratio is calculated; it has to exceed the critical value of the F (Fisher) distribution (based on the degrees of freedom) to show that the model fits the data better than the baseline model
• The results include a p-value, which should be lower than 0.05 to confirm that a relationship exists between the dependent and independent variables
• Testing the significance of the individual model parameters is done in a similar manner, but using the t-test statistic
• In regression, there are three types of sums of squares: variation explained by the model (SSM), unexplained variation or error (SSE), and total variation (SST)
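As a worked check in the deck’s own numbers, the F ratio is the mean square for the model divided by the mean square error:

F = MSM / MSE = (SSM / DFmodel) / (SSE / DFerror)

For the house price model on slide 6: F = 18935 / 1708.19565 ≈ 11.08, matching the reported value, with Pr > F = 0.0104 < 0.05.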
12. COEFFICIENT OF DETERMINATION
• The coefficient of determination (R²) expresses the strength of association: R² = SSM / SST
• It measures the percentage of the variation in the dependent variable that is explained by the independent variables
• A value of R² close to 1 means the regression line fits the data almost perfectly, whereas a value close to 0 means the model does not fit the data well
• R² keeps increasing as more independent variables are added to the model, so results can be misleading
• After the first few variables, additional independent variables do not contribute much; in the credit card example later in this deck, R² barely improves once a third variable enters
• Adjusted R² tells the percentage of variation explained by only those independent variables that actually affect the dependent variable (see the formula below)
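For completeness, a standard formula for adjusted R², stated here with n as the number of observations and p as the number of independent variables:

Adj R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

As a check against the credit card model on slide 14: 1 − (1 − 0.8614) · 7 / 5 = 1 − 0.194 ≈ 0.806, matching the reported Adj R-Sq of 0.8059.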
13. EXAMPLE 2: CREDIT CARD MODEL
• A bank wants to predict the number of credit cards that a family uses (Y) based on the following data: Family Number (ID), Family Size (X1), Family income in thousand dollars (X2), and Number of automobiles owned (X3)
• A sample of 8 families is used in the analysis
• The objective is a model that predicts better than the baseline (the sample mean, ȳ = 56/8 = 7), i.e., one with a smaller sum of squared prediction errors

Family ID    Actual No. of Credit Cards (y)    Baseline Prediction (ŷ = ȳ)    Prediction Error (y − ȳ)    Squared Error (y − ȳ)²
1            4                                 7                              −3                          9
2            6                                 7                              −1                          1
3            6                                 7                              −1                          1
4            7                                 7                               0                          0
5            8                                 7                               1                          1
6            7                                 7                               0                          0
7            8                                 7                               1                          1
8            10                                7                               3                          9
Total        56                                (ȳ = 56/8 = 7)                  0                          22
14. STATISTICS: CREDIT CARD MODEL

Dependent Variable (Y): No. of Credit Cards
Independent Variables (X1 & X2): Family Size & Family Income
R-Square: 0.8614    Adj R-Sq: 0.8059
Dependent Mean: 7.0    Coeff Var: 11.157
Root MSE: 0.78099    Parameters: 3    Observations: 8

Analysis of Variance (ANOVA)
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              2    18.95027          9.47514        15.53      0.0072
Error              5    3.04973           0.60995
Corrected Total    7    22.00000

Parameter Estimates
Variable     Label            DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    Intercept         1    0.48169               1.46141           0.33       0.7551
X1           Family Size       1    0.63224               0.25231           2.51       0.0541
X2           Family Income     1    0.21585               0.10801           2.00       0.1021
15. INTERPRETATION: CREDIT CARD MODEL
• The ANOVA results show that the p-value (Pr > F = 0.0072) is less than 0.05, meaning the null hypothesis is rejected and a relationship exists between Y and X1 & X2
• In this model, the unexplained variation (SSE) is 3.04973, far less than the baseline model’s squared prediction error of 22
• The R-Square value is 0.8614, which means that 86.14% of the variation in credit card usage is explained by this model
• When X3 was included, the adjusted R-Square decreased; hence X3 was not included in this model, as it was statistically insignificant
• From the parameter estimates, ŷ = 0.482 + 0.63·X1 + 0.216·X2
• Assuming a family size (X1) of 4 and an annual income (X2) of 17.5 (i.e., $17,500), the predicted number of credit cards is 6.782 (using the equation above); rounding up to 7 cards introduces an error of 0.218
• Estimation and prediction should happen only within the range of data that was used for the regression analysis; otherwise the results are doubtful
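A quick consistency check against the ANOVA table on slide 14: R² = SSM / SST = 18.95027 / 22 ≈ 0.8614, and Root MSE = √MSE = √0.60995 ≈ 0.781, both matching the reported statistics.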
16. MODEL SELECTION
• For effective modeling, one should always choose the best model, validate regression assumptions, detect influential observations, and check collinearity
• Let’s understand model selection. Whether you run regressions manually or use stepwise selection, the objective is always a better model that explains more variation (an R-Square value closer to 1 is expected)
• Stepwise regression is often used when there are many variables, because this method automatically chooses the best possible combination of variables based on their p-values (a SAS sketch follows the table below)
• Below is the summary of statistics showing how each variable entering the model influenced R-Square and Adjusted R-Square
• When X3 entered the model, the Adjusted R-Square decreased, suggesting the variable be dropped from the model

Variable entered in model    R-Square    Adjusted R-Square    F Value    Pr > F
X1                           0.7506      0.7091               18.06      0.0054
X2                           0.8614      0.8059               15.53      0.0072
X3                           0.8720      0.7761                9.09      0.0294
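A minimal sketch of stepwise selection in SAS, assuming the CreditCard data set from slide 19; SLENTRY and SLSTAY set the p-value thresholds for a variable to enter and to stay in the model (0.15 is the usual PROC REG default for stepwise):

PROC REG DATA=CreditCard;
   MODEL Y = X1 X2 X3 / SELECTION=STEPWISE SLENTRY=0.15 SLSTAY=0.15;
RUN;
QUIT;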
17. VERIFY REGRESSION ASSUMPTIONS
• To confirm the normality of the error term, check the residual histogram and distribution curves (a SAS sketch is shown below)
• Looking at the residual plot, one can verify the other two assumptions, equal variance and independence: the errors should be scattered randomly, with no visible pattern
• Linearity can be judged from the original scatterplot and from the residual-by-predicted plot; systematic curvature in the residuals suggests the linear form is wrong
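A minimal sketch of the normality check in SAS, again assuming the CreditCard data set from slide 19; the residual variable name Resid is our choice:

PROC REG DATA=CreditCard;
   MODEL Y = X1 X2;
   OUTPUT OUT=Resids R=Resid;         /* save raw residuals to a data set */
RUN;
QUIT;

PROC UNIVARIATE DATA=Resids NORMAL;   /* NORMAL adds formal normality tests */
   VAR Resid;
   HISTOGRAM Resid / NORMAL;          /* histogram with a fitted normal curve */
RUN;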
18. REGRESSION DIAGNOSTICS

Influential observations
The R-Square value can be affected by outliers or influential observations, so it is necessary to look at the RStudent plot. Usually, absolute values greater than 2 are considered outliers (3 for large sample sizes). Cook’s D, DFFITS, and DFBETAS are other useful statistics.

Multicollinearity
It occurs when two or more independent variables are highly correlated with each other, which leads to instability in the regression model. To measure the magnitude of collinearity in a model, the VIF (Variance Inflation Factor) is used; values up to 10 are generally accepted.

Variable    VIF
X1          1.82692
X2          1.93492
X3          1.09976

In the credit card example, there is no collinearity issue, as all VIF values are well below the accepted threshold of 10.
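To reproduce these VIF values, the option can be requested on the MODEL statement (a sketch assuming the CreditCard data set from slide 19):

PROC REG DATA=CreditCard;
   MODEL Y = X1 X2 X3 / VIF;   /* prints a variance inflation factor per predictor */
RUN;
QUIT;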
19. RUN MULTIPLE REGRESSION USING SAS
• Copy and paste the code below into the SAS program editor

DATA CreditCard;
   INPUT ID Y X1 X2 X3;
   LABEL ID = 'Family Number';
   LABEL Y  = 'Number of Credit Cards';
   LABEL X1 = 'Family Size';
   LABEL X2 = 'Family income in $000';
   LABEL X3 = 'Number of cars owned';
   DATALINES;
1 4 2 14 1
2 6 2 16 2
3 6 4 14 2
4 7 4 17 1
5 8 5 18 3
6 7 5 21 2
7 8 6 17 1
8 10 6 25 2
;
ODS GRAPHICS ON;
TITLE1 'Multiple Regression Analysis';
TITLE2 'Credit Card Model';
PROC REG DATA=CreditCard
   PLOTS(ONLY)=(RESIDUALHISTOGRAM RESIDUALBYPREDICTED RSTUDENTBYPREDICTED
                COOKSD DFFITS DFBETAS DIAGNOSTICS);
   MODEL Y = X1 X2;
RUN;
QUIT;   /* PROC REG is interactive; QUIT ends it */
ODS GRAPHICS OFF;
TITLE;