Chapters 11 & 12: Linear and Multiple Regression with Minitab
IE 609, 3/22/2010

Chapter 11: Simple Linear Regression and Correlation

The Relation between Two Sets of Measures
• Construct a scatter diagram for the following data.
• Plot the results.
The Relation between Two Sets of Measures
• You might have reversed the axes so that the vertical dimension represented the midterm grade and the horizontal dimension, the final grade.
• When one measure may be used to predict another, it is customary to represent the predictor on the horizontal dimension (the x-axis).
• A linear (straight-line) relationship.
The Relation between Two Sets of Measures
• Other relationships.
• Which of the diagrams represents the stronger relationship?
Simple Linear Regression
• Population model: y = α + βx
• Sample model: yi = a + bxi + εi
Simple Linear Regression
• Minitab data entry: Table 11.1, pg. 393.
• Calc > Column Statistics
Simple Linear Regression
• Calc > Calculator (create the formula and store the result in the variable Residual).
• Graph > Probability Plot: the residuals appear normally distributed.
Linear Regression and Correlation: Simple Structure
Question: Is the sample mean of Demand the correct value to use for ŷ?

    yi = ŷ + εi
    ŷ → sample mean of Demand, y (%) = 34.0606 (Minitab)
    εi = yi − ŷ (Minitab "Residual")
    Sample variance of y = (10.7)²

• Although it might seem to be a trivial question, you might ask why the sample mean (y-bar) is the correct value to use for ŷ.
• Since the purpose of the model is to accurately describe the yi, we would expect the model to deliver small errors (that is, small εi). But how should we go about making the errors small?
Linear Regression: Simple Structure
Question: Is the sample mean of Demand the correct value to use for ŷ?
• A logical choice is to pick ŷ, which might be different from the sample mean, so that the error variance s², calculated with εi = yi − ŷ, is minimized:

    s² = Σ(i=1 to n) εi² / (n − 1)

• The error sum of squares to be minimized is:

    Σεi² = Σ(yi − ŷ)²

• The calculus operation that delivers this solution is the method of least squares, so called because the method minimizes the error sum of squares.
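The claim above can be checked numerically: among all constant choices of ŷ, the sample mean minimizes the error sum of squares. This is a minimal sketch with invented demand values (only the mean 34.0606 is quoted in the slides, not the raw Table 11.1 data):

```python
# Sketch: the constant y-hat that minimizes the error sum of squares
# is the sample mean. The demand values below are hypothetical
# stand-ins for the Table 11.1 data, not the book's numbers.
y = [28.0, 41.5, 33.2, 30.8, 36.9, 25.4, 39.1, 31.6]

def sse(c, ys):
    """Error sum of squares for the constant model y-hat = c."""
    return sum((yi - c) ** 2 for yi in ys)

ybar = sum(y) / len(y)

# Scan candidate values of y-hat around the mean; none beats the mean.
candidates = [ybar + d / 10.0 for d in range(-50, 51)]
best = min(candidates, key=lambda c: sse(c, y))

print(round(ybar, 4), round(best, 4))
```

Because Σ(yi − c)² is a convex function of c with its minimum exactly at c = ȳ, the scan always lands back on the sample mean.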
Linear Regression: Simple Structure
• Now consider the scatter diagram below: y appears to increase linearly with respect to x.
• There might be an underlying causal relationship between x and y of the form y = α + βx.
• The parameters α and β are the y-axis intercept and slope, respectively.
• Since we typically have sample data and not the complete population of (x, y) observations, we cannot expect to determine α and β exactly; they will have to be estimated from the sample data. Our model is of the form:

    yi = a + bxi + εi
Linear Regression: Simple Structure
• For any choice of a and b, the εi may be determined from:

    εi = (yi − ŷi) = yi − (a + bxi)

• These errors, or discrepancies εi, are also called the model residuals.
• Although this equation allows us to calculate the εi for a given (xi, yi) data set once a and b are specified, there are still an infinite number of a and b values that could be used in the model. Clearly the choice of a and b that provides the best fit to the data should make the εi, or some function of them, small. Although many conditions can be stated to define best-fit lines by minimizing the εi, by far the most frequently used condition is the one that minimizes Σεi².
Linear Regression: Simple Structure
• The best-fit line for the (x, y) data is called the linear least squares regression line; it corresponds to the choice of a and b that minimizes Σεi².
• The calculus solution to this problem is given by the simultaneous solution of the two equations:

    ∂/∂a Σ(i=1 to n) εi² = 0
    ∂/∂b Σ(i=1 to n) εi² = 0

• The method of fitting a line to (xi, yi) data using this solution is called linear regression.
• The error variance for linear least squares regression is given by:

    sε² = Σεi² / (n − 2)

  where n is the number of (xi, yi) observations and sε is called the standard error of the model.
• The equation has n − 2 in the denominator because two degrees of freedom are consumed by the calculation of the regression coefficients a and b from the experimental data.
Linear Regression: Simple Structure
• Think of the error variance sε² in the regression problem in the same way as you think of the sample variance s² used to quantify the amount of variation in simple measurement data.
• Whereas the sample variance characterizes the scatter of observations about a single value ȳ, the error variance in the regression problem characterizes the distribution of values about the line ŷi = a + bxi.
• sε² and s² are close cousins; they are both measures of the errors associated with different models for different kinds of data.

REGRESSION COEFFICIENTS
• With the condition that determines the a and b values providing the best-fit line for the (xi, yi) data, namely the minimization of Σεi², we proceed to determine a and b in a more rigorous manner.
REGRESSION COEFFICIENTS
• The calculus method that determines the unique values of a and b that minimize Σεi² requires that we solve the simultaneous equations:

    ∂/∂a Σ(i=1 to n) εi² = 0
    ∂/∂b Σ(i=1 to n) εi² = 0
REGRESSION COEFFICIENTS
• From these equations, the resulting values of a and b are best expressed in terms of sums of squares:

    b = SSxy / SSx
    a = ȳ − b·x̄

  where

    SSx  = Σ(i=1 to n) (xi − x̄)²
    SSy  = Σ(i=1 to n) (yi − ȳ)²
    SSxy = Σ(i=1 to n) (xi − x̄)(yi − ȳ)

• SSx and SSy are just the sums of squares required to determine the variances of the x and y values.
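The sums-of-squares formulas above can be sketched directly. This example uses invented data (not from the book) and cross-checks the hand formulas against numpy's least-squares fit:

```python
import numpy as np

# Sketch of the sums-of-squares formulas for the regression coefficients.
# The data are made up for illustration (roughly y = 2x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

xbar, ybar = x.mean(), y.mean()
SSx  = np.sum((x - xbar) ** 2)
SSy  = np.sum((y - ybar) ** 2)
SSxy = np.sum((x - xbar) * (y - ybar))

b = SSxy / SSx          # slope
a = ybar - b * xbar     # intercept: forces (xbar, ybar) onto the line

# np.polyfit solves the same least-squares problem directly.
b_np, a_np = np.polyfit(x, y, 1)
print(round(a, 4), round(b, 4))
```

Note that a = ȳ − b·x̄ guarantees the point (x̄, ȳ) falls on the fitted line, which is made explicit in a later slide.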
REGRESSION COEFFICIENTS
• Similarly, using the sum of squares notation, we can write the error sum of squares for the regression as:

    SSε = SSy − b·SSxy

  and the standard error as:

    sε = √(SSε / (n − 2))

• Another important implication of the equations b = SSxy/SSx and a = ȳ − b·x̄ is that the point (x̄, ȳ) falls on the best-fit line. This is just a consequence of the way the sums of squares are calculated.
REGRESSION COEFFICIENTS
• Stat > Regression > Fitted Line Plot
• sε² = SSE / (n − 2), a = ȳ − b·x̄, ŷ = a + bx
LINEAR REGRESSION ASSUMPTIONS
• A valid linear regression model requires that five conditions are satisfied:
  1. The values of x are determined without error.
  2. The εi are normally distributed with mean με = 0 for all values of x.
  3. The distribution of the εi has constant variance σε² for all values of x within the range of experimentation (that is, homoscedasticity).
  4. The εi are independent of each other.
  5. The linear model provides a good fit to the data.
HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS
• The values of the intercept and slope a and b found with the equations

    b = SSxy / SSx
    a = ȳ − b·x̄

  are actually estimates for the true parameters α and β.
(Figure: hypothetical distributions for α and β.)
HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS
• Although linear regression analysis will always return a and b values, it is possible that one or both of these values could be statistically insignificant. We require a formal method of testing α and β to see if they are different from zero. The hypotheses for these tests are:

    H0: α = 0 vs. H1: α ≠ 0
    H0: β = 0 vs. H1: β ≠ 0

• Both of these distributions follow Student's t distribution with degrees of freedom equal to the error degrees of freedom.
• To perform these tests we need some idea of the amount of variability present in the estimates of α and β.
HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS
• Estimates of the variances of a and b are given by:

    sa² = sε² (1/n + x̄²/SSx)
    sb² = sε² / SSx

• The hypothesis tests can be performed using one-sample t tests with dfε = n − 2 degrees of freedom and the t statistics:

    t = a / sa    and    t = b / sb
HYPOTHESIS TESTS FOR REGRESSION COEFFICIENTS
• The (1 − α)100% confidence intervals for α and β are determined from:

    P(a − tα/2·sa < α < a + tα/2·sa) = 1 − α
    P(b − tα/2·sb < β < b + tα/2·sb) = 1 − α

  with n − 2 degrees of freedom.
• It is very important to realize that the variances of a and b are proportional to the standard error of the fit sε. This means that if there are any uncontrolled variables in the experiment that cause the standard error to increase, there will be a corresponding increase in the standard deviations of the regression coefficients. This could make the regression coefficients disappear into the noise.
• Always keep in mind that the model's ability to predict the regression coefficients depends on the size of the standard error. Take care to remove, control, or account for extraneous variation so that you get the best predictions from your models with the least effort.
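A minimal sketch of the coefficient tests, using invented data and the standard formulas sb² = sε²/SSx and sa² = sε²(1/n + x̄²/SSx) quoted above; the value 2.447 is the tabled t(0.025, 6 df) critical value:

```python
import numpy as np

# Sketch: t statistics and a confidence interval for the slope.
# Illustrative data, not from the book.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.9, 4.1, 4.8, 6.3, 7.1, 8.2, 8.8])

n = len(x)
xbar, ybar = x.mean(), y.mean()
SSx = np.sum((x - xbar) ** 2)
SSxy = np.sum((x - xbar) * (y - ybar))
b = SSxy / SSx
a = ybar - b * xbar

resid = y - (a + b * x)
se2 = np.sum(resid ** 2) / (n - 2)   # error variance, df = n - 2

sb = np.sqrt(se2 / SSx)
sa = np.sqrt(se2 * (1.0 / n + xbar ** 2 / SSx))

t_a = a / sa   # tests H0: alpha = 0
t_b = b / sb   # tests H0: beta  = 0

# 95% CI for beta: t(0.025, n-2) = 2.447 for 6 df (t table value).
ci_b = (b - 2.447 * sb, b + 2.447 * sb)
print(round(t_b, 3), [round(v, 3) for v in ci_b])
```

With this nearly linear data the slope's t statistic is far beyond 2.447, so H0: β = 0 would be rejected.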
CONFIDENCE LIMITS FOR THE REGRESSION LINE
• The true slope and intercept of a regression line are not exactly known.
• The (1 − α)100% confidence interval for the regression line is given by:

    ŷ ± tα/2 · sε · √(1/n + (x − x̄)²/SSx)

• Minitab: Stat > Regression > Fitted Line Plot. Select Display Confidence Bands in the Options menu to add the confidence limits to the fitted line plot.
PREDICTION LIMITS FOR THE OBSERVED VALUES
• The prediction interval provides prediction bounds for individual observations. The width of the prediction interval combines the uncertainty of the position of the true line, as described by the confidence interval, with the scatter of points about the line, as measured by the standard error:

    ŷ ± tα/2 · sε · √(1 + 1/n + (x − x̄)²/SSx)

  where tα/2 has dfε = n − 2 degrees of freedom.
• Minitab: Stat > Regression > Fitted Line Plot. Select Display Prediction Bands in the Options menu.
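The two band formulas differ only by the extra "1 +" under the square root, which is why prediction bands are always wider than confidence bands. A sketch on invented data (t = 2.447 is the tabled t(0.025, 6 df) value):

```python
import numpy as np

# Sketch of the confidence and prediction half-widths at a point x0:
#   CI: t * se * sqrt(1/n + (x0-xbar)^2/SSx)
#   PI: t * se * sqrt(1 + 1/n + (x0-xbar)^2/SSx)
# Illustrative data, not from the book.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.9, 4.1, 4.8, 6.3, 7.1, 8.2, 8.8])

n = len(x)
xbar = x.mean()
SSx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - y.mean())) / SSx
a = y.mean() - b * xbar
se = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

t = 2.447  # t(0.025, 6 df) from a t table
x0 = 4.5
yhat = a + b * x0
half_ci = t * se * np.sqrt(1.0 / n + (x0 - xbar) ** 2 / SSx)
half_pi = t * se * np.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / SSx)
print(round(yhat, 3), round(half_ci, 3), round(half_pi, 3))
```

Both bands are narrowest at x0 = x̄ and flare out as x0 moves away from the center of the data, matching the bow-tie shape Minitab draws.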
CORRELATION: COEFFICIENT OF DETERMINATION (r²) AND CORRELATION COEFFICIENT (r)
• A comprehensive statistic is required to measure the fraction of the total variation in the response y that is explained by the regression model.
• The total variation in y, taken relative to ȳ, is given by SSy = Σ(yi − ȳ)², but SSy is partitioned into two terms: one that accounts for the amount of variation explained by the straight-line model, given by SSregression, and another that accounts for the unexplained error variation, given by:

    SSε = Σ(i=1 to n) εi² = Σ(yi − ŷi)²
CORRELATION
• The three quantities are related by:

    SSy = SSregression + SSε

• Consequently, the fraction of SSy explained by the model is:

    r² = SSregression / SSy

  where r² is called the coefficient of determination.
• The correlation coefficient r is given by the square root of the coefficient of determination r², with an appropriate plus or minus sign.
• If two measures have a linear relationship, it is possible to describe how strong the relationship is by means of a statistic called the correlation coefficient r.
• The symbol for the correlation coefficient is r; the symbol for the corresponding population parameter is ρ (the Greek letter "rho").
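The partition SSy = SSregression + SSε can be verified numerically on any fitted line. A sketch with invented data:

```python
import numpy as np

# Sketch: verify SSy = SS_regression + SS_error and compute
# r^2 = SS_regression / SSy on illustrative data.
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
y = np.array([1.1, 2.0, 3.2, 3.9, 5.1, 5.8])

xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar
yhat = a + b * x

SSy   = np.sum((y - ybar) ** 2)      # total variation
SSreg = np.sum((yhat - ybar) ** 2)   # explained by the line
SSe   = np.sum((y - yhat) ** 2)      # residual (unexplained)

r2 = SSreg / SSy
print(round(r2, 4))
```

The partition identity holds exactly for least-squares fits because the residuals are orthogonal to the fitted values.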
CORRELATION COEFFICIENT (r)
• The basic formula for the Pearson product-moment correlation coefficient is:

    r = SSxy / √(SSx · SSy)
CORRELATION COEFFICIENT (r)
• Given the following data set, find r (Example 11.10, pg 435):

    x      y        x      y
    0.414  29186    0.548  67095
    0.383  29266    0.581  85156
    0.399  26215    0.557  69571
    0.402  30162    0.550  84160
    0.442  38867    0.531  73466
    0.422  37831    0.550  78610
    0.466  44576    0.556  67657
    0.500  46097    0.523  74017
    0.514  59698    0.602  87291
    0.530  67705    0.569  86836
    0.569  66088    0.544  82540
    0.558  78486    0.557  81699
    0.577  89869    0.530  82096
    0.572  77369    0.547  75657
    0.548  67095    0.585  80490

PEARSON'S PRODUCT-MOMENT CORRELATION: The Coefficient of Determination r²
• The coefficient of determination finds numerous applications in regression and multiple regression problems.
• Since SSregression is bounded by 0 ≤ SSregression ≤ SSy, there are corresponding bounds on the coefficient of determination given by 0 ≤ r² ≤ 1.
• When r² = 0 the regression model has little value because very little of the variation in y is attributable to its dependence on x. When r² = 1 the regression model almost completely explains all of the variation in the response; that is, x almost perfectly predicts y.
• We're usually hoping for r² = 1, but this rarely happens.
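The Example 11.10 data above can be run through the sums-of-squares formula for r. The arrays below are transcribed from the slide table (left column pairs first, then right column pairs):

```python
import numpy as np

# Sketch: r for the Example 11.10 data via r = SSxy / sqrt(SSx * SSy),
# cross-checked against numpy's built-in correlation.
x = np.array([0.414, 0.383, 0.399, 0.402, 0.442, 0.422, 0.466, 0.500,
              0.514, 0.530, 0.569, 0.558, 0.577, 0.572, 0.548, 0.548,
              0.581, 0.557, 0.550, 0.531, 0.550, 0.556, 0.523, 0.602,
              0.569, 0.544, 0.557, 0.530, 0.547, 0.585])
y = np.array([29186, 29266, 26215, 30162, 38867, 37831, 44576, 46097,
              59698, 67705, 66088, 78486, 89869, 77369, 67095, 67095,
              85156, 69571, 84160, 73466, 78610, 67657, 74017, 87291,
              86836, 82540, 81699, 82096, 75657, 80490], dtype=float)

SSx  = np.sum((x - x.mean()) ** 2)
SSy  = np.sum((y - y.mean()) ** 2)
SSxy = np.sum((x - x.mean()) * (y - y.mean()))

r = SSxy / np.sqrt(SSx * SSy)
print(round(r, 4))
```

The positive sign of r matches the upward trend visible in the scatter of these points.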
Confidence Interval for the Coefficient of Determination r²
• The coefficient of determination r² is a statistic that represents the proportion of the total variation in the values of the variable Y that can be accounted for, or explained by, a linear relationship with the random variable X.
• A different data set of (x, y) values will give a different value of r². The quantity that such r² values estimate is the true population coefficient of determination ρ², which is a parameter.

Confidence Interval for the Correlation Coefficient (r)
• When the distribution of the regression model residuals is normal with constant variance, the distribution of r is complicated, but the distribution of:

    Z = ½ ln((1 + r)/(1 − r))

  is approximately normal with mean:

    μZ = ½ ln((1 + ρ)/(1 − ρ))

  and standard deviation:

    σZ = 1/√(n − 3)

• The transformation of r into Z is called Fisher's Z transformation.
Confidence Interval for the Correlation Coefficient (r)
• This information can be used to construct a confidence interval for the unknown parameter μZ from the statistic r and the sample size n. The confidence interval is:

    Z − zα/2/√(n − 3) < μZ < Z + zα/2/√(n − 3)

LINEAR REGRESSION WITH MINITAB
• MINITAB provides two basic functions for performing linear regression.
1. The Stat > Regression > Fitted Line Plot menu is the best place to start to evaluate the quality of the fitted function. It includes a scatter plot of the (x, y) data with the superimposed fitted line, a full ANOVA table, and an abbreviated table of regression coefficients.
LINEAR REGRESSION WITH MINITAB
2. Stat > Regression > Regression menu. The first part of the output is a table of the regression coefficients and the corresponding standard deviations, t values, and p values. The second part is the ANOVA table, which summarizes the statistics required to determine the regression coefficients and the summary statistics like r, r², r²adj, and sε.
• There is a p value reported for the slope of the regression line in the table of regression coefficients and another p value reported in the ANOVA table for the ANOVA F test. These two p values are numerically identical, and not just by coincidence: there is a special relationship between the t and F distributions when the F distribution has one numerator degree of freedom.
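The relationship behind the matching p values is t² = F for simple linear regression: the square of the slope's t statistic equals the ANOVA F statistic with one numerator degree of freedom. A sketch on invented data:

```python
import numpy as np

# Sketch of the t-F relationship: for simple linear regression,
# t^2 (slope test) equals F (ANOVA, 1 and n-2 df).
# Illustrative data, not from the book.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.2, 2.9, 4.4, 4.7, 6.1, 6.6, 7.9])

n = len(x)
xbar, ybar = x.mean(), y.mean()
SSx = np.sum((x - xbar) ** 2)
b = np.sum((x - xbar) * (y - ybar)) / SSx
a = ybar - b * xbar
yhat = a + b * x

SSreg = np.sum((yhat - ybar) ** 2)   # regression sum of squares
SSe = np.sum((y - yhat) ** 2)        # error sum of squares
MSe = SSe / (n - 2)                  # mean square error

F = (SSreg / 1) / MSe                # ANOVA F with 1 and n-2 df
t = b / np.sqrt(MSe / SSx)           # slope t statistic, n-2 df

print(round(t ** 2, 6), round(F, 6))
```

The identity falls out algebraically because SSregression = b²·SSx, so the two tests always agree, which is why Minitab reports the same p value in both places.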
POLYNOMIAL MODELS
• The general form of a polynomial model is:

    ŷ = a + b1·x + b2·x² + … + bp·x^p

  where the polynomial is said to be of order p.
• The regression coefficients a, b1, …, bp are determined using the same algorithm that was used for the simple linear model; the error sum of squares is simultaneously minimized with respect to the regression coefficients. The family of equations that must be solved to determine the regression coefficients is nightmarish, but most of the good statistical software packages have this capability.
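In software the "nightmarish" system is solved in one call. A sketch of a quadratic fit on invented data:

```python
import numpy as np

# Sketch: fitting y-hat = a + b1*x + b2*x^2 by least squares; the same
# error-sum-of-squares minimization as the straight line, just with an
# extra column for x^2. Data are invented (roughly 1 + 0.8x + x^2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.8, 7.2, 13.1, 20.9, 31.2])

# np.polyfit returns coefficients from highest order down: [b2, b1, a].
b2, b1, a = np.polyfit(x, y, 2)
yhat = a + b1 * x + b2 * x ** 2
sse = np.sum((y - yhat) ** 2)
print(round(a, 3), round(b1, 3), round(b2, 3), round(sse, 4))
```

The worksheet view is the same as Minitab's: one column per power of x, then an ordinary least-squares solve.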
POLYNOMIAL MODELS
• Although high-order polynomial models can fit the (x, y) data very well, they should be of the lowest order possible that accurately represents the relationship between y and x. There are no clear guidelines on what order might be necessary, but watch the significance (that is, the p values) of the various regression coefficients to confirm that all of the terms are contributing to the model. Polynomial models must also be hierarchical; that is, a model of order p must contain all possible lower-order terms.
• Because of their complexity, it's important to summarize the performance of polynomial models using r²adjusted instead of r². In some cases, when there are relatively few error degrees of freedom after fitting a large polynomial model, the r² value could be misleadingly large, whereas r²adjusted will be much lower but more representative of the true performance of the model.
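The r² inflation can be seen directly. This sketch uses the common adjusted-r² formula 1 − (1 − r²)(n − 1)/(n − p − 1) for a model with p fitted terms beyond the intercept, on invented nearly linear data:

```python
import numpy as np

# Sketch: r^2 vs. adjusted r^2. A 5th-order polynomial on 7 nearly
# linear points leaves only 1 error degree of freedom, so r^2 flatters
# the fit while the adjusted value penalizes the wasted terms.
def r2_and_adjusted(y, yhat, p):
    ybar = np.mean(y)
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - ybar) ** 2)
    n = len(y)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.2, 2.9, 4.3, 4.8, 6.2, 6.8])

coef5 = np.polyfit(x, y, 5)            # 5th-order polynomial (p = 5)
r2_5, adj_5 = r2_and_adjusted(y, np.polyval(coef5, x), 5)

coef1 = np.polyfit(x, y, 1)            # straight line (p = 1)
r2_1, adj_1 = r2_and_adjusted(y, np.polyval(coef1, x), 1)

print(round(r2_5, 4), round(adj_5, 4), round(r2_1, 4), round(adj_1, 4))
```

The higher-order model always shows the larger raw r², but the adjusted value discounts it for the degrees of freedom it consumes.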
POLYNOMIAL MODELS
• Example: Fit the following data with an appropriate model and use scatter plots and residuals diagnostic plots to check for lack of fit.
• Solution: examine a scatter plot of the data, then fit a quadratic model with Stat > Regression > Fitted Line Plot, (x, y), Quadratic, and check the residuals diagnostic plots.
Chapter 12: Multiple Regression
Multiple Regression
• When a response has n quantitative predictors, such as y(x1, x2, …, xn), the model for y must be created by multiple regression. In multiple regression each predictive term in the model has its own regression coefficient. The simplest multiple regression model contains a linear term for each predictor:

    ŷ = a + b1·x1 + b2·x2 + … + bn·xn

• This equation has the same basic structure as the polynomial model and, in fact, the two models are fitted and analyzed in much the same way. Where the worksheet to fit the polynomial model requires n columns, one for each power of x, the worksheet to fit the multiple regression model requires n columns to account for each of the n predictors. The same regression methods are used to analyze both problems.
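The worksheet analogy can be sketched directly: one column per predictor plus a column of 1s for the intercept, then a single least-squares solve. The numbers below are invented stand-ins for the bedrooms/baths example, not the book's table:

```python
import numpy as np

# Sketch of the multiple-regression fit y-hat = a + b1*x1 + b2*x2 using
# a worksheet-style layout. Data are hypothetical (prices in $1000s).
bedrooms = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 3.0, 4.0])
baths    = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0])
price    = np.array([95.0, 112.0, 126.0, 148.0, 164.0, 189.0,
                     123.0, 151.0])

# Design matrix: a column of 1s for the intercept plus one column
# per predictor, exactly like the Minitab worksheet.
X = np.column_stack([np.ones_like(bedrooms), bedrooms, baths])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
a, b1, b2 = coef

yhat = X @ coef
print([round(v, 2) for v in (a, b1, b2)])
```

Adding more predictors just means adding more columns to X; the solve is unchanged, which is the point the slide makes about polynomial and multiple regression sharing one algorithm.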
Multiple Regression
• Frequently, the simple linear model does not fit the data and a more complex model is required. The terms that must be added to the model to achieve a good fit might involve interactions, quadratic terms, or terms of even higher order. Such models have the basic form:

    ŷ = a + b1·x1 + b2·x2 + b12·x1·x2 + b11·x1² + b22·x2² + …

• PROBLEM: A real-estate executive would like to be able to predict the cost of a house in a housing development on the basis of the number of bedrooms and bathrooms in the house. (Selling prices are in thousands of dollars.)
Multiple Regression
• The following first-order model is assumed to connect the selling price of the home with the number of bedrooms and the number of baths. The dependent variable is represented by y, and the independent variables are x1, the number of bedrooms, and x2, the number of baths:

    y = a + b1·x1 + b2·x2 + ε

• MINITAB SOLUTION: Stat > Regression > Regression.
Multiple Regression
• MINITAB Output: Stat > Regression > Regression.
Multiple Regression
• PROBLEM: The following table contains data from a blood pressure study on fifty middle-aged men. Systolic is the systolic blood pressure, Age is the age of the individual, Weight is the weight in pounds, Parents indicates whether the individual's parents had high blood pressure (0 means neither parent has high blood pressure, 1 means one parent has high blood pressure, and 2 means both mother and father have high blood pressure), Med is the number of hours per month that the individual meditates, and TypeA is a measure of the degree to which the individual exhibits type A personality behavior, as determined from a form that the person fills out. Systolic is the dependent variable and the other five variables are the independent variables.
Multiple Regression
• MINITAB SOLUTION: Stat > Regression > Regression.
• Model:
    y  = Systolic
    x1 = Age
    x2 = Weight
    x3 = Parents
    x4 = Med
    x5 = TypeA
Multiple Regression
• MINITAB SOLUTION: Stat > Regression > Regression. The five hypothesis tests suggest Weight and TypeA should be kept and the other three variables thrown out.

Checking the Overall Utility of a Model
• Purpose: check whether the model is useful and control your α value. Rather than conduct a large group of t tests on the betas and increase the probability of making a Type I error, make one test and know that α = 0.05. The F test is such a test. It is contained in the analysis of variance associated with the analysis. The F test tests the following hypothesis associated with the blood pressure model:

    H0: β1 = β2 = β3 = β4 = β5 = 0
    H1: at least one βi ≠ 0
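The overall-utility F statistic is F = (SSR/k) / (SSE/(n − k − 1)), where k is the number of fitted slopes. A sketch on an invented two-predictor data set (not the blood-pressure study):

```python
import numpy as np

# Sketch of the overall-utility F test with k = 2 predictors.
# Data are invented: y is close to 1 + 1.5*x1 + 0.5*x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y  = np.array([3.6, 4.4, 7.4, 8.6, 11.6, 12.4, 15.4, 16.6])

X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ coef

n, k = len(y), 2
ybar = y.mean()
SSR = np.sum((yhat - ybar) ** 2)     # regression sum of squares
SSE = np.sum((y - yhat) ** 2)        # error sum of squares
F = (SSR / k) / (SSE / (n - k - 1))  # F with k and n-k-1 df
print(round(F, 2))
```

Because the response really does depend on the predictors here, F comes out far above any reasonable critical value, so H0 (all βi = 0) would be rejected, mirroring the F = 18.50 conclusion on the next slide.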
Multiple Regression
• MINITAB SOLUTION: Stat > Regression > Regression.
• Interpretation: F = 18.50 with a p value of 0.000, so the null hypothesis should be rejected; the conclusion is that at least one βi ≠ 0. This F test says that the model is useful in predicting systolic blood pressure.

END