©The McGraw-Hill Companies, Inc. 2008
McGraw-Hill/Irwin
One Sample Tests of Hypothesis
Chapter 10
2
GOALS
• Define a hypothesis and hypothesis testing.
• Describe the five-step hypothesis-testing procedure.
• Distinguish between a one-tailed and a two-tailed test of hypothesis.
• Conduct a test of hypothesis about a population mean.
• Conduct a test of hypothesis about a population proportion.
• Define Type I and Type II errors.
• Compute the probability of a Type II error.
3
What is a Hypothesis?
A Hypothesis is a statement about the
value of a population parameter
developed for the purpose of testing.
Examples of hypotheses made about a
population parameter are:
– The mean monthly income for systems analysts is
$3,625.
– Twenty percent of all customers at Bovine’s Chop
House return for another meal within a month.
4
What is Hypothesis Testing?
Hypothesis testing is a procedure, based
on sample evidence and probability
theory, used to determine whether the
hypothesis is a reasonable statement
and should not be rejected, or is
unreasonable and should be rejected.
5
Hypothesis Testing Steps
6
Important Things to Remember about H0 and H1
• H0: null hypothesis; H1: alternate hypothesis.
• H0 and H1 are mutually exclusive and collectively exhaustive.
• H0 is always presumed to be true.
• H1 has the burden of proof.
• A random sample (n) is used to “reject H0”.
• If we conclude “do not reject H0”, this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence to reject H0. Rejecting the null hypothesis, then, suggests that the alternate hypothesis may be true.
• Equality is always part of H0 (e.g. “=”, “≥”, “≤”).
• “≠”, “<”, and “>” are always part of H1.
7
How to Set Up a Claim as Hypothesis
• In actual practice, the status quo is set up as H0.
• If the claim is “boastful”, the claim is set up as H1 (we apply the Missouri rule: “show me”). Remember, H1 has the burden of proof.
• In problem solving, look for key words and convert them into symbols. Some key words include: “improved”, “better than”, “as effective as”, “different from”, “has changed”, etc.
8
Left-tail or Right-tail Test?
Keywords                                               Inequality symbol   Part of
Larger (or more) than                                  >                   H1
Smaller (or less)                                      <                   H1
No more than                                           ≤                   H0
At least                                               ≥                   H0
Has increased                                          >                   H1
Is there difference?                                   ≠                   H1
Has not changed                                        =                   H0
Has “improved”, “is better than”, “is more effective”  See notes below     H1
• The direction of the test involving
claims that use the words “has
improved”, “is better than”, and the like
will depend upon the variable being
measured.
• For instance, if the variable involves time for a certain medication to take effect, the words “better”, “improve”, or “more effective” are translated as “<” (less than, i.e. faster relief).
• On the other hand, if the variable refers to a test score, then the words “better”, “improve”, or “more effective” are translated as “>” (greater than, i.e. higher test scores).
9
10
Parts of a Distribution in Hypothesis Testing
11
One-tail vs. Two-tail Test
12
Hypothesis Setups for Testing a Mean (µ)
13
Hypothesis Setups for Testing a
Proportion (π)
14
Testing for a Population Mean with a
Known Population Standard Deviation- Example
Jamestown Steel Company
manufactures and assembles desks
and other office equipment at
several plants in western New York
State. The weekly production of the
Model A325 desk at the Fredonia
Plant follows the normal probability
distribution with a mean of 200 and
a standard deviation of 16.
Recently, because of market
expansion, new production
methods have been introduced and
new employees hired. The vice
president of manufacturing would
like to investigate whether there has
been a change in the weekly
production of the Model A325 desk.
15
Testing for a Population Mean with a
Known Population Standard Deviation- Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: µ = 200
H1: µ ≠ 200
(note: keyword in the problem “has changed”)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use Z-distribution since σ is known
16
Testing for a Population Mean with a
Known Population Standard Deviation- Example
Step 4: Formulate the decision rule.
Reject H0 if |Z| > Zα/2, where Zα/2 = Z.01/2 = 2.58.
$$ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{203.5 - 200}{16/\sqrt{50}} = 1.55 $$
1.55 is not > 2.58
Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not
rejected. We conclude that the population mean is not different from
200. So we would report to the vice president of manufacturing that the
sample evidence does not show that the production rate at the Fredonia
Plant has changed from 200 per week.
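The five steps above can be checked numerically. The following is a minimal sketch, assuming the sample mean of 203.5 desks over n = 50 weeks used in the Step 4 computation; the function name `one_sample_z` is illustrative, not part of the original slides.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(x_bar, mu0, sigma, n):
    """z statistic for a sample mean when the population sigma is known."""
    return (x_bar - mu0) / (sigma / sqrt(n))

alpha = 0.01
z = one_sample_z(x_bar=203.5, mu0=200, sigma=16, n=50)

# Two-tailed test: split alpha between both tails.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```

The critical value is computed from the standard normal distribution rather than read from a table, so it may differ from the tabled 2.58 in the third decimal place.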
17
Testing for a Population Mean with a Known
Population Standard Deviation- Another Example
Suppose in the previous problem the vice
president wants to know whether there has
been an increase in the number of units
assembled. To put it another way, can we
conclude, because of the improved
production methods, that the mean number
of desks assembled in the last 50 weeks was
more than 200?
Recall: σ = 16, n = 50, α = .01
18
Testing for a Population Mean with a Known
Population Standard Deviation- Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: µ ≤ 200
H1: µ > 200
(note: keyword in the problem “an increase”)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use Z-distribution since σ is known
19
Testing for a Population Mean with a Known
Population Standard Deviation- Example
Step 4: Formulate the decision rule.
Reject H0 if Z > Zα, where Zα = Z.01 = 2.33
Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not rejected.
We conclude that the average number of desks assembled in the last
50 weeks is not more than 200
20
Type of Errors in Hypothesis Testing
• Type I Error:
– Rejecting the null hypothesis when it is actually true.
– The probability of a Type I error is denoted by the Greek letter α.
– α is also known as the significance level of a test.
• Type II Error:
– “Accepting” (failing to reject) the null hypothesis when it is actually false.
– The probability of a Type II error is denoted by the Greek letter β.
21
p-Value in Hypothesis Testing
• The p-VALUE is the probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true.
• In testing a hypothesis, we can also compare the p-value with the significance level (α).
• If the p-value < significance level, H0 is rejected; otherwise H0 is not rejected.
22
p-Value in Hypothesis Testing - Example
Recall the last problem where the
hypothesis and decision rules
were set up as:
H0: µ ≤ 200
H1: µ > 200
Reject H0 if Z > Zα
where Z = 1.55 and Zα = 2.33
Reject H0 if p-value < α
p-value = P(Z > 1.55) = 0.0606, and 0.0606 is not < 0.01
Conclude: Fail to reject H0
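The p-value comparison above can be reproduced with the standard normal distribution. This is a sketch under the same assumptions as the desk example (sample mean 203.5, σ = 16, n = 50); the helper name `upper_tail_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_p_value(z):
    """P(Z > z) for a right-tailed test."""
    return 1 - NormalDist().cdf(z)

alpha = 0.01
z = (203.5 - 200) / (16 / sqrt(50))

p_value = upper_tail_p_value(z)          # uses the unrounded z
p_rounded_z = upper_tail_p_value(1.55)   # uses z rounded as on the slide

reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Using the unrounded z gives a p-value slightly different from the table-based figure on the slide; the decision is the same either way.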
23
What does it mean when p-value < α?
If the p-value is less than:
(a) .10, we have some evidence that H0 is not true.
(b) .05, we have strong evidence that H0 is not true.
(c) .01, we have very strong evidence that H0 is not true.
(d) .001, we have extremely strong evidence that H0 is not true.
24
Testing for the Population Mean: Population
Standard Deviation Unknown
• When the population standard deviation (σ) is unknown, the sample standard deviation (s) is used in its place.
• The test statistic follows the t distribution with n − 1 degrees of freedom and is computed using the formula:
$$ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} $$
25
Testing for the Population Mean: Population
Standard Deviation Unknown - Example
The McFarland Insurance Company Claims Department reports the mean
cost to process a claim is $60. An industry comparison showed this
amount to be larger than most other insurance companies, so the
company instituted cost-cutting measures. To evaluate the effect of the
cost-cutting measures, the Supervisor of the Claims Department
selected a random sample of 26 claims processed last month. The
sample information is reported below.
At the .01 significance level, is it reasonable to conclude that the mean cost to process a claim is now less than $60?
26
Testing for the Population Mean: Population
Standard Deviation Unknown - Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: µ ≥ $60
H1: µ < $60
(note: keyword in the problem “now less than”)
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use t-distribution since σ is unknown
27
t-Distribution Table (portion)
28
Testing for the Population Mean: Population
Standard Deviation Unknown – Minitab Solution
29
Testing for the Population Mean: Population
Standard Deviation Unknown - Example
Step 4: Formulate the decision rule.
Reject H0 if t < -tα,n-1, that is, if t < -t.01,25 = -2.485
Step 5: Make a decision and interpret the result.
Because -1.818 does not fall in the rejection region, H0 is not rejected at the
.01 significance level. We have not demonstrated that the cost-cutting
measures reduced the mean cost per claim to less than $60. The difference
of $3.58 ($56.42 - $60) between the sample mean and the hypothesized population mean
could be due to sampling error.
30
Testing for a Population Mean with an Unknown
Population Standard Deviation- Example
The current rate for producing 5 amp fuses at Neary
Electric Co. is 250 per hour. A new machine has
been purchased and installed that, according to the
supplier, will increase the production rate. A sample
of 10 randomly selected hours from last month
revealed the mean hourly production on the new
machine was 256 units, with a sample standard
deviation of 6 per hour.
At the .05 significance level can Neary conclude that
the new machine is faster?
31
Testing for a Population Mean with an Unknown
Population Standard Deviation- Example continued
Step 1: State the null and the alternate hypothesis.
H0: µ ≤ 250; H1: µ > 250
Step 2: Select the level of significance.
It is .05.
Step 3: Find a test statistic. Use the t distribution
because the population standard deviation is not
known and the sample size is less than 30.
32
Testing for a Population Mean with an Unknown
Population Standard Deviation- Example continued
Step 4: State the decision rule.
There are 10 – 1 = 9 degrees of freedom. The null
hypothesis is rejected if t > 1.833.
Step 5: Make a decision and interpret the results.
$$ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{256 - 250}{6/\sqrt{10}} = 3.162 $$
Because 3.162 is greater than 1.833, the null hypothesis is rejected. The mean number produced is
more than 250 per hour.
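The fuse example can be verified with a few lines of arithmetic. This is a minimal sketch: the critical value 1.833 is taken from the decision rule on the slide (the standard library has no t distribution), and the function name `one_sample_t` is illustrative.

```python
from math import sqrt

def one_sample_t(x_bar, mu0, s, n):
    """t statistic for a sample mean when sigma is unknown (df = n - 1)."""
    return (x_bar - mu0) / (s / sqrt(n))

T_CRIT = 1.833  # one-tailed, alpha = .05, df = 9, as stated in the decision rule

t = one_sample_t(x_bar=256, mu0=250, s=6, n=10)
reject_h0 = t > T_CRIT

print(f"t = {t:.3f}, reject H0: {reject_h0}")
```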
33
Tests Concerning Proportion
• A Proportion is the fraction or percentage that indicates the part of the population or sample having a particular trait of interest.
• The sample proportion is denoted by p and is found by x/n.
• The test statistic is computed as follows:
34
Assumptions in Testing a Population Proportion
using the z-Distribution
• A random sample is chosen from the population.
• It is assumed that the binomial assumptions discussed in Chapter 6 are met:
(1) the sample data collected are the result of counts;
(2) the outcome of an experiment is classified into one of two mutually exclusive categories: a “success” or a “failure”;
(3) the probability of a success is the same for each trial; and
(4) the trials are independent.
• The test we will conduct shortly is appropriate when both nπ and n(1 − π) are at least 5.
• When the above conditions are met, the normal distribution can be used as an approximation to the binomial distribution.
35
Test Statistic for Testing a Single
Population Proportion
$$ z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} $$
where p is the sample proportion, π is the hypothesized population proportion, and n is the sample size.
36
Test Statistic for Testing a Single
Population Proportion - Example
Suppose prior elections in a certain state indicated
it is necessary for a candidate for governor to
receive at least 80 percent of the vote in the
northern section of the state to be elected. The
incumbent governor is interested in assessing
his chances of returning to office and plans to
conduct a survey of 2,000 registered voters in
the northern section of the state. Using the
hypothesis-testing procedure, assess the
governor’s chances of reelection.
37
Test Statistic for Testing a Single
Population Proportion - Example
Step 1: State the null hypothesis and the alternate
hypothesis.
H0: π ≥ .80
H1: π < .80
(note: keyword in the problem “at least”)
Step 2: Select the level of significance.
α = 0.05
Step 3: Select the test statistic.
Use Z-distribution since the assumptions are met
and nπ and n(1 − π) ≥ 5
38
Testing for a Population Proportion - Example
Step 4: Formulate the decision rule.
Reject H0 if Z < -Zα, that is, if Z < -Z.05 = -1.65
Step 5: Make a decision and interpret the result.
The computed value of z (-2.80) is in the rejection region, so the null hypothesis is rejected
at the .05 level. The difference of 2.5 percentage points between the sample percent (77.5
percent) and the hypothesized population percent (80) is statistically significant. The
evidence at this point does not support the claim that the incumbent governor will return to
the governor’s mansion for another four years.
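The proportion test can be sketched as follows, using the sample proportion of 77.5 percent and the survey size of 2,000 from the example. The function name `one_proportion_z` is an illustrative choice, and the check that nπ and n(1 − π) are at least 5 mirrors the assumption listed earlier.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(p, pi0, n):
    """z statistic for a sample proportion p against a hypothesized proportion pi0."""
    # Normal approximation requires n*pi0 and n*(1 - pi0) to be at least 5.
    if n * pi0 < 5 or n * (1 - pi0) < 5:
        raise ValueError("normal approximation to the binomial is not appropriate")
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

alpha = 0.05
z = one_proportion_z(p=0.775, pi0=0.80, n=2000)

# Left-tailed test: the rejection region lies below the critical value.
z_crit = NormalDist().inv_cdf(alpha)
reject_h0 = z < z_crit

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```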
39
Type II Error
• Recall that a Type I Error is rejecting the null hypothesis when it is actually true. Its probability, the level of significance, is denoted by the Greek letter α.
• A Type II Error is “accepting” (failing to reject) the null hypothesis when it is actually false. Its probability is denoted by the Greek letter β.
40
Type II Error - Example
A manufacturer purchases steel bars to make cotter
pins. Past experience indicates that the mean tensile
strength of all incoming shipments is 10,000 psi and
that the standard deviation, σ, is 400 psi. In order to
make a decision about incoming shipments of steel
bars, the manufacturer set up this rule for the quality-
control inspector to follow: “Take a sample of 100
steel bars. At the .05 significance level if the sample
mean strength falls between 9,922 psi and 10,078
psi, accept the lot. Otherwise the lot is to be
rejected.”
41
Type I and Type II Errors Illustrated
42
Type II Error Computed
43
Type II Errors For Varying Mean Levels
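The probability of a Type II error for the cotter-pin rule can be sketched numerically. The acceptance limits (9,922 to 10,078 psi), σ = 400, and n = 100 come from the example; the alternative true means in the loop are assumptions chosen only for illustration, since the slide's table is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(true_mean, lower=9922, upper=10078, sigma=400, n=100):
    """Probability the sample mean lands in the acceptance region
    when the population mean is actually true_mean."""
    standard_error = sigma / sqrt(n)
    sampling_dist = NormalDist(mu=true_mean, sigma=standard_error)
    return sampling_dist.cdf(upper) - sampling_dist.cdf(lower)

# Sanity check: when H0 is true (mean 10,000), the chance of rejecting is about alpha.
alpha_check = 1 - type_ii_error(10000)

# Hypothetical alternative means (illustrative values only).
for mean in (9880, 9900, 9940, 10060, 10100):
    print(f"true mean {mean}: beta = {type_ii_error(mean):.4f}")
```

Note how β grows as the true mean moves closer to the hypothesized 10,000 psi: small shifts are harder to detect with a fixed sample size.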
44
End of Chapter 10
Two-sample Tests of Hypothesis
Chapter 11
2
GOALS
• Conduct a test of a hypothesis about the difference between two independent population means.
• Conduct a test of a hypothesis about the difference between two population proportions.
• Conduct a test of a hypothesis about the mean difference between paired or dependent observations.
• Understand the difference between dependent and independent samples.
3
Comparing two populations – Some
Examples
• Is there a difference in the mean value of residential real estate sold by male agents and female agents in south Florida?
• Is there a difference in the mean number of defects produced on the day and the afternoon shifts at Kimble Products?
• Is there a difference in the mean number of days absent between young workers (under 21 years of age) and older workers (more than 60 years of age) in the fast-food industry?
• Is there a difference in the proportion of Ohio State University graduates and University of Cincinnati graduates who pass the state Certified Public Accountant Examination on their first attempt?
• Is there an increase in the production rate if music is piped into the production area?
4
Comparing Two Population Means
• No assumptions about the shape of the populations are required.
• The samples are from independent populations.
• The formula for computing the value of z is:
Use if the sample sizes are > 30, or if σ1 and σ2 are known:
$$ z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} $$
Use if the sample sizes are > 30 and σ1 and σ2 are unknown:
$$ z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
5
EXAMPLE 1
The U-Scan facility was recently installed at the Byrne
Road Food-Town location. The store manager would
like to know if the mean checkout time using the
standard checkout method is longer than using the U-
Scan. She gathered the following sample information.
The time is measured from when the customer enters
the line until their bags are in the cart. Hence the time
includes both waiting in line and checking out.
6
EXAMPLE 1 continued
Step 1: State the null and alternate hypotheses.
H0: µS ≤ µU
H1: µS > µU
Step 2: State the level of significance.
The .01 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
Because both sample sizes are larger than 30, we can use the z-distribution
as the test statistic.
7
Example 1 continued
Step 4: State the decision rule.
Reject H0 if Z > Zα
Z > 2.33
8
Example 1 continued
Step 5: Compute the value of z and make a decision
$$ z = \frac{\bar{X}_s - \bar{X}_u}{\sqrt{\dfrac{\sigma_s^2}{n_s} + \dfrac{\sigma_u^2}{n_u}}} = \frac{5.5 - 5.3}{\sqrt{\dfrac{0.40^2}{50} + \dfrac{0.30^2}{100}}} = \frac{0.2}{0.064} = 3.13 $$
The computed value of 3.13 is larger than the
critical value of 2.33. Our decision is to reject the
null hypothesis. The difference of .20 minutes
between the mean checkout times using the
standard method and the U-Scan is too large to
have occurred by chance. We conclude the U-Scan
method is faster.
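The checkout-time comparison can be reproduced directly from the figures in the Step 5 computation (means 5.5 and 5.3 minutes, standard deviations 0.40 and 0.30, sample sizes 50 and 100). This is a sketch; the function name `two_sample_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """z statistic for the difference between two independent sample means."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

alpha = 0.01
z = two_sample_z(mean1=5.5, mean2=5.3, sd1=0.40, sd2=0.30, n1=50, n2=100)

# Right-tailed test: H1 says the standard method takes longer.
z_crit = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z > z_crit

print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```

The unrounded statistic is slightly below the slide's 3.13, which divides by a standard error already rounded to 0.064; the decision does not change.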
9
Two-Sample Tests about Proportions
Here are several examples.
• The vice president of human resources wishes to know whether there is a difference in the proportion of hourly employees who miss more than 5 days of work per year at the Atlanta and the Houston plants.
• General Motors is considering a new design for the Pontiac Grand Am. The design is shown to a group of potential buyers under 30 years of age and another group over 60 years of age. Pontiac wishes to know whether there is a difference in the proportion of the two groups who like the new design.
• A consultant to the airline industry is investigating the fear of flying among adults. Specifically, the consultant wishes to know whether there is a difference in the proportion of men versus women who are fearful of flying.
10
Two Sample Tests of Proportions
• We investigate whether two samples came from populations with an equal proportion of successes.
• The two samples are pooled using the following formula.
11
Two Sample Tests of Proportions
continued
The value of the test statistic is computed from the following
formula.
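The formula images for these two slides did not survive extraction. A common textbook form pools the two samples as (X1 + X2)/(n1 + n2) and uses that pooled proportion in the standard error; the sketch below assumes that form. The counts in the usage lines are hypothetical and are not the data from any example in these slides.

```python
from math import sqrt

def pooled_proportion(x1, n1, x2, n2):
    """Pooled estimate of the common proportion under H0: pi1 = pi2."""
    return (x1 + x2) / (n1 + n2)

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pc = pooled_proportion(x1, n1, x2, n2)
    standard_error = sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
    return (p1 - p2) / standard_error

# Hypothetical counts: 30 of 120 in group 1, 45 of 150 in group 2.
z = two_proportion_z(x1=30, n1=120, x2=45, n2=150)

# Two-tailed test at the .05 level uses +/- 1.96.
reject_h0 = abs(z) > 1.96
print(f"z = {z:.3f}, reject H0: {reject_h0}")
```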
12
Two Sample Tests of Proportions -
Example
Manelli Perfume Company recently developed a new fragrance that
it plans to market under the name Heavenly. A number of market
studies indicate that Heavenly has very good market potential. The
Sales Department at Manelli is particularly interested in whether
there is a difference in the proportions of younger and older women
who would purchase Heavenly if it were marketed. There are two
independent populations, a population consisting
of the younger women and a population consisting of the older
women. Each sampled woman will be asked to smell Heavenly and
indicate whether she likes the fragrance well enough to purchase a
bottle.
13
Two Sample Tests of Proportions -
Example
Step 1: State the null and alternate hypotheses.
H0: π1 = π2
H1: π1 ≠ π2
Step 2: State the level of significance.
The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
We will use the z-distribution
14
Two Sample Tests of Proportions -
Example
Step 4: State the decision rule.
Reject H0 if Z > Zα/2 or Z < -Zα/2
Z > 1.96 or Z < -1.96
15
Step 5: Compute the value of z and make a decision
The computed value of 2.21 is in the area of rejection. Therefore, the null hypothesis is
rejected at the .05 significance level. To put it another way, we reject the null hypothesis
that the proportion of young women who would purchase Heavenly is equal to the
proportion of older women who would purchase Heavenly.
Two Sample Tests of Proportions -
Example
16
Two Sample Tests of Proportions –
Example (Minitab Solution)
17
Comparing Population Means with Unknown
Population Standard Deviations (the Pooled t-test)
The t distribution is used as the test statistic if one
or more of the samples have fewer than 30
observations. The required assumptions are:
1. Both populations must follow the normal
distribution.
2. The populations must have equal standard
deviations.
3. The samples are from independent populations.
18
Small sample test of means continued
Finding the value of the test
statistic requires two
steps.
1. Pool the sample standard
deviations.
2. Use the pooled standard
deviation in the formula.
$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$
$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} $$
19
Owens Lawn Care, Inc., manufactures and assembles
lawnmowers that are shipped to dealers throughout the
United States and Canada. Two different procedures
have been proposed for mounting the engine on the
frame of the lawnmower. The question is: Is there a
difference in the mean time to mount the engines on the
frames of the lawnmowers? The first procedure was
developed by longtime Owens employee Herb Welles
(designated as procedure 1), and the other procedure
was developed by Owens Vice President of Engineering
William Atkins (designated as procedure 2). To evaluate
the two methods, it was decided to conduct a time and
motion study.
A sample of five employees was timed using the Welles
method and six using the Atkins method. The results, in
minutes, are shown on the right.
Is there a difference in the mean mounting times? Use
the .10 significance level.
Comparing Population Means with Unknown
Population Standard Deviations (the Pooled t-test)
20
Step 1: State the null and alternate hypotheses.
H0: µ1 = µ2
H1: µ1 ≠ µ2
Step 2: State the level of significance. The .10 significance level is
stated in the problem.
Step 3: Find the appropriate test statistic.
Because the population standard deviations are not known but are
assumed to be equal, we use the pooled t-test.
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
21
Step 4: State the decision rule.
Reject H0 if t > tα/2,n1+n2-2 or t < -tα/2,n1+n2-2
t > t.05,9 or t < -t.05,9
t > 1.833 or t < -1.833
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
22
Step 5: Compute the value of t and make a decision
(a) Calculate the sample standard deviations
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
23
Step 5: Compute the value of t and make a decision
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
t = -0.662
The decision is not to reject
the null hypothesis, because
-0.662 falls in the region
between -1.833 and 1.833.
We conclude that the sample does not
show a difference in the mean times
to mount the engine on the
frame using the two methods.
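The two-step pooled computation can be sketched in code. The slide's data table is not reproduced in this transcript, so the mounting times below are assumed values for five Welles and six Atkins observations; they are labeled as assumptions, although they do give a statistic close to the -0.662 reported above. The critical value 1.833 is taken from the decision rule.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Step 1: pool the sample variances, weighting each by its degrees of freedom.
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    # Step 2: use the pooled variance in the t formula.
    t = (mean(sample1) - mean(sample2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Assumed mounting times in minutes (not taken from the transcript).
welles = [2, 4, 9, 3, 2]
atkins = [3, 7, 5, 8, 4, 3]

t, df = pooled_t(welles, atkins)
T_CRIT = 1.833  # two-tailed, alpha = .10, df = 9, from the decision rule
reject_h0 = abs(t) > T_CRIT

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```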
24
Comparing Population Means with Unknown Population
Standard Deviations (the Pooled t-test) - Example
25
Comparing Population Means with Unequal
Population Standard Deviations
If it is not reasonable to assume the
population standard deviations are
equal, then we compute the t-
statistic shown on the right.
The sample standard deviations s1 and
s2 are used in place of the
respective population standard
deviations.
In addition, the degrees of freedom are
adjusted downward by a rather
complex approximation formula.
The effect is to reduce the number
of degrees of freedom in the test,
which will require a larger value of
the test statistic to reject the null
hypothesis.
26
Comparing Population Means with Unequal
Population Standard Deviations - Example
Personnel in a consumer testing laboratory are evaluating the absorbency of
paper towels. They wish to compare a set of store brand towels to a similar
group of name brand ones. For each brand they dip a ply of the paper into a
tub of fluid, allow the paper to drain back into the vat for two minutes, and
then evaluate the amount of liquid the paper has taken up from the vat. A
random sample of 9 store brand paper towels absorbed the following
amounts of liquid in milliliters.
8 8 3 1 9 7 5 5 12
An independent random sample of 12 name brand towels absorbed the
following amounts of liquid in milliliters:
12 11 10 6 8 9 9 10 11 9 8 10
Use the .10 significance level and test if there is a difference in the mean
amount of liquid absorbed by the two types of paper towels.
27
The following dot plot provided by MINITAB shows the
variances to be unequal.
Comparing Population Means with Unequal
Population Standard Deviations - Example
28
Step 1: State the null and alternate hypotheses.
H0: µ1 = µ2
H1: µ1 ≠ µ2
Step 2: State the level of significance.
The .10 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
We will use unequal variances t-test
Comparing Population Means with Unequal
Population Standard Deviations - Example
29
Step 4: State the decision rule.
Reject H0 if
t > tα/2,d.f. or t < -tα/2,d.f.
t > t.05,10 or t < -t.05,10
t > 1.812 or t < -1.812
Step 5: Compute the value of t
and make a decision
The computed value of t is less than the lower critical value, so our
decision is to reject the null hypothesis. We conclude that the
mean absorption rate for the two towels is not the same.
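The unequal-variances test can be worked through with the paper towel data given in the problem. The degrees-of-freedom expression below is the usual Welch-Satterthwaite approximation, which is an assumption about what the slides call the "rather complex approximation formula"; rounding down gives the 10 degrees of freedom used in the decision rule. The critical value 1.812 is taken from the slide.

```python
from math import sqrt, floor
from statistics import mean, variance

def unequal_variance_t(sample1, sample2):
    """t statistic and approximate degrees of freedom without pooling variances."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1) / n1, variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

store_brand = [8, 8, 3, 1, 9, 7, 5, 5, 12]
name_brand = [12, 11, 10, 6, 8, 9, 9, 10, 11, 9, 8, 10]

t, df = unequal_variance_t(store_brand, name_brand)
df_rounded_down = floor(df)

T_CRIT = 1.812  # two-tailed, alpha = .10, df = 10, from the decision rule
reject_h0 = abs(t) > T_CRIT

print(f"t = {t:.3f}, df = {df:.2f} (use {df_rounded_down}), reject H0: {reject_h0}")
```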
Comparing Population Means with Unequal
Population Standard Deviations - Example
30
Minitab
31
Two-Sample Tests of Hypothesis:
Dependent Samples
Dependent samples are samples that are paired or
related in some fashion.
For example:
– If you wished to buy a car you would look at the
same car at two (or more) different dealerships
and compare the prices.
– If you wished to measure the effectiveness of a
new diet you would weigh the dieters at the start
and at the finish of the program.
32
Hypothesis Testing Involving
Paired Observations
Use the following test when the samples are
dependent:
$$ t = \frac{\bar{d}}{s_d/\sqrt{n}} $$
Where
d̄ is the mean of the differences
sd is the standard deviation of the differences
n is the number of pairs (differences)
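The paired statistic reduces to a one-sample t-test on the differences, which the sketch below makes explicit. The ten differences are assumed values for illustration (in $000); they are not taken from this transcript, whose data tables are not reproduced.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(differences):
    """t statistic for paired observations, with n - 1 degrees of freedom."""
    n = len(differences)
    d_bar = mean(differences)   # mean of the differences
    s_d = stdev(differences)    # sample standard deviation of the differences
    return d_bar / (s_d / sqrt(n)), n - 1

# Assumed differences between two paired measurements (illustrative only).
differences = [7, 5, 12, 2, 7, 7, 4, -5, 3, 4]

t, df = paired_t(differences)
print(f"t = {t:.3f}, df = {df}")
```

Working on the differences removes the pair-to-pair variation, which is why dependent samples are not analyzed with the two-sample formulas.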
33
Nickel Savings and Loan wishes to
compare the two companies it
uses to appraise the value of
residential homes. Nickel
Savings selected a sample of
10 residential properties and
scheduled both firms for an
appraisal. The results, reported
in $000, are shown on the table
(right).
At the .05 significance level, can
we conclude there is a
difference in the mean
appraised values of the homes?
Hypothesis Testing Involving
Paired Observations - Example
34
Step 1: State the null and alternate hypotheses.
H0: µd = 0
H1: µd ≠ 0
Step 2: State the level of significance.
The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
We will use the t-test
Hypothesis Testing Involving
Paired Observations - Example
35
Step 4: State the decision rule.
Reject H0 if
t > tα/2,n-1 or t < -tα/2,n-1
t > t.025,9 or t < -t.025,9
t > 2.262 or t < -2.262
Hypothesis Testing Involving
Paired Observations - Example
36
Step 5: Compute the value of t and make a decision
The computed value of t
is greater than the
higher critical value, so
our decision is to reject
the null hypothesis. We
conclude that there is a
difference in the mean
appraised values of the
homes.
Hypothesis Testing Involving
Paired Observations - Example
37
Hypothesis Testing Involving Paired Observations –
Excel Example
38
End of Chapter 11
Analysis of Variance
Chapter 12
2
GOALS
• List the characteristics of the F distribution.
• Conduct a test of hypothesis to determine whether the variances of two populations are equal.
• Discuss the general idea of analysis of variance.
• Organize data into a one-way and a two-way ANOVA table.
• Conduct a test of hypothesis among three or more treatment means.
• Develop confidence intervals for the difference in treatment means.
• Conduct a test of hypothesis among treatment means using a blocking variable.
• Conduct a two-way ANOVA with interaction.
3
Characteristics of F-Distribution
• There is a “family” of F Distributions.
• Each member of the family is determined by two parameters: the numerator degrees of freedom and the denominator degrees of freedom.
• F cannot be negative, and it is a continuous distribution.
• The F distribution is positively skewed.
• Its values range from 0 to ∞.
• As F → ∞ the curve approaches the X-axis.
4
Comparing Two Population Variances
The F distribution is used to test the hypothesis that the variance of one
normal population equals the variance of another normal population.
The following examples will show the use of the test:
• Two Barth shearing machines are set to produce steel bars of the same length. The bars, therefore, should have the same mean length. We want to ensure that in addition to having the same mean length they also have similar variation.
• The mean rate of return on two types of common stock may be the same, but there may be more variation in the rate of return in one than the other. A sample of 10 technology and 10 utility stocks shows the same mean rate of return, but there is likely more variation in the technology stocks.
• A study by the marketing department for a large newspaper found that men and women spent about the same amount of time per day reading the paper. However, the same report indicated there was nearly twice as much variation in time spent per day among the men than the women.
5
Test for Equal Variances
6
Test for Equal Variances - Example
Lammers Limos offers limousine service from the city hall in Toledo,
Ohio, to Metro Airport in Detroit. Sean Lammers, president of the
company, is considering two routes. One is via U.S. 25 and the
other via I-75. He wants to study the time it takes to drive to the
airport using each route and then compare the results. He collected
the following sample data, which is reported in minutes.
Using the .10 significance level, is there a difference in the variation
in the driving times for the two routes?
7
Step 1: The hypotheses are:
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Step 2: The significance level is .10.
Step 3: The test statistic is the F distribution.
Test for Equal Variances - Example
8
Step 4: State the decision rule.
Reject H0 if F > Fα/2,v1,v2
F > F.10/2,7-1,8-1
F > F.05,6,7 = 3.87
Test for Equal Variances - Example
9
Test for Equal Variances - Example
Step 5: Compute the value of F and make a decision
The decision is to reject the null hypothesis, because the computed F
value (4.23) is larger than the critical value (3.87).
We conclude that there is a difference in the variation of the travel times along
the two routes.
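The variance-ratio test can be sketched as below. The slide's data table is not reproduced in this transcript, so the driving times are assumed values for seven and eight trips; they are labeled as assumptions, although they do give a ratio close to the 4.23 reported above. The critical value 3.87 is taken from the slide. Placing the larger sample variance in the numerator keeps F at or above 1, so only the upper tail needs to be checked.

```python
from statistics import variance

def variance_ratio_f(sample_a, sample_b):
    """F statistic with the larger sample variance in the numerator,
    plus numerator and denominator degrees of freedom."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Assumed driving times in minutes (not taken from the transcript).
us_25 = [52, 67, 56, 45, 70, 54, 64]
i_75 = [59, 60, 61, 51, 56, 63, 57, 65]

f_stat, df_num, df_den = variance_ratio_f(us_25, i_75)
F_CRIT = 3.87  # upper-tail value for 6 and 7 degrees of freedom, from the slide
reject_h0 = f_stat > F_CRIT

print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) df, reject H0: {reject_h0}")
```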
10
Test for Equal Variances – Excel
Example
11
Comparing Means of Two or More
Populations
• The F distribution is also used for testing whether two or more sample means came from the same or equal populations.
• Assumptions:
– The sampled populations follow the normal distribution.
– The populations have equal standard deviations.
– The samples are randomly selected and are independent.
12
• The Null Hypothesis is that the population means are the same. The Alternative Hypothesis is that at least one of the means is different.
• The Test Statistic is the F distribution.
• The Decision rule is to reject the null hypothesis if F (computed) is greater than F (table) with numerator and denominator degrees of freedom.
• Hypothesis Setup and Decision Rule:
H0: µ1 = µ2 =…= µk
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Comparing Means of Two or More
Populations
13
Analysis of Variance – F statistic
• If there are k populations being sampled, the numerator degrees of freedom is k – 1.
• If there are a total of n observations the denominator degrees of freedom is n – k.
• The test statistic is computed by:
$$ F = \frac{SST/(k - 1)}{SSE/(n - k)} $$
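The F statistic can be computed directly from its definition. In the sketch below, SST is the treatment sum of squares and SSE is the error sum of squares, matching the formula above; the three groups of four observations are hypothetical values used only to show the arithmetic.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA, with its degrees of freedom."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Treatment sum of squares: variation of group means around the grand mean.
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error sum of squares: variation of observations around their own group mean.
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)

    f_stat = (ss_treatment / (k - 1)) / (ss_error / (n - k))
    return f_stat, k - 1, n - k

# Hypothetical observations for three treatments (illustrative only).
groups = [
    [55, 54, 59, 56],
    [66, 76, 67, 71],
    [47, 51, 46, 48],
]

f_stat, df_num, df_den = one_way_anova_f(groups)
print(f"F = {f_stat:.1f} with ({df_num}, {df_den}) degrees of freedom")
```

A large F means the group means differ by more than the within-group scatter would explain, which is the evidence against equal population means.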
14
Joyce Kuhlman manages a regional financial center. She wishes to
compare the productivity, as measured by the number of customers
served, among three employees. Four days are randomly selected
and the number of customers served by each employee is recorded.
The results are:
Comparing Means of Two or More
Populations – Illustrative Example
15
Comparing Means of Two or More
Populations – Illustrative Example
16
Recently a group of four major carriers
joined in hiring Brunner Marketing
Research, Inc., to survey recent
passengers regarding their level of
satisfaction with a recent flight.
The survey included questions on
ticketing, boarding, in-flight
service, baggage handling, pilot
communication, and so forth.
Twenty-five questions offered a
range of possible answers:
excellent, good, fair, or poor. A
response of excellent was given a
score of 4, good a 3, fair a 2, and
poor a 1. These responses were
then totaled, so the total score
was an indication of the
satisfaction with the flight. Brunner
Marketing Research, Inc.,
randomly selected and surveyed
passengers from the four airlines.
Comparing Means of Two or More
Populations – Example
Is there a difference in the mean
satisfaction level among the four
airlines?
Use the .01 significance level.
17
Step 1: State the null and alternate hypotheses.
H0: µE = µA = µT = µO
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Step 2: State the level of significance.
The .01 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
Because we are comparing means of more than
two groups, use the F statistic
Comparing Means of Two or More
Populations – Example
18
Step 4: State the decision rule.
Reject H0 if F > Fα,k-1,n-k
F > F.01,4-1,22-4
F > F.01,3,18
F > 5.09
Comparing Means of Two or More
Populations – Example
19
Step 5: Compute the value of F and make a decision
Comparing Means of Two or More
Populations – Example
20
Comparing Means of Two or More
Populations – Example
21
Computing SS Total and SSE
22
Computing SST
The computed value of F is 8.99, which is greater than the critical value of 5.09,
so the null hypothesis is rejected.
Conclusion: The population means are not all equal. The mean scores are not
the same for the four airlines; at this point we can only conclude there is a
difference in the treatment means. We cannot determine which treatment groups
differ or how many treatment groups differ.
23
Inferences About Treatment Means
l When we reject the null hypothesis
that the means are equal, we may
want to know which treatment means
differ.
l One of the simplest procedures is
through the use of confidence
intervals.
24
Confidence Interval for the
Difference Between Two Means
$$(\bar{X}_1 - \bar{X}_2) \pm t\sqrt{MSE\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
l where t is obtained from the t table with (n - k) degrees of freedom.
l MSE = SSE/(n - k)
25
From the previous example, develop a 95% confidence interval
for the difference in the mean rating for Eastern and Ozark.
Can we conclude that there is a difference between the two
airlines’ ratings?
The 95 percent confidence interval ranges from 10.46 up to
26.04. Both endpoints are positive; hence, we can conclude
these treatment means differ significantly. That is, passengers
on Eastern rated service significantly differently from those
on Ozark.
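The interval above can be reproduced with a short sketch. The inputs are assumptions, because the slide showing the calculation is not part of this text: sample means of 87.25 (Eastern) and 69.00 (Ozark), sample sizes of 4 and 6, MSE = 33.0, and t = 2.101 for 18 degrees of freedom. They are used here only because they give back the reported endpoints; the function name is hypothetical.

```python
import math


def treatment_mean_diff_ci(mean1, mean2, n1, n2, mse, t_crit):
    """Confidence interval for the difference between two treatment means."""
    margin = t_crit * math.sqrt(mse * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - margin, diff + margin


# Assumed inputs (not shown in this text): means, sample sizes, MSE, and t
low, high = treatment_mean_diff_ci(87.25, 69.00, 4, 6, 33.0, 2.101)
print(round(low, 2), round(high, 2))

# If the interval excludes zero, the two treatment means differ
print(low > 0 or high < 0)
```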
Confidence Interval for the
Difference Between Two Means - Example
26
Minitab
27
Excel
28
Two-Way Analysis of Variance
l For the two-factor ANOVA we test whether there is a
significant difference between the treatment effect
and whether there is a difference in the blocking
effect. Let Br be the block totals (r for rows)
l Let SSB represent the sum of squares for the blocks
where:
$$SSB = \sum\left[\frac{B_r^2}{k}\right] - \frac{(\sum X)^2}{n}$$
29
WARTA, the Warren Area Regional Transit Authority, is expanding bus
service from the suburb of Starbrick into the central business district of
Warren. There are four routes being considered from Starbrick to
downtown Warren: (1) via U.S. 6, (2) via the West End, (3) via the
Hickory Street Bridge, and (4) via Route 59.
WARTA conducted several tests to determine whether there was a difference
in the mean travel times along the four routes. Because there will be many
different drivers, the test was set up so each driver drove along each of the
four routes. Next slide shows the travel time, in minutes, for each driver-route
combination. At the .05 significance level, is there a difference in the mean
travel time along the four routes? If we remove the effect of the drivers, is
there a difference in the mean travel time?
Two-Way Analysis of Variance -
Example
30
Two-Way Analysis of Variance -
Example
31
Step 1: State the null and alternate hypotheses.
H0: µu = µw = µh = µr
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Step 2: State the level of significance.
The .05 significance level is stated in the problem.
Step 3: Find the appropriate test statistic.
Because we are comparing means of more than
two groups, use the F statistic
Two-Way Analysis of Variance -
Example
32
Step 4: State the decision rule.
Reject H0 if F > Fα,v1,v2
F > F.05,k-1,n-k
F > F.05,4-1,20-4
F > F.05,3,16
F > 3.24
Two-Way Analysis of Variance -
Example
33
34
35
Using Excel to perform the
calculations, the
computed value of F is
2.482, which is less than
the critical value of 3.24,
so our decision is to
not reject the null
hypothesis. We conclude
there is no difference in
the mean travel time along
the four routes. There is
no reason to select one of
the routes as faster than
the others.
Two-Way Analysis of Variance – Excel
Example
36
Two-Way ANOVA with Interaction
Interaction occurs if the combination of two factors has some effect
on the variable under study, in addition to each factor alone. We refer
to the variable being studied as the response variable.
An everyday illustration of interaction is the effect of diet and exercise
on weight. It is generally agreed that a person’s weight (the response
variable) can be controlled with two factors, diet and exercise.
Research shows that weight is affected by diet alone and that weight
is affected by exercise alone. However, the general recommended
method to control weight is based on the combined or interaction
effect of diet and exercise.
37
Graphical Observation of Mean Times
Our graphical observations show us that
interaction effects are possible. The next
step is to conduct statistical tests of
hypothesis to further investigate the
possible interaction effects. In summary,
our study of travel times has several
questions:
l Is there really an interaction between
routes and drivers?
l Are the travel times for the drivers the
same?
l Are the travel times for the routes the
same?
Of the three questions, we are most
interested in the test for interactions. To
put it another way, does a particular
route/driver combination result in
significantly faster (or slower) driving
times? Also, the results of the hypothesis
test for interaction affect the way we
analyze the route and driver questions.
38
Interaction Effect
l We can investigate these questions statistically by extending
the two-way ANOVA procedure presented in the previous
section. We add another source of variation, namely, the
interaction.
l In order to estimate the “error” sum of squares, we need at
least two measurements for each driver/route combination.
l As an example, suppose the experiment presented earlier is
repeated by measuring two more travel times for each driver
and route combination. That is, we replicate the experiment.
Now we have three observations for each driver/route
combination.
l Using the mean of three travel times for each driver/route
combination we get a more reliable measure of the mean travel
time.
39
Example – ANOVA with Replication
40
Three Tests in ANOVA with Replication
The ANOVA now has three sets of hypotheses
to test:
1. H0: There is no interaction between drivers and routes.
H1: There is interaction between drivers and routes.
2. H0: The driver means are the same.
H1: The driver means are not the same.
3. H0: The route means are the same.
H1: The route means are not the same.
41
ANOVA Table
42
Excel Output
43
44
End of Chapter 12
Linear Regression and
Correlation
Chapter 13
2
GOALS
l Understand and interpret the terms dependent and
independent variable.
l Calculate and interpret the coefficient of correlation,
the coefficient of determination, and the standard
error of estimate.
l Conduct a test of hypothesis to determine whether
the coefficient of correlation in the population is zero.
l Calculate the least squares regression line.
l Construct and interpret confidence and prediction
intervals for the dependent variable.
3
Regression Analysis - Introduction
l Recall in Chapter 4 the idea of showing the
relationship between two variables with a scatter
diagram was introduced.
l In that case we showed that, as the age of the buyer
increased, the amount spent for the vehicle also
increased.
l In this chapter we carry this idea further. Numerical
measures to express the strength of relationship
between two variables are developed.
l In addition, an equation is used to express the
relationship between variables, allowing us to
estimate one variable on the basis of another.
4
Regression Analysis - Uses
Some examples.
l Is there a relationship between the amount Healthtex
spends per month on advertising and its sales in the
month?
l Can we base an estimate of the cost to heat a home
in January on the number of square feet in the
home?
l Is there a relationship between the miles per gallon
achieved by large pickup trucks and the size of the
engine?
l Is there a relationship between the number of hours
that students studied for an exam and the score
earned?
5
Correlation Analysis
l Correlation Analysis is the study of the
relationship between variables. It is also
defined as a group of techniques to measure
the association between two variables.
l A Scatter Diagram is a chart that portrays
the relationship between the two variables. It
is the usual first step in correlation analysis.
– The Dependent Variable is the variable being
predicted or estimated.
– The Independent Variable provides the basis for
estimation. It is the predictor variable.
6
Regression Example
The sales manager of Copier Sales
of America, which has a large
sales force throughout the
United States and Canada,
wants to determine whether
there is a relationship between
the number of sales calls made
in a month and the number of
copiers sold that month. The
manager selects a random
sample of 10 representatives
and determines the number of
sales calls each representative
made last month and the
number of copiers sold.
7
Scatter Diagram
8
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the
strength of the relationship between two variables. It
requires interval or ratio-scaled data.
l It can range from -1.00 to 1.00.
l Values of -1.00 or 1.00 indicate perfect and strong
correlation.
l Values close to 0.0 indicate weak correlation.
l Negative values indicate an inverse relationship and
positive values indicate a direct relationship.
9
Perfect Correlation
10
Minitab Scatter Plots
11
Correlation Coefficient - Interpretation
12
Correlation Coefficient - Formula
13
Coefficient of Determination
The coefficient of determination (r²) is the
proportion of the total variation in the
dependent variable (Y) that is explained or
accounted for by the variation in the
independent variable (X). It is the square of
the coefficient of correlation.
l It ranges from 0 to 1.
l It does not give any information on the
direction of the relationship between the
variables.
14
Using the Copier Sales of
America data, for which a
scatterplot was
developed earlier,
compute the correlation
coefficient and
coefficient of
determination.
Correlation Coefficient - Example
15
Correlation Coefficient - Example
16
Correlation Coefficient – Excel Example
17
How do we interpret a correlation of 0.759?
First, it is positive, so we see there is a direct relationship between
the number of sales calls and the number of copiers sold. The value
of 0.759 is fairly close to 1.00, so we conclude that the association
is strong.
However, does this mean that more sales calls cause more sales?
No, we have not demonstrated cause and effect here, only that the
two variables—sales calls and copiers sold—are related.
Correlation Coefficient - Example
18
Coefficient of Determination (r²) - Example
•The coefficient of determination, r², is 0.576,
found by (0.759)²
•This is a proportion or a percent; we can say that
57.6 percent of the variation in the number of
copiers sold is explained, or accounted for, by the
variation in the number of sales calls.
19
Testing the Significance of
the Correlation Coefficient
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2
20
Testing the Significance of
the Correlation Coefficient - Example
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if:
t > tα/2,n-2 or t < -tα/2,n-2
t > t0.025,8 or t < -t0.025,8
t > 2.306 or t < -2.306
21
Testing the Significance of
the Correlation Coefficient - Example
The computed t (3.297) is within the rejection region; therefore, we reject H0. This means
the correlation in the population is not zero. From a practical standpoint, it indicates to the
sales manager that there is a correlation between the number of sales calls made
and the number of copiers sold in the population of salespeople.
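The computed t can be checked with a short sketch. The slide with the formula is not part of this text, so the code assumes the usual test statistic for a correlation coefficient, t = r√(n − 2)/√(1 − r²), with n − 2 degrees of freedom. The function name is hypothetical; r = 0.759, n = 10, and the critical value 2.306 come from the example.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: the population correlation is 0 (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_value = correlation_t_statistic(0.759, 10)
critical_t = 2.306  # two-tailed, .05 level, 8 degrees of freedom

# Reject H0 when the computed t falls in either tail
reject_h0 = abs(t_value) > critical_t
print(round(t_value, 3), reject_h0)
```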
22
Minitab
23
Linear Regression Model
24
Computing the Slope of the Line
25
Computing the Y-Intercept
26
Regression Analysis
In regression analysis we use the independent variable
(X) to estimate the dependent variable (Y).
l The relationship between the variables is linear.
l Both variables must be at least interval scale.
l The least squares criterion is used to determine the
equation.
27
Regression Analysis – Least Squares
Principle
l The least squares principle is used to
obtain a and b.
l The equations to determine a and b
are:
$$b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$$
$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$
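The two least squares equations translate directly into code. The sketch below uses a small hypothetical data set (not the Copier Sales of America data, which is not listed in this text), and the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line Y-hat = a + bX using the sums formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = [n(ΣXY) - (ΣX)(ΣY)] / [n(ΣX²) - (ΣX)²]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = ΣY/n - b(ΣX/n)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data for illustration only
a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(a, b)
```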
28
Illustration of the Least Squares
Regression Principle
29
Regression Equation - Example
Recall the example involving
Copier Sales of America. The
sales manager gathered
information on the number of
sales calls made and the
number of copiers sold for a
random sample of 10 sales
representatives. Use the least
squares method to determine a
linear equation to express the
relationship between the two
variables.
What is the expected number of
copiers sold by a representative
who made 20 calls?
30
Finding the Regression Equation - Example
The regression equation is:
$$\hat{Y} = a + bX$$
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$$
31
Computing the Estimates of Y
Step 1 – Using the regression equation, substitute the
value of each X to solve for the estimated sales
Soni Jones:
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(30) = 54.4736$$
Tom Keller:
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$$
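The substitution step can be written as a one-line function. This is a minimal sketch using the coefficients from the example; the function name `estimated_copiers` is hypothetical.

```python
A = 18.9476  # Y-intercept from the example
B = 1.1842   # slope from the example


def estimated_copiers(calls):
    """Estimated copiers sold for a given number of sales calls."""
    return A + B * calls


# Tom Keller made 20 calls and Soni Jones made 30
keller = estimated_copiers(20)
jones = estimated_copiers(30)
print(round(keller, 4), round(jones, 4))
```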
32
Plotting the Estimated and the Actual Y’s
33
The Standard Error of Estimate
l The standard error of estimate measures the
scatter, or dispersion, of the observed values
around the line of regression
l The formulas that are used to compute the
standard error:
$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$
$$s_{y \cdot x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$$
34
Standard Error of the Estimate - Example
Recall the example involving
Copier Sales of America.
The sales manager
determined the least
squares regression
equation is given below.
Determine the standard error
of estimate as a measure
of how well the values fit
the regression line.
$$\hat{Y} = 18.9476 + 1.1842X$$
$$s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901$$
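The arithmetic in this example is easy to verify. The sketch below takes the sum of squared residuals (784.211) and the sample size (10) from the example; the function name is hypothetical.

```python
import math


def standard_error_of_estimate(sum_sq_residuals, n):
    """Standard error of estimate: sqrt(sum of (Y - Y-hat)^2 / (n - 2))."""
    return math.sqrt(sum_sq_residuals / (n - 2))


s_yx = standard_error_of_estimate(784.211, 10)
print(round(s_yx, 3))
```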
35
Graphical Illustration of the Differences between Actual
Y and Estimated Y, $(Y - \hat{Y})$
36
Standard Error of the Estimate - Excel
37
Assumptions Underlying Linear
Regression
l For each value of X, there is a group of Y values, and these
Y values are normally distributed.
l The means of these normal distributions of Y values all lie
on the straight line of regression.
l The standard deviations of these normal distributions are equal.
l The Y values are statistically independent. This means that in
the selection of a sample, the Y values chosen for a particular X
value do not depend on the Y values for any other X values.
38
Confidence Interval and Prediction
Interval Estimates of Y
•A confidence interval reports the mean value of Y
for a given X.
•A prediction interval reports the range of values
of Y for a particular value of X.
39
Confidence Interval Estimate - Example
We return to the Copier Sales of America
illustration. Determine a 95 percent confidence
interval for all sales representatives who make
25 calls.
40
Step 1 – Compute the point estimate of Y
In other words, determine the number of copiers we expect a sales
representative to sell if he or she makes 25 calls.
The regression equation is:
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$$
Confidence Interval Estimate - Example
41
Step 2 – Find the value of t
l To find the t value, we need to first know the number
of degrees of freedom. In this case the degrees of
freedom is n - 2 = 10 – 2 = 8.
l We set the confidence level at 95 percent. To find the
value of t, move down the left-hand column of
Appendix B.2 to 8 degrees of freedom, then move
across to the column with the 95 percent level of
confidence.
l The value of t is 2.306.
Confidence Interval Estimate - Example
42
Confidence Interval Estimate - Example
43
Confidence Interval Estimate - Example
Step 4 – Use the formula above by substituting the numbers computed
in previous slides
Thus, the 95 percent confidence interval for the average sales of all
sales representatives who make 25 calls is from 40.9170 up to
56.1882 copiers.
44
Prediction Interval Estimate - Example
We return to the Copier Sales of America
illustration. Determine a 95 percent
prediction interval for Sheila Baker, a West
Coast sales representative who made 25
calls.
45
Step 1 – Compute the point estimate of Y
In other words, determine the number of copiers we
expect a sales representative to sell if he or she
makes 25 calls.
The regression equation is:
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$$
Prediction Interval Estimate - Example
46
Step 2 – Using the information computed
earlier in the confidence interval estimation
example, use the formula above.
Prediction Interval Estimate - Example
If Sheila Baker makes 25 sales calls, the number of copiers she
will sell will be between about 24 and 73 copiers.
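Both intervals can be reproduced with a short sketch. The interval formulas are not part of this text, so the code assumes the usual ones: Ŷ ± t·s√(1/n + (X − X̄)²/Σ(X − X̄)²) for the confidence interval, with an extra 1 under the square root for the prediction interval. The mean number of calls (22) and Σ(X − X̄)² (760) are also assumed inputs; they are used because they give back the endpoints reported in the example. The function name is hypothetical.

```python
import math


def interval_half_widths(t_crit, s_yx, n, x, x_mean, sxx):
    """Return the half-widths of the confidence and prediction intervals."""
    base = 1 / n + (x - x_mean) ** 2 / sxx
    ci_half = t_crit * s_yx * math.sqrt(base)      # mean of Y at this X
    pi_half = t_crit * s_yx * math.sqrt(1 + base)  # a single Y at this X
    return ci_half, pi_half


y_hat = 18.9476 + 1.1842 * 25
# x_mean=22 and sxx=760 are assumptions not shown in this text
ci_half, pi_half = interval_half_widths(2.306, 9.901, 10, 25, 22, 760)

ci = (y_hat - ci_half, y_hat + ci_half)
pi = (y_hat - pi_half, y_hat + pi_half)
print(ci)
print(pi)
```

The prediction interval is wider because it covers one representative's sales rather than the mean for all representatives who make 25 calls.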
47
Confidence and Prediction Intervals –
Minitab Illustration
48
Transforming Data
l The coefficient of correlation describes the
strength of the linear relationship between
two variables. It could be that two variables
are closely related, but their relationship is
not linear.
l Be cautious when you are interpreting the
coefficient of correlation. A value of r may
indicate there is no linear relationship, but it
could be there is a relationship of some other
nonlinear or curvilinear form.
49
Transforming Data - Example
On the right is a listing of 22 professional
golfers, the number of events in
which they participated, the amount
of their winnings, and their mean
score for the 2004 season. In golf,
the objective is to play 18 holes in
the least number of strokes. So, we
would expect that those golfers with
the lower mean scores would have
the larger winnings. To put it another
way, score and winnings should be
inversely related. In 2004 Tiger
Woods played in 19 events, earned
$5,365,472, and had a mean score
per round of 69.04. Fred Couples
played in 16 events, earned
$1,396,109, and had a mean score
per round of 70.92. The data for the
22 golfers follows.
50
Scatterplot of Golf Data
l The correlation between the
variables Winnings and
Score is -0.782. This is a
fairly strong inverse
relationship.
l However, when we plot the
data on a scatter diagram
the relationship does not
appear to be linear; it does
not seem to follow a straight
line.
51
What can we do to explore other (nonlinear)
relationships?
One possibility is to transform one of the
variables. For example, instead of using Y as
the dependent variable, we might use its log,
reciprocal, square, or square root. Another
possibility is to transform the independent
variable in the same way. There are other
transformations, but these are the most
common.
52
In the golf winnings
example, changing the
scale of the dependent
variable is effective. We
determine the log of each
golfer’s winnings and
then find the correlation
between the log of
winnings and score. That
is, we find the log to the
base 10 of Tiger Woods’
earnings of $5,365,472,
which is 6.72961.
Transforming Data - Example
53
Scatter Plot of Transformed Y
54
Linear Regression Using the
Transformed Y
55
Using the Transformed Equation for
Estimation
Based on the regression equation, a golfer with
a mean score of 70 could expect to earn:
•The value 6.4372 is the log to the base 10 of winnings.
•The antilog of 6.4372 is approximately 2,736,528.
•So a golfer who had a mean score of 70 could expect to
earn $2,736,528.
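Converting the estimate back to dollars is a single exponentiation. This minimal sketch takes the predicted log value 6.4372 from the slide; the variable names are illustrative. Because the log is rounded to four decimals, the result agrees with the slide's dollar figure only to about the nearest hundred dollars.

```python
# Predicted log10 of winnings for a mean score of 70 (from the slide)
log10_winnings = 6.4372

# The antilog (base 10) undoes the log transformation of Y
estimated_winnings = 10 ** log10_winnings
print(round(estimated_winnings))
```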
56
End of Chapter 13
Multiple Linear Regression and
Correlation Analysis
Chapter 14
2
GOALS
l Describe the relationship between several independent variables and
a dependent variable using multiple regression analysis.
l Set up, interpret, and apply an ANOVA table
l Compute and interpret the multiple standard error of estimate, the
coefficient of multiple determination, and the adjusted coefficient of
multiple determination.
l Conduct a test of hypothesis to determine whether regression
coefficients differ from zero.
l Conduct a test of hypothesis on each of the regression coefficients.
l Use residual analysis to evaluate the assumptions of multiple
regression analysis.
l Evaluate the effects of correlated independent variables.
l Use and understand qualitative independent variables.
l Understand and interpret the stepwise regression method.
l Understand and interpret possible interaction among independent
variables.
3
Multiple Regression Analysis
The general multiple regression with k
independent variables is given by:
The least squares criterion is used to develop
this equation. Because determining b1, b2, etc. is
very tedious, a software package such as Excel
or MINITAB is recommended.
4
Multiple Regression Analysis
For two independent variables, the general form
of the multiple regression equation is:
•X1 and X2 are the independent variables.
•a is the Y-intercept
•b1 is the net change in Y for each unit change in X1 holding X2
constant. It is called a partial regression coefficient, a net regression
coefficient, or just a regression coefficient.
5
Regression Plane for a 2-Independent
Variable Linear Regression Equation
6
Salsberry Realty sells homes along the east
coast of the United States. One of the
questions most frequently asked by
prospective buyers is: If we purchase this
home, how much can we expect to pay to
heat it during the winter? The research
department at Salsberry has been asked to
develop some guidelines regarding heating
costs for single-family homes.
Three variables are thought to relate to the
heating costs: (1) the mean daily outside
temperature, (2) the number of inches of
insulation in the attic, and (3) the age in
years of the furnace.
To investigate, Salsberry’s research department
selected a random sample of 20 recently
sold homes. It determined the cost to heat
each home last January, as well
Multiple Linear Regression - Example
7
Multiple Linear Regression - Example
8
Multiple Linear Regression – Minitab
Example
9
Multiple Linear Regression – Excel
Example
10
The Multiple Regression Equation –
Interpreting the Regression Coefficients
The regression coefficient for mean outside temperature is -4.583. The coefficient is
negative and shows an inverse relationship between heating cost and temperature.
As the outside temperature increases, the cost to heat the home decreases. The
numeric value of the regression coefficient provides more information. If we
increase temperature by 1 degree and hold the other two independent variables
constant, we can estimate a decrease of $4.583 in monthly heating cost. So if the
mean temperature in Boston is 25 degrees and it is 35 degrees in Philadelphia, all
other things being the same (insulation and age of furnace), we expect the heating
cost would be $45.83 less in Philadelphia.
The attic insulation variable also shows an inverse relationship: the more insulation in
the attic, the less the cost to heat the home. So the negative sign for this coefficient
is logical. For each additional inch of insulation, we expect the cost to heat the
home to decline $14.83 per month, regardless of the outside temperature or the
age of the furnace.
The age of the furnace variable shows a direct relationship. With an older furnace, the
cost to heat the home increases. Specifically, for each additional year older the
furnace is, we expect the cost to increase $6.10 per month.
11
Applying the Model for Estimation
What is the estimated heating cost for a home if the
mean outside temperature is 30 degrees, there
are 5 inches of insulation in the attic, and the
furnace is 10 years old?
12
Multiple Standard Error of
Estimate
The multiple standard error of estimate is a measure of the
effectiveness of the regression equation.
l It is measured in the same units as the dependent
variable.
l It is difficult to determine what is a large value and what
is a small value of the standard error.
l The formula is:
13
14
Multiple Regression and
Correlation Assumptions
l The independent variables and the dependent
variable have a linear relationship. The dependent
variable must be continuous and at least interval-
scale.
l The variation in the residuals must be the same for all values of Y.
When this is the case, we say the residuals exhibit
homoscedasticity.
l The residuals should follow the normal distribution
with mean 0.
l Successive values of the dependent variable must
be uncorrelated.
15
The ANOVA Table
The ANOVA table reports the variation in the
dependent variable. The variation is divided
into two components.
l The Explained Variation is that accounted for
by the set of independent variables.
l The Unexplained or Random Variation is not
accounted for by the independent variables.
16
Minitab – the ANOVA Table
17
Coefficient of Multiple Determination (R²)
Characteristics of the coefficient of multiple determination:
1. It is symbolized by a capital R squared. In other words, it is written
as R² because it behaves like the square of a correlation coefficient.
2. It can range from 0 to 1. A value near 0 indicates little association
between the set of independent variables and the dependent
variable. A value near 1 means a strong association.
3. It cannot assume negative values. Any number that is squared or
raised to the second power cannot be negative.
4. It is easy to interpret. Because R² is a value between 0 and 1, it is easy
to interpret, compare, and understand.
18
Minitab – the ANOVA Table
$$R^2 = \frac{SSR}{SS\ total} = \frac{171{,}220}{212{,}916} = 0.804$$
19
Adjusted Coefficient of Determination
l The number of independent variables in a multiple
regression equation makes the coefficient of
determination larger. Each new independent variable
causes the predictions to be more accurate.
l If the number of variables, k, and the sample size, n,
are equal, the coefficient of determination is 1.0. In
practice, this situation is rare and would also be
ethically questionable.
l To balance the effect that the number of independent
variables has on the coefficient of multiple
determination, statistical software packages use an
adjusted coefficient of multiple determination.
20
21
Correlation Matrix
A correlation matrix is used to show all
possible simple correlation
coefficients among the variables.
l The matrix is useful for locating
correlated independent variables.
l It shows how strongly each
independent variable is correlated
with the dependent variable.
22
Global Test: Testing the Multiple
Regression Model
The global test is used to investigate
whether any of the independent
variables have significant coefficients.
The hypotheses are:
$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_1: \text{Not all } \beta\text{s equal } 0$$
23
Global Test continued
l The test statistic follows the F
distribution with k (number of
independent variables) and
n-(k+1) degrees of freedom, where
n is the sample size.
l Decision Rule:
Reject H0 if F > Fα,k,n-k-1
24
Finding the Critical F
25
Finding the Computed F
26
Interpretation
l The computed value of F is
21.90, which is in the rejection
region.
l The null hypothesis that all the
multiple regression coefficients
are zero is therefore rejected.
l Interpretation: some of the
independent variables (amount
of insulation, etc.) do have the
ability to explain the variation in
the dependent variable (heating
cost).
l Logical question – which ones?
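The computed F of 21.90 can be recovered from the ANOVA sums of squares. The sketch below assumes the usual global-test ratio, F = (SSR/k) / (SSE/(n − (k + 1))), with SSE = SS total − SSR. SSR = 171,220 and SS total = 212,916 come from the R² slide, n = 20 homes and k = 3 independent variables come from the example; the function name is hypothetical.

```python
def r_squared_and_global_f(ssr, ss_total, n, k):
    """Return (R squared, global F) from the regression ANOVA table."""
    sse = ss_total - ssr                  # unexplained variation
    r_squared = ssr / ss_total            # explained share of variation
    f = (ssr / k) / (sse / (n - (k + 1)))
    return r_squared, f


r2, f_stat = r_squared_and_global_f(171220, 212916, n=20, k=3)
print(round(r2, 3), round(f_stat, 2))
```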
27
Evaluating Individual Regression
Coefficients (βi = 0)
l This test is used to determine which independent variables have nonzero
regression coefficients.
l The variables that have zero regression coefficients are usually dropped
from the analysis.
l The test statistic follows the t distribution with n-(k+1) degrees of freedom.
l The hypothesis test is as follows:
H0: βi = 0
H1: βi ≠ 0
Reject H0 if t > tα/2,n-k-1 or t < -tα/2,n-k-1
28
Critical t-stat for the Slopes
-2.120 2.120
29
Computed t-stat for the Slopes
30
Conclusion on Significance of Slopes
31
New Regression Model without
Variable “Age” – Minitab
32
New Regression Model without Variable
“Age” – Minitab
33
Testing the New Model for Significance
34
Critical t-stat for the New Slopes
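Reject H0 if:
$$t < -t_{\alpha/2,\,n-k-1} \quad \text{or} \quad t > t_{\alpha/2,\,n-k-1}$$
$$\frac{b_i - 0}{s_{b_i}} < -t_{\alpha/2,\,n-k-1} \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > t_{\alpha/2,\,n-k-1}$$
$$\frac{b_i - 0}{s_{b_i}} < -t_{.05/2,\,20-2-1} \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > t_{.05/2,\,20-2-1}$$
$$\frac{b_i - 0}{s_{b_i}} < -t_{.025,\,17} \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > t_{.025,\,17}$$
$$\frac{b_i - 0}{s_{b_i}} < -2.110 \quad \text{or} \quad \frac{b_i - 0}{s_{b_i}} > 2.110$$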
35
Conclusion on Significance of New
Slopes
36
Evaluating the
Assumptions of Multiple Regression
1. There is a linear relationship. That is, there is a straight-line
relationship between the dependent variable and the set of
independent variables.
2. The variation in the residuals is the same for both large and
small values of the estimated Y. To put it another way, the
size of the residual is unrelated to whether the estimated Y is large or small.
3. The residuals follow the normal probability distribution.
4. The independent variables should not be correlated. That is,
we would like to select a set of independent variables that are
not themselves correlated.
5. The residuals are independent. This means that successive
observations of the dependent variable are not correlated. This
assumption is often violated when time is involved with the
sampled observations.
37
Analysis of Residuals
A residual is the difference between the
actual value of Y and the predicted
value of Y. Residuals should be
approximately normally distributed.
Histograms and stem-and-leaf charts
are useful in checking this requirement.
l A plot of the residuals and their
corresponding Y’ values is used for
showing that there are no trends or
patterns in the residuals.
38
Scatter Diagram
39
Residual Plot
40
Distribution of Residuals
Both MINITAB and Excel offer another graph that helps to evaluate the
assumption of normally distributed residuals. It is called a normal
probability plot and is shown to the right of the histogram.
41
Multicollinearity
l Multicollinearity exists when independent
variables (X’s) are correlated.
l Correlated independent variables make it
difficult to make inferences about the
individual regression coefficients (slopes)
and their individual effects on the dependent
variable (Y).
l However, correlated independent variables
do not affect a multiple regression equation’s
ability to predict the dependent variable (Y).
42
Variance Inflation Factor
l A general rule is if the correlation between two independent
variables is between -0.70 and 0.70 there likely is not a problem
using both of the independent variables.
l A more precise test is to use the variance inflation factor (VIF).
l The value of VIF is found as follows:
•The term R²j refers to the coefficient of determination, where the selected
independent variable is used as a dependent variable and the remaining
independent variables are used as independent variables.
•A VIF greater than 10 is considered unsatisfactory, indicating that
independent variable should be removed from the analysis.
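The two rules on this slide can be related with a short sketch. It assumes the standard definition VIF = 1 / (1 - R_j^2); the function name `vif` is hypothetical. With only two independent variables, R_j^2 equals the squared correlation between them, so the 0.70 rule of thumb can be translated into a VIF value.

```python
def vif(r2_j):
    """Variance inflation factor for one independent variable.

    r2_j is the coefficient of determination obtained when that variable
    is regressed on the remaining independent variables.
    """
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)


# Two-variable case: a correlation of 0.70 gives R_j^2 = 0.49.
vif_at_rule_of_thumb = vif(0.70 ** 2)

# The cutoff of 10 corresponds to R_j^2 = 0.90.
vif_at_cutoff = vif(0.90)
```

Under these assumptions, a correlation of 0.70 between two independent variables corresponds to a VIF of a little under 2, well below the cutoff of 10.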
43
Multicollinearity – Example
Refer to the data in the
table, which relates the
heating cost to the
independent variables
outside temperature,
amount of insulation,
and age of furnace.
Develop a correlation
matrix for all the
independent variables.
Does it appear there is a
problem with
multicollinearity?
Find and interpret the
variance inflation factor
for each of the
independent variables.
44
Correlation Matrix - Minitab
45
VIF – Minitab Example
The VIF value of 1.32 is less than the upper limit
of 10. This indicates that the independent variable
temperature is not strongly correlated with the
other independent variables.
Coefficient of
Determination
46
Independence Assumption
l The fifth assumption about regression and
correlation analysis is that successive
residuals should be independent.
l When successive residuals are correlated we
refer to this condition as autocorrelation.
Autocorrelation frequently occurs when the
data are collected over a period of time.
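One simple way to look for correlated successive residuals is to correlate each residual with the one that follows it. This lag-1 correlation is an illustrative check that the slides do not name, and the residual values below are hypothetical.

```python
from math import sqrt


def lag1_correlation(resid):
    """Pearson correlation between each residual and the next one."""
    x = resid[:-1]   # residuals 1 .. n-1
    y = resid[1:]    # residuals 2 .. n
    n = len(x)
    if n < 2:
        raise ValueError("need at least three residuals")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("residuals must not be constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical residuals in time order: a run above zero, then a run below.
resid = [2.0, 3.0, 1.0, 2.0, -1.0, -2.0, -3.0, -2.0]
r1 = lag1_correlation(resid)
```

A value of `r1` well above zero, as in this made-up series, is the numerical counterpart of the runs of residuals that suggest autocorrelation.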
47
Residual Plot versus Fitted Values
l The graph below shows the
residuals plotted on the
vertical axis and the fitted
values on the horizontal
axis.
l Note the run of residuals
above the mean of the
residuals, followed by a run
below the mean. A scatter
plot such as this would
indicate possible
autocorrelation.
48
Qualitative Independent Variables
l Frequently we wish to use nominal-scale
variables—such as gender, whether the
home has a swimming pool, or whether the
sports team was the home or the visiting
team—in our analysis. These are called
qualitative variables.
l To use a qualitative variable in regression
analysis, we use a scheme of dummy
variables in which one of the two possible
conditions is coded 0 and the other 1.
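The coding scheme can be sketched in a few lines. The category labels and the helper name `dummy` are hypothetical, chosen only to illustrate the 0/1 coding.

```python
def dummy(value, coded_as_one):
    """Return 1 if value is the condition coded as 1, otherwise 0."""
    return 1 if value == coded_as_one else 0


# Hypothetical nominal-scale observations for four homes.
homes = ["no garage", "garage", "garage", "no garage"]

# The dummy variable that would enter the regression as an independent variable.
garage = [dummy(h, "garage") for h in homes]
```

Which condition is coded 1 is a matter of choice; it only changes the sign of the estimated coefficient.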
49
Qualitative Variable - Example
Suppose in the Salsberry
Realty example that the
independent variable
“garage” is added. For those
homes without an attached
garage, 0 is used; for homes
with an attached garage, a 1
is used. We will refer to this
as the “garage” variable. The
data from Table 14–2 are
entered into the MINITAB
system.
50
Qualitative Variable - Minitab
51
Using the Model for Estimation
What is the effect of the garage variable? Suppose we have two houses exactly
alike next to each other in Buffalo, New York; one has an attached garage,
and the other does not. Both homes have 3 inches of insulation, and the
mean January temperature in Buffalo is 20 degrees.
For the house without an attached garage, a 0 is substituted for the garage variable
in the regression equation. The estimated heating cost is $280.90.
For the house with an attached garage, a 1 is substituted for the garage variable
in the regression equation. The estimated heating cost is $358.30.
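The two estimates differ by exactly the coefficient of the garage variable, because everything else in the equation is identical for the two houses. The short derivation below uses assumed symbols that are not taken from the slide: $a$ for the intercept and $b_T$, $b_I$, $b_G$ for the temperature, insulation, and garage coefficients.

```latex
\begin{aligned}
\hat{Y}_{\text{with}} - \hat{Y}_{\text{without}}
  &= \bigl(a + b_T(20) + b_I(3) + b_G(1)\bigr) - \bigl(a + b_T(20) + b_I(3) + b_G(0)\bigr) \\
  &= b_G \\
  &= 358.30 - 280.90 = 77.40
\end{aligned}
```

So with 0/1 coding, the garage coefficient is the estimated difference in heating cost between otherwise identical homes.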
52
Testing the Model for Significance
l We have shown the difference between the two
types of homes to be $77.40, but is the difference
significant?
l We conduct the following test of hypothesis.
H0: βi = 0
H1: βi ≠ 0
Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
53
Evaluating Individual Regression
Coefficients (βi = 0)
l This test is used to determine which independent variables have nonzero
regression coefficients.
l The variables that have zero regression coefficients are usually dropped
from the analysis.
l The test statistic follows the t distribution with
n − (k + 1), or n − k − 1, degrees of freedom.
l The hypothesis test is as follows:
H0: βi = 0
H1: βi ≠ 0
Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
54
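Reject $H_0$ if:

$$
\begin{aligned}
t &> t_{\alpha/2,\,n-k-1} &\text{or}\quad t &< -t_{\alpha/2,\,n-k-1} \\
\frac{b_i - 0}{s_{b_i}} &> t_{\alpha/2,\,n-k-1} &\text{or}\quad \frac{b_i - 0}{s_{b_i}} &< -t_{\alpha/2,\,n-k-1} \\
\frac{b_i - 0}{s_{b_i}} &> t_{.05/2,\,20-3-1} &\text{or}\quad \frac{b_i - 0}{s_{b_i}} &< -t_{.05/2,\,20-3-1} \\
\frac{b_i - 0}{s_{b_i}} &> t_{.025,\,16} &\text{or}\quad \frac{b_i - 0}{s_{b_i}} &< -t_{.025,\,16} \\
\frac{b_i - 0}{s_{b_i}} &> 2.120 &\text{or}\quad \frac{b_i - 0}{s_{b_i}} &< -2.120
\end{aligned}
$$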
Conclusion: The regression coefficient is not zero. The independent
variable garage should be included in the analysis.
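The decision rule can be sketched as a small function. The critical value 2.120 (t with 16 degrees of freedom at the .05 level, two-tailed) and the $77.40 garage effect come from the slides; the standard error of the coefficient is not shown in the extracted text, so the value used below is hypothetical and serves only to make the example run.

```python
def t_statistic(b_i, s_b_i, hypothesized=0.0):
    """Test statistic for one regression coefficient: (b_i - 0) / s_b_i."""
    if s_b_i <= 0:
        raise ValueError("standard error must be positive")
    return (b_i - hypothesized) / s_b_i


def reject_h0(t, t_critical):
    """Two-tailed rule: reject H0 if t > t_critical or t < -t_critical."""
    return t > t_critical or t < -t_critical


T_CRITICAL = 2.120   # from the slides: t(.025, 16), with n = 20 and k = 3
b_garage = 77.40     # from the slides: estimated effect of the garage variable
s_b_garage = 25.0    # HYPOTHETICAL standard error, not taken from the slides

t = t_statistic(b_garage, s_b_garage)
decision = reject_h0(t, T_CRITICAL)
```

With the assumed standard error the statistic falls in the rejection region, which matches the direction of the conclusion above, but the actual t value should be read from the MINITAB output.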
55
Stepwise Regression
The advantages to the stepwise method are:
1. Only independent variables with significant regression
coefficients are entered into the equation.
2. The steps involved in building the regression equation are clear.
3. It is efficient in finding the regression equation with only
significant regression coefficients.
4. The changes in the multiple standard error of estimate and the
coefficient of determination are shown.
56
The stepwise MINITAB output for the heating cost
problem follows.
Temperature is
selected first. This
variable explains
more of the
variation in heating
cost than any of the
other three
proposed
independent
variables.
Garage is selected
next, followed by
Insulation.
Stepwise Regression – Minitab Example
57
Regression Models with Interaction
l In Chapter 12 we discussed interaction among independent variables.
To explain, suppose we are studying weight loss and assume, as the
current literature suggests, that diet and exercise are related. So the
dependent variable is amount of change in weight and the
independent variables are: diet (yes or no) and exercise (none,
moderate, significant). We are interested in whether there is interaction
among the independent variables. That is, if those studied maintain
their diet and exercise significantly, will that increase the mean amount
of weight lost? Is total weight loss more than the sum of the loss due to
the diet effect and the loss due to the exercise effect?
l In regression analysis, interaction can be examined as a separate
independent variable. An interaction prediction variable can be
developed by multiplying the data values in one independent variable
by the values in another independent variable, thereby creating a new
independent variable. A two-variable model that includes an interaction
term is:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$
58
Refer to the heating cost
example. Is there an
interaction between
the outside
temperature and the
amount of insulation?
If both variables are
increased, is the
effect on heating cost
greater than the sum
of savings from
warmer temperature
and the savings from
increased insulation
separately?
Regression Models with Interaction -
Example
59
Creating the Interaction Variable – Using the
information from the table in the previous slide, an
interaction variable is created by multiplying the
temperature variable by the insulation.
For the first sampled home the value of temperature is 35
degrees and insulation is 3 inches, so the value of
the interaction variable is 35 × 3 = 105. The values
of the other interaction products are found in a
similar fashion.
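The construction of the interaction variable can be sketched as an element-wise product. Only the first home (35 degrees, 3 inches) is taken from the slide; the remaining rows are hypothetical placeholders for the rest of the table.

```python
# First entry of each list is from the slide; the other entries are hypothetical.
temperature = [35, 29, 36]   # mean outside temperature, degrees
insulation = [3, 4, 7]       # attic insulation, inches

# New independent variable: temperature multiplied by insulation, home by home.
interaction = [t * i for t, i in zip(temperature, insulation)]
```

The resulting column is then entered into the regression alongside temperature and insulation as a third independent variable.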
Regression Models with Interaction -
Example
60
Regression Models with Interaction -
Example
61
The regression equation is:
Is the interaction variable significant at the 0.05
significance level?
Regression Models with Interaction -
Example
62
There are other situations that can occur when studying
interaction among independent variables.
1. It is possible to have a three-way interaction among
the independent variables. In the heating example,
we might have considered the three-way interaction
between temperature, insulation, and age of the
furnace.
2. It is possible to have an interaction where one of the
independent variables is nominal scale. In our
heating cost example, we could have studied the
interaction between temperature and garage.
63
End of Chapter 14

1667390753_Lind Chapter 10-14.pdf

  • 1. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin One Sample Tests of Hypothesis Chapter 10
  • 2. 2 GOALS l Define a hypothesis and hypothesis testing. l Describe the five-step hypothesis-testing procedure. l Distinguish between a one-tailed and a two-tailed test of hypothesis. l Conduct a test of hypothesis about a population mean. l Conduct a test of hypothesis about a population proportion. l Define Type I and Type II errors. l Compute the probability of a Type II error.
  • 3. 3 What is a Hypothesis? A Hypothesis is a statement about the value of a population parameter developed for the purpose of testing. Examples of hypotheses made about a population parameter are: – The mean monthly income for systems analysts is $3,625. – Twenty percent of all customers at Bovine’s Chop House return for another meal within a month.
  • 4. 4 What is Hypothesis Testing? Hypothesis testing is a procedure, based on sample evidence and probability theory, used to determine whether the hypothesis is a reasonable statement and should not be rejected, or is unreasonable and should be rejected.
  • 6. 6 Important Things to Remember about H0 and H1 l H0: null hypothesis and H1: alternate hypothesis l H0 and H1 are mutually exclusive and collectively exhaustive l H0 is always presumed to be true l H1 has the burden of proof l A random sample (n) is used to “reject H0” l If we conclude 'do not reject H0', this does not necessarily mean that the null hypothesis is true, it only suggests that there is not sufficient evidence to reject H0; rejecting the null hypothesis then, suggests that the alternative hypothesis may be true. l Equality is always part of H0 (e.g. “=” , “≥” , “≤”). l “≠” “<” and “>” always part of H1
  • 7. 7 How to Set Up a Claim as Hypothesis l In actual practice, the status quo is set up as H0 l If the claim is “boastful” the claim is set up as H1 (we apply the Missouri rule – “show me”). Remember, H1 has the burden of proof l In problem solving, look for key words and convert them into symbols. Some key words include: “improved, better than, as effective as, different from, has changed, etc.”
  • 8. 8 Left-tail or Right-tail Test? Keywords Inequality Symbol Part of: Larger (or more) than > H1 Smaller (or less) < H1 No more than £ H0 At least ≥ H0 Has increased > H1 Is there difference? ≠ H1 Has not changed = H0 Has “improved”, “is better than”. “is more effective” See right H1 • The direction of the test involving claims that use the words “has improved”, “is better than”, and the like will depend upon the variable being measured. • For instance, if the variable involves time for a certain medication to take effect, the words “better” “improve” or more effective” are translated as “<” (less than, i.e. faster relief). • On the other hand, if the variable refers to a test score, then the words “better” “improve” or more effective” are translated as “>” (greater than, i.e. higher test scores)
  • 9. 9
  • 10. 10 Parts of a Distribution in Hypothesis Testing
  • 12. 12 Hypothesis Setups for Testing a Mean (m)
  • 13. 13 Hypothesis Setups for Testing a Proportion (p)
  • 14. 14 Testing for a Population Mean with a Known Population Standard Deviation- Example Jamestown Steel Company manufactures and assembles desks and other office equipment at several plants in western New York State. The weekly production of the Model A325 desk at the Fredonia Plant follows the normal probability distribution with a mean of 200 and a standard deviation of 16. Recently, because of market expansion, new production methods have been introduced and new employees hired. The vice president of manufacturing would like to investigate whether there has been a change in the weekly production of the Model A325 desk.
  • 15. 15 Testing for a Population Mean with a Known Population Standard Deviation- Example Step 1: State the null hypothesis and the alternate hypothesis. H0: m = 200 H1: m ≠ 200 (note: keyword in the problem “has changed”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use Z-distribution since σ is known
  • 16. 16 Testing for a Population Mean with a Known Population Standard Deviation- Example Step 4: Formulate the decision rule. Reject H0 if |Z| > Za/2 58 . 2 not is 55 . 1 50 / 16 200 5 . 203 / 2 / 01 . 2 / 2 / > > - > - > Z Z n X Z Z a a s m Step 5: Make a decision and interpret the result. Because 1.55 does not fall in the rejection region, H0 is not rejected. We conclude that the population mean is not different from 200. So we would report to the vice president of manufacturing that the sample evidence does not show that the production rate at the Fredonia Plant has changed from 200 per week.
  • 17. 17 Suppose in the previous problem the vice president wants to know whether there has been an increase in the number of units assembled. To put it another way, can we conclude, because of the improved production methods, that the mean number of desks assembled in the last 50 weeks was more than 200? Recall: σ=16, n=200, α=.01 Testing for a Population Mean with a Known Population Standard Deviation- Another Example
  • 18. 18 Testing for a Population Mean with a Known Population Standard Deviation- Example Step 1: State the null hypothesis and the alternate hypothesis. H0: m ≤ 200 H1: m > 200 (note: keyword in the problem “an increase”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use Z-distribution since σ is known
  • 19. 19 Testing for a Population Mean with a Known Population Standard Deviation- Example Step 4: Formulate the decision rule. Reject H0 if Z > Za Step 5: Make a decision and interpret the result. Because 1.55 does not fall in the rejection region, H0 is not rejected. We conclude that the average number of desks assembled in the last 50 weeks is not more than 200
  • 20. 20 Type of Errors in Hypothesis Testing l Type I Error - – Defined as the probability of rejecting the null hypothesis when it is actually true. – This is denoted by the Greek letter “a” – Also known as the significance level of a test l Type II Error: – Defined as the probability of “accepting” the null hypothesis when it is actually false. – This is denoted by the Greek letter “β”
  • 21. 21 p-Value in Hypothesis Testing l p-VALUE is the probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true. l In testing a hypothesis, we can also compare the p- value to with the significance level (a). l If the p-value < significance level, H0 is rejected, else H0 is not rejected.
  • 22. 22 p-Value in Hypothesis Testing - Example Recall the last problem where the hypothesis and decision rules were set up as: H0: m ≤ 200 H1: m > 200 Reject H0 if Z > Za where Z = 1.55 and Za =2.33 Reject H0 if p-value < a 0.0606 is not < 0.01 Conclude: Fail to reject H0
  • 23. 23 What does it mean when p-value < a? (a) .10, we have some evidence that H0 is not true. (b) .05, we have strong evidence that H0 is not true. (c) .01, we have very strong evidence that H0 is not true. (d) .001, we have extremely strong evidence that H0 is not true.
  • 24. 24 Testing for the Population Mean: Population Standard Deviation Unknown l When the population standard deviation (σ) is unknown, the sample standard deviation (s) is used in its place l The t-distribution is used as test statistic, which is computed using the formula:
  • 25. 25 Testing for the Population Mean: Population Standard Deviation Unknown - Example The McFarland Insurance Company Claims Department reports the mean cost to process a claim is $60. An industry comparison showed this amount to be larger than most other insurance companies, so the company instituted cost-cutting measures. To evaluate the effect of the cost-cutting measures, the Supervisor of the Claims Department selected a random sample of 26 claims processed last month. The sample information is reported below. At the .01 significance level is it reasonable a claim is now less than $60?
  • 26. 26 Testing for a Population Mean with a Known Population Standard Deviation- Example Step 1: State the null hypothesis and the alternate hypothesis. H0: m ≥ $60 H1: m < $60 (note: keyword in the problem “now less than”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use t-distribution since σ is unknown
  • 28. 28 Testing for the Population Mean: Population Standard Deviation Unknown – Minitab Solution
  • 29. 29 Testing for a Population Mean with a Known Population Standard Deviation- Example Step 5: Make a decision and interpret the result. Because -1.818 does not fall in the rejection region, H0 is not rejected at the .01 significance level. We have not demonstrated that the cost-cutting measures reduced the mean cost per claim to less than $60. The difference of $3.58 ($56.42 - $60) between the sample mean and the population mean could be due to sampling error. Step 4: Formulate the decision rule. Reject H0 if t < -ta,n-1
  • 30. 30 The current rate for producing 5 amp fuses at Neary Electric Co. is 250 per hour. A new machine has been purchased and installed that, according to the supplier, will increase the production rate. A sample of 10 randomly selected hours from last month revealed the mean hourly production on the new machine was 256 units, with a sample standard deviation of 6 per hour. At the .05 significance level can Neary conclude that the new machine is faster? Testing for a Population Mean with an Unknown Population Standard Deviation- Example
  • 31. 31 Testing for a Population Mean with a Known Population Standard Deviation- Example continued Step 1: State the null and the alternate hypothesis. H0: µ ≤ 250; H1: µ > 250 Step 2: Select the level of significance. It is .05. Step 3: Find a test statistic. Use the t distribution because the population standard deviation is not known and the sample size is less than 30.
  • 32. 32 Testing for a Population Mean with a Known Population Standard Deviation- Example continued Step 4: State the decision rule. There are 10 – 1 = 9 degrees of freedom. The null hypothesis is rejected if t > 1.833. Step 5: Make a decision and interpret the results. The null hypothesis is rejected. The mean number produced is more than 250 per hour. 162 . 3 10 6 250 256 = - = - = n s X t m
  • 33. 33 Tests Concerning Proportion l A Proportion is the fraction or percentage that indicates the part of the population or sample having a particular trait of interest. l The sample proportion is denoted by p and is found by x/n l The test statistic is computed as follows:
  • 34. 34 Assumptions in Testing a Population Proportion using the z-Distribution l A random sample is chosen from the population. l It is assumed that the binomial assumptions discussed in Chapter 6 are met: (1) the sample data collected are the result of counts; (2) the outcome of an experiment is classified into one of two mutually exclusive categories—a “success” or a “failure”; (3) the probability of a success is the same for each trial; and (4) the trials are independent l The test we will conduct shortly is appropriate when both np and n(1- p ) are at least 5. l When the above conditions are met, the normal distribution can be used as an approximation to the binomial distribution
  • 35. 35 Test Statistic for Testing a Single Population Proportion n p z ) 1 ( p p p - - = Sample proportion Hypothesized population proportion Sample size
  • 36. 36 Test Statistic for Testing a Single Population Proportion - Example Suppose prior elections in a certain state indicated it is necessary for a candidate for governor to receive at least 80 percent of the vote in the northern section of the state to be elected. The incumbent governor is interested in assessing his chances of returning to office and plans to conduct a survey of 2,000 registered voters in the northern section of the state. Using the hypothesis-testing procedure, assess the governor’s chances of reelection.
  • 37. 37 Test Statistic for Testing a Single Population Proportion - Example Step 1: State the null hypothesis and the alternate hypothesis. H0: p ≥ .80 H1: p < .80 (note: keyword in the problem “at least”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use Z-distribution since the assumptions are met and np and n(1-p) ≥ 5
  • 38. 38 Testing for a Population Proportion - Example Step 5: Make a decision and interpret the result. The computed value of z (2.80) is in the rejection region, so the null hypothesis is rejected at the .05 level. The difference of 2.5 percentage points between the sample percent (77.5 percent) and the hypothesized population percent (80) is statistically significant. The evidence at this point does not support the claim that the incumbent governor will return to the governor’s mansion for another four years. Step 4: Formulate the decision rule. Reject H0 if Z <-Za
  • 39. 39 Type II Error l Recall Type I Error, the level of significance, denoted by the Greek letter “α”, is defined as the probability of rejecting the null hypothesis when it is actually true. l Type II Error, denoted by the Greek letter “β”, is defined as the probability of “accepting” the null hypothesis when it is actually false.
  • 40. 40 Type II Error - Example A manufacturer purchases steel bars to make cotter pins. Past experience indicates that the mean tensile strength of all incoming shipments is 10,000 psi and that the standard deviation, σ, is 400 psi. In order to make a decision about incoming shipments of steel bars, the manufacturer set up this rule for the quality-control inspector to follow: “Take a sample of 100 steel bars. At the .05 significance level if the sample mean strength falls between 9,922 psi and 10,078 psi, accept the lot. Otherwise the lot is to be rejected.”
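A minimal sketch of how β could be computed for this acceptance rule. The standard deviation, sample size, and acceptance limits come from the slide; the true mean of 9,900 psi is a hypothetical alternative chosen for illustration, since the slide with the worked computation is not reproduced in this transcript.

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 400, 100          # population standard deviation and sample size (from the slide)
lower, upper = 9922, 10078   # acceptance region for the sample mean (from the slide)
std_error = sigma / sqrt(n)  # standard error of the mean: 40 psi


def prob_accept(true_mean):
    """Probability that the sample mean falls inside the acceptance region."""
    dist = NormalDist(mu=true_mean, sigma=std_error)
    return dist.cdf(upper) - dist.cdf(lower)


# Type I error: the lot is rejected although the mean really is 10,000 psi
alpha = 1 - prob_accept(10000)

# Type II error for an ASSUMED true mean of 9,900 psi (hypothetical value)
beta_9900 = prob_accept(9900)

print(round(alpha, 4), round(beta_9900, 4))
```

Calling `prob_accept` with other hypothetical means shows how β shrinks as the true mean moves further from 10,000 psi.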
  • 41. 41 Type I and Type II Errors Illustrated
  • 42. 42 Type II Error Computed
  • 43. 43 Type II Errors For Varying Mean Levels
  • 45. Two-sample Tests of Hypothesis Chapter 11
  • 46. 2 GOALS l Conduct a test of a hypothesis about the difference between two independent population means. l Conduct a test of a hypothesis about the difference between two population proportions. l Conduct a test of a hypothesis about the mean difference between paired or dependent observations. l Understand the difference between dependent and independent samples.
  • 47. 3 Comparing two populations – Some Examples l Is there a difference in the mean value of residential real estate sold by male agents and female agents in south Florida? l Is there a difference in the mean number of defects produced on the day and the afternoon shifts at Kimble Products? l Is there a difference in the mean number of days absent between young workers (under 21 years of age) and older workers (more than 60 years of age) in the fast-food industry? l Is there a difference in the proportion of Ohio State University graduates and University of Cincinnati graduates who pass the state Certified Public Accountant Examination on their first attempt? l Is there an increase in the production rate if music is piped into the production area?
  • 48. 4 Comparing Two Population Means l No assumptions about the shape of the populations are required. l The samples are from independent populations. l The formula for computing the value of z is: Use if sample sizes > 30 or if σ1 and σ2 are known: $z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$ Use if sample sizes > 30 and if σ1 and σ2 are unknown: $z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
  • 49. 5 EXAMPLE 1 The U-Scan facility was recently installed at the Byrne Road Food-Town location. The store manager would like to know if the mean checkout time using the standard checkout method is longer than using the U-Scan. She gathered the following sample information. The time is measured from when the customer enters the line until their bags are in the cart. Hence the time includes both waiting in line and checking out.
  • 50. 6 EXAMPLE 1 continued Step 1: State the null and alternate hypotheses. H0: µS ≤ µU H1: µS > µU Step 2: State the level of significance. The .01 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because both samples are more than 30, we can use z-distribution as the test statistic.
  • 51. 7 Example 1 continued Step 4: State the decision rule. Reject H0 if $z > z_{\alpha}$, that is, z > 2.33.
  • 52. 8 Example 1 continued Step 5: Compute the value of z and make a decision $z = \dfrac{\bar{X}_s - \bar{X}_u}{\sqrt{\dfrac{\sigma_s^2}{n_s} + \dfrac{\sigma_u^2}{n_u}}} = \dfrac{5.5 - 5.3}{\sqrt{\dfrac{0.40^2}{50} + \dfrac{0.30^2}{100}}} = \dfrac{0.2}{0.064} = 3.13$ The computed value of 3.13 is larger than the critical value of 2.33. Our decision is to reject the null hypothesis. The difference of .20 minutes between the mean checkout time using the standard method and the mean checkout time using U-Scan is too large to have occurred by chance. We conclude the U-Scan method is faster.
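The Step 5 computation can be reproduced with a few lines of code. This is a sketch using the summary figures that appear in the formula above (means 5.5 and 5.3 minutes, standard deviations 0.40 and 0.30, sample sizes 50 and 100); the function name `two_sample_z` is illustrative.

```python
from math import sqrt


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent means."""
    std_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / std_error


# Standard checkout (s) versus U-Scan (u), values taken from the slide
z = two_sample_z(5.5, 0.40, 50, 5.3, 0.30, 100)

critical_value = 2.33  # one-tailed critical value at the .01 level (from Step 4)
reject_h0 = z > critical_value
print(round(z, 2), reject_h0)
```

Without intermediate rounding the statistic comes out near 3.12; the slide's 3.13 follows from rounding the standard error to 0.064 before dividing. The decision is the same either way.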
  • 53. 9 Two-Sample Tests about Proportions Here are several examples. l The vice president of human resources wishes to know whether there is a difference in the proportion of hourly employees who miss more than 5 days of work per year at the Atlanta and the Houston plants. l General Motors is considering a new design for the Pontiac Grand Am. The design is shown to a group of potential buyers under 30 years of age and another group over 60 years of age. Pontiac wishes to know whether there is a difference in the proportion of the two groups who like the new design. l A consultant to the airline industry is investigating the fear of flying among adults. Specifically, the company wishes to know whether there is a difference in the proportion of men versus women who are fearful of flying.
  • 54. 10 Two Sample Tests of Proportions l We investigate whether two samples came from populations with an equal proportion of successes. l The two samples are pooled using the following formula.
  • 55. 11 Two Sample Tests of Proportions continued The value of the test statistic is computed from the following formula.
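The two formulas referred to on these slides did not survive in the transcript. The standard forms are the pooled proportion $p_c = (X_1 + X_2)/(n_1 + n_2)$ and the statistic $z = (p_1 - p_2)/\sqrt{p_c(1 - p_c)/n_1 + p_c(1 - p_c)/n_2}$. The sketch below applies them; the counts (19 of 100 and 62 of 200) are assumed values for illustration only and are not given in this transcript.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled proportion and z statistic for comparing two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(p_pooled * (1 - p_pooled) / n1 + p_pooled * (1 - p_pooled) / n2)
    return p_pooled, (p1 - p2) / std_error


# ASSUMED counts of "successes" in each sample (hypothetical data)
p_pooled, z = two_proportion_z(19, 100, 62, 200)

# Two-tailed decision at the .05 level: reject if |z| > 1.96
reject_h0 = abs(z) > 1.96
print(round(p_pooled, 2), round(z, 2), reject_h0)
```

The sign of z depends only on which sample is labeled 1, so a two-tailed test compares the absolute value with the critical value.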
  • 56. 12 Manelli Perfume Company recently developed a new fragrance that it plans to market under the name Heavenly. A number of market studies indicate that Heavenly has very good market potential. The Sales Department at Manelli is particularly interested in whether there is a difference in the proportions of younger and older women who would purchase Heavenly if it were marketed. There are two independent populations, a population consisting of the younger women and a population consisting of the older women. Each sampled woman will be asked to smell Heavenly and indicate whether she likes the fragrance well enough to purchase a bottle. Two Sample Tests of Proportions - Example
  • 57. 13 Two Sample Tests of Proportions - Example Step 1: State the null and alternate hypotheses. H0: π1 = π2 H1: π1 ≠ π2 Step 2: State the level of significance. The .05 significance level is stated in the problem. Step 3: Find the appropriate test statistic. We will use the z-distribution.
  • 58. 14 Two Sample Tests of Proportions - Example Step 4: State the decision rule. Reject H0 if $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$, that is, z > 1.96 or z < −1.96.
  • 59. 15 Step 5: Compute the value of z and make a decision The computed value of 2.21 is in the area of rejection. Therefore, the null hypothesis is rejected at the .05 significance level. To put it another way, we reject the null hypothesis that the proportion of young women who would purchase Heavenly is equal to the proportion of older women who would purchase Heavenly. Two Sample Tests of Proportions - Example
  • 60. 16 Two Sample Tests of Proportions – Example (Minitab Solution)
  • 61. 17 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) The t distribution is used as the test statistic if one or more of the samples have fewer than 30 observations. The required assumptions are: 1. Both populations must follow the normal distribution. 2. The populations must have equal standard deviations. 3. The samples are from independent populations.
  • 62. 18 Small sample test of means continued Finding the value of the test statistic requires two steps. 1. Pool the sample standard deviations: $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ 2. Use the pooled variance in the formula: $t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
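The two steps can be sketched in code as follows. The mounting times used here (five observations for procedure 1, six for procedure 2) are assumed values for illustration, because the data table from the example slide is not reproduced in this transcript.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Step 1: pool the sample variances, weighting each by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    # Step 2: use the pooled variance in the t formula
    t = (mean(sample1) - mean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# ASSUMED mounting times in minutes (hypothetical data)
procedure_1 = [2, 4, 9, 3, 2]
procedure_2 = [3, 7, 5, 8, 4, 3]

t, df = pooled_t(procedure_1, procedure_2)
print(round(t, 3), df)
```

With 5 + 6 − 2 = 9 degrees of freedom, the result would be compared with the two-tailed critical values for the chosen significance level.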
  • 63. 19 Owens Lawn Care, Inc., manufactures and assembles lawnmowers that are shipped to dealers throughout the United States and Canada. Two different procedures have been proposed for mounting the engine on the frame of the lawnmower. The question is: Is there a difference in the mean time to mount the engines on the frames of the lawnmowers? The first procedure was developed by longtime Owens employee Herb Welles (designated as procedure 1), and the other procedure was developed by Owens Vice President of Engineering William Atkins (designated as procedure 2). To evaluate the two methods, it was decided to conduct a time and motion study. A sample of five employees was timed using the Welles method and six using the Atkins method. The results, in minutes, are shown on the right. Is there a difference in the mean mounting times? Use the .10 significance level. Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test)
  • 64. 20 Step 1: State the null and alternate hypotheses. H0: µ1 = µ2 H1: µ1 ≠ µ2 Step 2: State the level of significance. The .10 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because the population standard deviations are not known but are assumed to be equal, we use the pooled t-test. Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example
  • 65. 21 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Step 4: State the decision rule. Reject H0 if $t > t_{\alpha/2,\,n_1+n_2-2}$ or $t < -t_{\alpha/2,\,n_1+n_2-2}$; $t > t_{.05,\,9}$ or $t < -t_{.05,\,9}$; t > 1.833 or t < −1.833.
  • 66. 22 Step 5: Compute the value of t and make a decision (a) Calculate the sample standard deviations Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example
  • 67. 23 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Step 5: Compute the value of t and make a decision. The computed value of t is −0.662. The decision is not to reject the null hypothesis, because −0.662 falls in the region between −1.833 and 1.833. We conclude that there is no difference in the mean times to mount the engine on the frame using the two methods.
  • 68. 24 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example
  • 69. 25 Comparing Population Means with Unequal Population Standard Deviations If it is not reasonable to assume the population standard deviations are equal, then we compute the t-statistic shown on the right. The sample standard deviations s1 and s2 are used in place of the respective population standard deviations. In addition, the degrees of freedom are adjusted downward by a rather complex approximation formula. The effect is to reduce the number of degrees of freedom in the test, which will require a larger value of the test statistic to reject the null hypothesis.
  • 70. 26 Comparing Population Means with Unequal Population Standard Deviations - Example Personnel in a consumer testing laboratory are evaluating the absorbency of paper towels. They wish to compare a set of store brand towels to a similar group of name brand ones. For each brand they dip a ply of the paper into a tub of fluid, allow the paper to drain back into the vat for two minutes, and then evaluate the amount of liquid the paper has taken up from the vat. A random sample of 9 store brand paper towels absorbed the following amounts of liquid in milliliters. 8 8 3 1 9 7 5 5 12 An independent random sample of 12 name brand towels absorbed the following amounts of liquid in milliliters: 12 11 10 6 8 9 9 10 11 9 8 10 Use the .10 significance level and test if there is a difference in the mean amount of liquid absorbed by the two types of paper towels.
  • 71. 27 The following dot plot provided by MINITAB shows the variances to be unequal. Comparing Population Means with Unequal Population Standard Deviations - Example
  • 72. 28 Comparing Population Means with Unequal Population Standard Deviations - Example Step 1: State the null and alternate hypotheses. H0: µ1 = µ2 H1: µ1 ≠ µ2 Step 2: State the level of significance. The .10 significance level is stated in the problem. Step 3: Find the appropriate test statistic. We will use the unequal-variances t-test.
  • 73. 29 Comparing Population Means with Unequal Population Standard Deviations - Example Step 4: State the decision rule. Reject H0 if $t > t_{\alpha/2,\,d.f.}$ or $t < -t_{\alpha/2,\,d.f.}$; $t > t_{.05,\,10}$ or $t < -t_{.05,\,10}$; t > 1.812 or t < −1.812. Step 5: Compute the value of t and make a decision. The computed value of t is less than the lower critical value, so our decision is to reject the null hypothesis. We conclude that the mean absorption rate for the two towels is not the same.
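The statistic and the adjusted degrees of freedom for the paper towel data can be sketched as below. The data are the absorption amounts listed in the example; the formulas are the usual unequal-variances form, $t = (\bar{X}_1 - \bar{X}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$, with the Welch-Satterthwaite approximation for the degrees of freedom, which is assumed here to be the "rather complex approximation formula" the slides mention.

```python
from math import sqrt
from statistics import mean, variance


def unequal_variance_t(sample1, sample2):
    """t statistic and approximate degrees of freedom without pooling variances."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1) / n1, variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Absorption in milliliters, as listed in the example
store_brand = [8, 8, 3, 1, 9, 7, 5, 5, 12]
name_brand = [12, 11, 10, 6, 8, 9, 9, 10, 11, 9, 8, 10]

t, df = unequal_variance_t(store_brand, name_brand)
df_rounded_down = int(df)  # rounding down gives the 10 d.f. used in Step 4

reject_h0 = abs(t) > 1.812  # two-tailed critical value from Step 4
print(round(t, 2), round(df, 2), df_rounded_down, reject_h0)
```

Rounding the degrees of freedom down rather than to the nearest integer is the conservative choice, since fewer degrees of freedom give a larger critical value.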
  • 75. 31 Two-Sample Tests of Hypothesis: Dependent Samples Dependent samples are samples that are paired or related in some fashion. For example: – If you wished to buy a car you would look at the same car at two (or more) different dealerships and compare the prices. – If you wished to measure the effectiveness of a new diet you would weigh the dieters at the start and at the finish of the program.
  • 76. 32 Hypothesis Testing Involving Paired Observations Use the following test when the samples are dependent: $t = \dfrac{\bar{d}}{s_d/\sqrt{n}}$ where $\bar{d}$ is the mean of the differences, $s_d$ is the standard deviation of the differences, and n is the number of pairs (differences).
  • 77. 33 Nickel Savings and Loan wishes to compare the two companies it uses to appraise the value of residential homes. Nickel Savings selected a sample of 10 residential properties and scheduled both firms for an appraisal. The results, reported in $000, are shown on the table (right). At the .05 significance level, can we conclude there is a difference in the mean appraised values of the homes? Hypothesis Testing Involving Paired Observations - Example
  • 78. 34 Hypothesis Testing Involving Paired Observations - Example Step 1: State the null and alternate hypotheses. H0: µd = 0 H1: µd ≠ 0 Step 2: State the level of significance. The .05 significance level is stated in the problem. Step 3: Find the appropriate test statistic. We will use the t-test.
  • 79. 35 Hypothesis Testing Involving Paired Observations - Example Step 4: State the decision rule. Reject H0 if $t > t_{\alpha/2,\,n-1}$ or $t < -t_{\alpha/2,\,n-1}$; $t > t_{.025,\,9}$ or $t < -t_{.025,\,9}$; t > 2.262 or t < −2.262.
  • 80. 36 Step 5: Compute the value of t and make a decision The computed value of t is greater than the higher critical value, so our decision is to reject the null hypothesis. We conclude that there is a difference in the mean appraised values of the homes. Hypothesis Testing Involving Paired Observations - Example
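A short sketch of the paired computation. The appraisal table is not reproduced in this transcript, so the ten pairs of values below (in $000) are assumed for illustration, and the list names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# ASSUMED appraisals of the same 10 homes by two firms, in $000 (hypothetical data)
firm_a = [235, 210, 231, 242, 205, 230, 231, 210, 225, 249]
firm_b = [228, 205, 219, 240, 198, 223, 227, 215, 222, 245]

# The test works on the differences within each pair, not on the raw values
differences = [a - b for a, b in zip(firm_a, firm_b)]

n = len(differences)
t = mean(differences) / (stdev(differences) / sqrt(n))

reject_h0 = abs(t) > 2.262  # two-tailed critical value with n - 1 = 9 d.f. (from Step 4)
print(round(t, 3), reject_h0)
```

Pairing removes the home-to-home variation from the comparison, which is why the differences are analyzed as a single sample.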
  • 81. 37 Hypothesis Testing Involving Paired Observations – Excel Example
  • 83. Analysis of Variance Chapter 12
  • 84. 2 GOALS l List the characteristics of the F distribution. l Conduct a test of hypothesis to determine whether the variances of two populations are equal. l Discuss the general idea of analysis of variance. l Organize data into a one-way and a two-way ANOVA table. l Conduct a test of hypothesis among three or more treatment means. l Develop confidence intervals for the difference in treatment means. l Conduct a test of hypothesis among treatment means using a blocking variable. l Conduct a two-way ANOVA with interaction.
  • 85. 3 Characteristics of F-Distribution l There is a “family” of F Distributions. l Each member of the family is determined by two parameters: the numerator degrees of freedom and the denominator degrees of freedom. l F cannot be negative, and it is a continuous distribution. l The F distribution is positively skewed. l Its values range from 0 to ∞. l As F → ∞ the curve approaches the X-axis.
  • 86. 4 Comparing Two Population Variances The F distribution is used to test the hypothesis that the variance of one normal population equals the variance of another normal population. The following examples will show the use of the test: l Two Barth shearing machines are set to produce steel bars of the same length. The bars, therefore, should have the same mean length. We want to ensure that in addition to having the same mean length they also have similar variation. l The mean rate of return on two types of common stock may be the same, but there may be more variation in the rate of return in one than the other. A sample of 10 technology and 10 utility stocks shows the same mean rate of return, but there is likely more variation in the technology stocks. l A study by the marketing department for a large newspaper found that men and women spent about the same amount of time per day reading the paper. However, the same report indicated there was nearly twice as much variation in time spent per day among the men than the women.
  • 87. 5 Test for Equal Variances
  • 88. 6 Test for Equal Variances - Example Lammers Limos offers limousine service from the city hall in Toledo, Ohio, to Metro Airport in Detroit. Sean Lammers, president of the company, is considering two routes. One is via U.S. 25 and the other via I-75. He wants to study the time it takes to drive to the airport using each route and then compare the results. He collected the following sample data, which is reported in minutes. Using the .10 significance level, is there a difference in the variation in the driving times for the two routes?
  • 89. 7 Test for Equal Variances - Example Step 1: The hypotheses are: H0: σ1² = σ2² H1: σ1² ≠ σ2² Step 2: The significance level is .10. Step 3: The test statistic is the F distribution.
  • 90. 8 Test for Equal Variances - Example Step 4: State the decision rule. Reject H0 if $F > F_{\alpha/2,\,\nu_1,\,\nu_2}$; $F > F_{.10/2,\,7-1,\,8-1}$; $F > F_{.05,\,6,\,7}$; F > 3.87.
  • 91. 9 The decision is to reject the null hypothesis, because the computed F value (4.23) is larger than the critical value (3.87). We conclude that there is a difference in the variation of the travel times along the two routes. Step 5: Compute the value of F and make a decision Test for Equal Variances - Example
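The test statistic is the ratio of the two sample variances, with the larger one in the numerator. A minimal sketch follows; the driving times are assumed values for illustration (seven trips on one route, eight on the other, matching the degrees of freedom in Step 4), because the data table is not reproduced in this transcript.

```python
from statistics import variance

# ASSUMED driving times in minutes (hypothetical data)
route_1 = [52, 67, 56, 45, 70, 54, 64]
route_2 = [59, 60, 61, 51, 56, 63, 57, 65]

var_1, var_2 = variance(route_1), variance(route_2)

# Put the larger sample variance on top so that F is at least 1
f_stat = max(var_1, var_2) / min(var_1, var_2)

critical_value = 3.87  # critical value quoted on the slide
reject_h0 = f_stat > critical_value
print(round(f_stat, 2), reject_h0)
```

Because the larger variance is always placed in the numerator, only the upper tail of the F distribution is needed, which is why the two-tailed significance level is halved when looking up the critical value.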
  • 92. 10 Test for Equal Variances – Excel Example
  • 93. 11 Comparing Means of Two or More Populations l The F distribution is also used for testing whether two or more sample means came from the same or equal populations. l Assumptions: – The sampled populations follow the normal distribution. – The populations have equal standard deviations. – The samples are randomly selected and are independent.
  • 94. 12 Comparing Means of Two or More Populations l The Null Hypothesis is that the population means are the same. The Alternative Hypothesis is that at least one of the means is different. l The Test Statistic is the F distribution. l The Decision rule is to reject the null hypothesis if F (computed) is greater than F (table) with numerator and denominator degrees of freedom. l Hypothesis Setup and Decision Rule: H0: µ1 = µ2 = … = µk H1: The means are not all equal. Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$
  • 95. 13 Analysis of Variance – F statistic l If there are k populations being sampled, the numerator degrees of freedom is k – 1. l If there are a total of n observations the denominator degrees of freedom is n – k. l The test statistic is computed by: $F = \dfrac{SST/(k - 1)}{SSE/(n - k)}$
  • 96. 14 Joyce Kuhlman manages a regional financial center. She wishes to compare the productivity, as measured by the number of customers served, among three employees. Four days are randomly selected and the number of customers served by each employee is recorded. The results are: Comparing Means of Two or More Populations – Illustrative Example
  • 97. 15 Comparing Means of Two or More Populations – Illustrative Example
  • 98. 16 Recently a group of four major carriers joined in hiring Brunner Marketing Research, Inc., to survey recent passengers regarding their level of satisfaction with a recent flight. The survey included questions on ticketing, boarding, in-flight service, baggage handling, pilot communication, and so forth. Twenty-five questions offered a range of possible answers: excellent, good, fair, or poor. A response of excellent was given a score of 4, good a 3, fair a 2, and poor a 1. These responses were then totaled, so the total score was an indication of the satisfaction with the flight. Brunner Marketing Research, Inc., randomly selected and surveyed passengers from the four airlines. Comparing Means of Two or More Populations – Example Is there a difference in the mean satisfaction level among the four airlines? Use the .01 significance level.
  • 99. 17 Comparing Means of Two or More Populations – Example Step 1: State the null and alternate hypotheses. H0: µE = µA = µT = µO H1: The means are not all equal. Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$ Step 2: State the level of significance. The .01 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because we are comparing means of more than two groups, use the F statistic.
  • 100. 18 Comparing Means of Two or More Populations – Example Step 4: State the decision rule. Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$; $F > F_{.01,\,4-1,\,22-4}$; $F > F_{.01,\,3,\,18}$; F > 5.09.
  • 101. 19 Step 5: Compute the value of F and make a decision Comparing Means of Two or More Populations – Example
  • 102. 20 Comparing Means of Two or More Populations – Example
  • 104. 22 Computing SST The computed value of F is 8.99, which is greater than the critical value of 5.09, so the null hypothesis is rejected. Conclusion: The population means are not all equal. The mean scores are not the same for the four airlines; at this point we can only conclude there is a difference in the treatment means. We cannot determine which treatment groups differ or how many treatment groups differ.
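The F computation for a one-way ANOVA can be sketched as follows. SST here is the treatment sum of squares and SSE the error sum of squares, as on the slides. The satisfaction scores are assumed values for illustration (22 passengers across four airlines), since the survey table is not reproduced in this transcript; the dictionary keys simply mirror the subscripts used in the hypotheses.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of samples."""
    all_values = [x for group in groups for x in group]
    n, k = len(all_values), len(groups)
    grand_mean = mean(all_values)
    # SST: variation of the treatment means around the grand mean
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SSE: variation of the observations around their own treatment mean
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (sst / (k - 1)) / (sse / (n - k))


# ASSUMED satisfaction scores per airline (hypothetical data)
scores = {
    "E": [94, 90, 85, 80],
    "T": [75, 68, 77, 83, 88],
    "A": [70, 73, 76, 78, 80, 68, 65],
    "O": [68, 70, 72, 65, 74, 65],
}

f_stat = one_way_anova_f(list(scores.values()))
reject_h0 = f_stat > 5.09  # critical value quoted on the slide
print(round(f_stat, 2), reject_h0)
```

A large F means the treatment means differ by more than the within-group scatter would explain, but it does not say which pairs of means differ.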
  • 105. 23 Inferences About Treatment Means l When we reject the null hypothesis that the means are equal, we may want to know which treatment means differ. l One of the simplest procedures is through the use of confidence intervals.
  • 106. 24 Confidence Interval for the Difference Between Two Means $(\bar{X}_1 - \bar{X}_2) \pm t\sqrt{MSE\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ l where t is obtained from the t table with degrees of freedom (n - k). l MSE = [SSE/(n - k)]
  • 107. 25 Confidence Interval for the Difference Between Two Means - Example From the previous example, develop a 95% confidence interval for the difference in the mean rating for Eastern and Ozark. Can we conclude that there is a difference between the two airlines’ ratings? The 95 percent confidence interval ranges from 10.46 up to 26.04. Both endpoints are positive; hence, we can conclude these treatment means differ significantly. That is, passengers on Eastern rated service significantly differently from those on Ozark.
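A sketch of the interval calculation. The sample means, sample sizes, MSE, and t value below are assumed summary figures for illustration (they are not given in this transcript); with these inputs the interval matches the endpoints quoted on the slide.

```python
from math import sqrt


def mean_difference_interval(mean1, n1, mean2, n2, mse, t_value):
    """Confidence interval for the difference between two treatment means."""
    half_width = t_value * sqrt(mse * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - half_width, diff + half_width


# ASSUMED summary values (hypothetical): treatment means, sample sizes,
# mean square error, and the t value for 95% confidence with n - k = 18 d.f.
low, high = mean_difference_interval(87.25, 4, 69.0, 6, mse=33.0, t_value=2.101)

# If zero lies outside the interval, the two treatment means differ significantly
means_differ = low > 0 or high < 0
print(round(low, 2), round(high, 2), means_differ)
```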
  • 110. 28 Two-Way Analysis of Variance l For the two-factor ANOVA we test whether there is a significant difference in the treatment effect and whether there is a difference in the blocking effect. Let $B_r$ be the block totals (r for rows). l Let SSB represent the sum of squares for the blocks, where: $SSB = \sum\left[\dfrac{B_r^2}{k}\right] - \dfrac{(\sum X)^2}{n}$
  • 111. 29 WARTA, the Warren Area Regional Transit Authority, is expanding bus service from the suburb of Starbrick into the central business district of Warren. There are four routes being considered from Starbrick to downtown Warren: (1) via U.S. 6, (2) via the West End, (3) via the Hickory Street Bridge, and (4) via Route 59. WARTA conducted several tests to determine whether there was a difference in the mean travel times along the four routes. Because there will be many different drivers, the test was set up so each driver drove along each of the four routes. Next slide shows the travel time, in minutes, for each driver-route combination. At the .05 significance level, is there a difference in the mean travel time along the four routes? If we remove the effect of the drivers, is there a difference in the mean travel time? Two-Way Analysis of Variance - Example
  • 112. 30 Two-Way Analysis of Variance - Example
  • 113. 31 Two-Way Analysis of Variance - Example Step 1: State the null and alternate hypotheses. H0: µu = µw = µh = µr H1: The means are not all equal. Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$ Step 2: State the level of significance. The .05 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because we are comparing means of more than two groups, use the F statistic.
  • 114. 32 Two-Way Analysis of Variance - Example Step 4: State the decision rule. Reject H0 if $F > F_{\alpha,\,\nu_1,\,\nu_2}$; $F > F_{.05,\,k-1,\,n-k}$; $F > F_{.05,\,4-1,\,20-4}$; $F > F_{.05,\,3,\,16}$; F > 3.24.
  • 115. 33
  • 116. 34
  • 117. 35 Two-Way Analysis of Variance – Excel Example Using Excel to perform the calculations. The computed value of F is 2.482, which is less than the critical value of 3.24, so our decision is to not reject the null hypothesis. We conclude there is no difference in the mean travel time along the four routes. There is no reason to select one of the routes as faster than the others.
  • 118. 36 Two-Way ANOVA with Interaction Interaction occurs if the combination of two factors has some effect on the variable under study, in addition to each factor alone. We refer to the variable being studied as the response variable. An everyday illustration of interaction is the effect of diet and exercise on weight. It is generally agreed that a person’s weight (the response variable) can be controlled with two factors, diet and exercise. Research shows that weight is affected by diet alone and that weight is affected by exercise alone. However, the general recommended method to control weight is based on the combined or interaction effect of diet and exercise.
  • 119. 37 Graphical Observation of Mean Times Our graphical observations show us that interaction effects are possible. The next step is to conduct statistical tests of hypothesis to further investigate the possible interaction effects. In summary, our study of travel times has several questions: l Is there really an interaction between routes and drivers? l Are the travel times for the drivers the same? l Are the travel times for the routes the same? Of the three questions, we are most interested in the test for interactions. To put it another way, does a particular route/driver combination result in significantly faster (or slower) driving times? Also, the results of the hypothesis test for interaction affect the way we analyze the route and driver questions.
  • 120. 38 Interaction Effect l We can investigate these questions statistically by extending the two-way ANOVA procedure presented in the previous section. We add another source of variation, namely, the interaction. l In order to estimate the “error” sum of squares, we need at least two measurements for each driver/route combination. l As an example, suppose the experiment presented earlier is repeated by measuring two more travel times for each driver and route combination. That is, we replicate the experiment. Now we have three observations for each driver/route combination. l Using the mean of three travel times for each driver/route combination we get a more reliable measure of the mean travel time.
  • 121. 39 Example – ANOVA with Replication
  • 122. 40 Three Tests in ANOVA with Replication The ANOVA now has three sets of hypotheses to test: 1. H0: There is no interaction between drivers and routes. H1: There is interaction between drivers and routes. 2. H0: The driver means are the same. H1: The driver means are not the same. 3. H0: The route means are the same. H1: The route means are not the same.
  • 125. 43
  • 127. Linear Regression and Correlation Chapter 13
  • 128. 2 GOALS l Understand and interpret the terms dependent and independent variable. l Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate. l Conduct a test of hypothesis to determine whether the coefficient of correlation in the population is zero. l Calculate the least squares regression line. l Construct and interpret confidence and prediction intervals for the dependent variable.
  • 129. 3 Regression Analysis - Introduction l Recall in Chapter 4 the idea of showing the relationship between two variables with a scatter diagram was introduced. l In that case we showed that, as the age of the buyer increased, the amount spent for the vehicle also increased. l In this chapter we carry this idea further. Numerical measures to express the strength of relationship between two variables are developed. l In addition, an equation is used to express the relationship between variables, allowing us to estimate one variable on the basis of another.
  • 130. 4 Regression Analysis - Uses Some examples. l Is there a relationship between the amount Healthtex spends per month on advertising and its sales in the month? l Can we base an estimate of the cost to heat a home in January on the number of square feet in the home? l Is there a relationship between the miles per gallon achieved by large pickup trucks and the size of the engine? l Is there a relationship between the number of hours that students studied for an exam and the score earned?
  • 131. 5 Correlation Analysis l Correlation Analysis is the study of the relationship between variables. It is also defined as a group of techniques to measure the association between two variables. l A Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis. – The Dependent Variable is the variable being predicted or estimated. – The Independent Variable provides the basis for estimation. It is the predictor variable.
  • 132. 6 Regression Example The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.
  • 134. 8 The Coefficient of Correlation, r The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables. It requires interval or ratio-scaled data. l It can range from -1.00 to 1.00. l Values of -1.00 or 1.00 indicate perfect and strong correlation. l Values close to 0.0 indicate weak correlation. l Negative values indicate an inverse relationship and positive values indicate a direct relationship.
  • 137. 11 Correlation Coefficient - Interpretation
  • 139. 13 Coefficient of Determination The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X). It is the square of the coefficient of correlation. l It ranges from 0 to 1. l It does not give any information on the direction of the relationship between the variables.
  • 140. 14 Correlation Coefficient - Example Using the Copier Sales of America data, for which a scatterplot was developed earlier, compute the correlation coefficient and coefficient of determination.
  • 143. 17 How do we interpret a correlation of 0.759? First, it is positive, so we see there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong. However, does this mean that more sales calls cause more sales? No, we have not demonstrated cause and effect here, only that the two variables—sales calls and copiers sold—are related. Correlation Coefficient - Example
  • 144. 18 Coefficient of Determination (r²) - Example • The coefficient of determination, r², is 0.576, found by (0.759)². • This is a proportion or a percent; we can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.
  • 145. 19 Testing the Significance of the Correlation Coefficient H0: ρ = 0 (the correlation in the population is 0) H1: ρ ≠ 0 (the correlation in the population is not 0) Reject H0 if: $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$
  • 146. 20 Testing the Significance of the Correlation Coefficient - Example H0: ρ = 0 (the correlation in the population is 0) H1: ρ ≠ 0 (the correlation in the population is not 0) Reject H0 if: $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$; $t > t_{0.025,\,8}$ or $t < -t_{0.025,\,8}$; t > 2.306 or t < −2.306
  • 147. 21 Testing the Significance of the Correlation Coefficient - Example The computed t (3.297) is within the rejection region, therefore, we will reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople.
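The computed t of 3.297 can be reproduced from r and n. The sketch below assumes the usual test statistic for a correlation coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²), which is not printed in this transcript; the function name `correlation_t` is illustrative only.

```python
import math

def correlation_t(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom (assumed formula)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.759, 10      # values from the Copier Sales example
t_crit = 2.306        # t(0.025, 8), as quoted on the slide
t = correlation_t(r, n)
reject_h0 = abs(t) > t_crit
print(round(t, 3), reject_h0)
```

Comparing |t| with the critical value expresses the two-tailed decision rule from the slide in a single condition.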
  • 150. 24 Computing the Slope of the Line
  • 152. 26 Regression Analysis In regression analysis we use the independent variable (X) to estimate the dependent variable (Y). • The relationship between the variables is linear. • Both variables must be at least interval scale. • The least squares criterion is used to determine the equation.
  • 153. 27 Regression Analysis – Least Squares Principle • The least squares principle is used to obtain a and b. • The equations to determine a and b are: $b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$ and $a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$
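The two least squares equations translate directly into code. This is a minimal sketch; the function name `least_squares` and the small data set are hypothetical (the Copier Sales data table is not reproduced in this transcript).

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y-hat = a + bX using the least squares equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X, so the fit should recover a = 1, b = 2
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```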
  • 154. 28 Illustration of the Least Squares Regression Principle
  • 155. 29 Regression Equation - Example Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?
  • 156. 30 Finding the Regression Equation - Example The regression equation is: $\hat{Y} = a + bX = 18.9476 + 1.1842X$. For a representative who made 20 calls: $\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$
  • 157. 31 Computing the Estimates of Y Step 1 – Using the regression equation, substitute the value of each X to solve for the estimated sales. Tom Keller: $\hat{Y} = 18.9476 + 1.1842X = 18.9476 + 1.1842(20) = 42.6316$. Soni Jones: $\hat{Y} = 18.9476 + 1.1842X = 18.9476 + 1.1842(30) = 54.4736$
  • 158. 32 Plotting the Estimated and the Actual Y’s
  • 159. 33 The Standard Error of Estimate • The standard error of estimate measures the scatter, or dispersion, of the observed values around the line of regression. • The formulas that are used to compute the standard error: $s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$ and $s_{y \cdot x} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}}$
  • 160. 34 Standard Error of the Estimate - Example Recall the example involving Copier Sales of America. The sales manager determined the least squares regression equation is given below. Determine the standard error of estimate as a measure of how well the values fit the regression line. $\hat{Y} = 18.9476 + 1.1842X$ and $s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901$
  • 161. 35 Graphical Illustration of the Differences between Actual Y and Estimated Y, $(Y - \hat{Y})$
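As a quick check of the arithmetic, the standard error of estimate can be computed from the sum of squared residuals quoted on the slide (784.211) and n = 10. The function name is illustrative.

```python
import math

def standard_error_of_estimate(sse: float, n: int) -> float:
    """sqrt( sum (Y - Y-hat)^2 / (n - 2) ) for simple linear regression."""
    return math.sqrt(sse / (n - 2))

s_yx = standard_error_of_estimate(784.211, 10)
print(round(s_yx, 3))
```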
  • 162. 36 Standard Error of the Estimate - Excel
  • 163. 37 Assumptions Underlying Linear Regression • For each value of X, there is a group of Y values, and these Y values are normally distributed. • The means of these normal distributions of Y values all lie on the straight line of regression. • The standard deviations of these normal distributions are equal. • The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.
  • 164. 38 Confidence Interval and Prediction Interval Estimates of Y •A confidence interval reports an interval estimate for the mean value of Y for a given X. •A prediction interval reports the range of values of an individual Y for a particular value of X.
  • 165. 39 Confidence Interval Estimate - Example We return to the Copier Sales of America illustration. Determine a 95 percent confidence interval for all sales representatives who make 25 calls.
  • 166. 40 Confidence Interval Estimate - Example Step 1 – Compute the point estimate of Y. In other words, determine the number of copiers we expect a sales representative to sell if he or she makes 25 calls. The regression equation is: $\hat{Y} = 18.9476 + 1.1842X$, so $\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$
  • 167. 41 Confidence Interval Estimate - Example Step 2 – Find the value of t • To find the t value, we need to first know the number of degrees of freedom. In this case the degrees of freedom is n - 2 = 10 – 2 = 8. • We set the confidence level at 95 percent. To find the value of t, move down the left-hand column of Appendix B.2 to 8 degrees of freedom, then move across to the column with the 95 percent level of confidence. • The value of t is 2.306.
  • 169. 43 Confidence Interval Estimate - Example Step 4 – Use the formula above by substituting the numbers computed in previous slides Thus, the 95 percent confidence interval for the average sales of all sales representatives who make 25 calls is from 40.9170 up to 56.1882 copiers.
  • 170. 44 Prediction Interval Estimate - Example We return to the Copier Sales of America illustration. Determine a 95 percent prediction interval for Sheila Baker, a West Coast sales representative who made 25 calls.
  • 171. 45 Prediction Interval Estimate - Example Step 1 – Compute the point estimate of Y. In other words, determine the number of copiers we expect a sales representative to sell if he or she makes 25 calls. The regression equation is: $\hat{Y} = 18.9476 + 1.1842X$, so $\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$
  • 172. 46 Step 2 – Using the information computed earlier in the confidence interval estimation example, use the formula above. Prediction Interval Estimate - Example If Sheila Baker makes 25 sales calls, the number of copiers she will sell will be between about 24 and 73 copiers.
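The interval formulas themselves are not reproduced in this transcript. The sketch below assumes the standard forms, Ŷ ± t·s·sqrt(1/n + (X − X̄)²/Σ(X − X̄)²) for the confidence interval and the same expression with an extra 1 under the square root for the prediction interval. The values X̄ = 22 and Σ(X − X̄)² = 760 are assumptions (they do not appear in this section) and are used only because they are consistent with the intervals reported on the slides.

```python
import math

def interval(y_hat, t, s_yx, n, x, x_mean, ssx, prediction=False):
    """Confidence interval for the mean of Y, or prediction interval for one Y."""
    core = 1 / n + (x - x_mean) ** 2 / ssx
    if prediction:
        core += 1          # extra term for the scatter of an individual observation
    half_width = t * s_yx * math.sqrt(core)
    return y_hat - half_width, y_hat + half_width

y_hat = 18.9476 + 1.1842 * 25    # point estimate for 25 calls
# x_mean=22 and ssx=760 are assumed values, not taken from this transcript
ci = interval(y_hat, 2.306, 9.901, 10, 25, x_mean=22, ssx=760)
pi = interval(y_hat, 2.306, 9.901, 10, 25, x_mean=22, ssx=760, prediction=True)
print(ci, pi)
```

The prediction interval is wider because it covers a single representative's sales rather than the mean for all representatives who make 25 calls.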
  • 173. 47 Confidence and Prediction Intervals – Minitab Illustration
  • 174. 48 Transforming Data • The coefficient of correlation describes the strength of the linear relationship between two variables. It could be that two variables are closely related, but their relationship is not linear. • Be cautious when you are interpreting the coefficient of correlation. A value of r may indicate there is no linear relationship, but it could be there is a relationship of some other nonlinear or curvilinear form.
  • 175. 49 Transforming Data - Example On the right is a listing of 22 professional golfers, the number of events in which they participated, the amount of their winnings, and their mean score for the 2004 season. In golf, the objective is to play 18 holes in the least number of strokes. So, we would expect that those golfers with the lower mean scores would have the larger winnings. To put it another way, score and winnings should be inversely related. In 2004 Tiger Woods played in 19 events, earned $5,365,472, and had a mean score per round of 69.04. Fred Couples played in 16 events, earned $1,396,109, and had a mean score per round of 70.92. The data for the 22 golfers follows.
  • 176. 50 Scatterplot of Golf Data • The correlation between the variables Winnings and Score is -0.782. This is a fairly strong inverse relationship. • However, when we plot the data on a scatter diagram the relationship does not appear to be linear; it does not seem to follow a straight line.
  • 177. 51 What can we do to explore other (nonlinear) relationships? One possibility is to transform one of the variables. For example, instead of using Y as the dependent variable, we might use its log, reciprocal, square, or square root. Another possibility is to transform the independent variable in the same way. There are other transformations, but these are the most common.
  • 178. 52 In the golf winnings example, changing the scale of the dependent variable is effective. We determine the log of each golfer’s winnings and then find the correlation between the log of winnings and score. That is, we find the log to the base 10 of Tiger Woods’ earnings of $5,365,472, which is 6.72961. Transforming Data - Example
  • 179. 53 Scatter Plot of Transformed Y
  • 180. 54 Linear Regression Using the Transformed Y
  • 181. 55 Using the Transformed Equation for Estimation Based on the regression equation, a golfer with a mean score of 70 could expect to earn: •The value 6.4372 is the log to the base 10 of winnings. •The antilog of 6.4372 is about 2,736,528. •So a golfer who had a mean score of 70 could expect to earn $2,736,528.
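The log transformation and its inverse are easy to verify. This short sketch uses only the two figures quoted on the slides: Tiger Woods' winnings and the estimated log winnings of 6.4372.

```python
import math

# Base-10 log of the 2004 winnings quoted in the example
log_winnings = math.log10(5_365_472)

# Antilog: convert an estimated log10(winnings) back to dollars
estimated_winnings = 10 ** 6.4372

print(round(log_winnings, 5), round(estimated_winnings))
```

Because the regression is fitted on log winnings, every estimate from the transformed equation has to be converted back with the antilog before it can be read in dollars.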
  • 183. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin Multiple Linear Regression and Correlation Analysis Chapter 14
  • 184. 2 GOALS • Describe the relationship between several independent variables and a dependent variable using multiple regression analysis. • Set up, interpret, and apply an ANOVA table. • Compute and interpret the multiple standard error of estimate, the coefficient of multiple determination, and the adjusted coefficient of multiple determination. • Conduct a test of hypothesis to determine whether regression coefficients differ from zero. • Conduct a test of hypothesis on each of the regression coefficients. • Use residual analysis to evaluate the assumptions of multiple regression analysis. • Evaluate the effects of correlated independent variables. • Use and understand qualitative independent variables. • Understand and interpret the stepwise regression method. • Understand and interpret possible interaction among independent variables.
  • 185. 3 Multiple Regression Analysis The general multiple regression with k independent variables is given by: $\hat{Y} = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$. The least squares criterion is used to develop this equation. Because determining b1, b2, etc. is very tedious, a software package such as Excel or MINITAB is recommended.
  • 186. 4 Multiple Regression Analysis For two independent variables, the general form of the multiple regression equation is: $\hat{Y} = a + b_1 X_1 + b_2 X_2$ •X1 and X2 are the independent variables. •a is the Y-intercept. •b1 is the net change in Y for each unit change in X1 holding X2 constant. It is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.
  • 187. 5 Regression Plane for a 2-Independent Variable Linear Regression Equation
  • 188. 6 Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace. To investigate, Salsberry’s research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well Multiple Linear Regression - Example
  • 190. 8 Multiple Linear Regression – Minitab Example
  • 191. 9 Multiple Linear Regression – Excel Example
  • 192. 10 The Multiple Regression Equation – Interpreting the Regression Coefficients The regression coefficient for mean outside temperature is -4.583. The coefficient is negative and shows an inverse relationship between heating cost and temperature. As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information. If we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of $4.583 in monthly heating cost. So if the mean temperature in Boston is 25 degrees and it is 35 degrees in Philadelphia, all other things being the same (insulation and age of furnace), we expect the heating cost would be $45.83 less in Philadelphia. The attic insulation variable also shows an inverse relationship: the more insulation in the attic, the less the cost to heat the home. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the cost to heat the home to decline $14.83 per month, holding the outside temperature and the age of the furnace constant. The age of the furnace variable shows a direct relationship. With an older furnace, the cost to heat the home increases. Specifically, for each additional year older the furnace is, we expect the cost to increase $6.10 per month.
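The interpretation of the partial regression coefficients can be sketched as a change in estimated cost. The coefficients below are the three quoted on the slide, with signs taken from the described directions; the intercept is not given in this transcript, so only differences in cost are computed. The function name `cost_change` is illustrative.

```python
# Partial regression coefficients quoted on the slide (dollars per unit change)
B_TEMP = -4.583    # per degree of mean outside temperature
B_INSUL = -14.83   # per inch of attic insulation
B_AGE = 6.10       # per year of furnace age

def cost_change(d_temp=0.0, d_insul=0.0, d_age=0.0):
    """Estimated change in monthly heating cost, other variables held constant."""
    return B_TEMP * d_temp + B_INSUL * d_insul + B_AGE * d_age

# Boston (25 degrees) versus Philadelphia (35 degrees), all else equal
boston_to_philadelphia = cost_change(d_temp=35 - 25)
print(round(boston_to_philadelphia, 2))
```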
  • 193. 11 Applying the Model for Estimation What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?
  • 194. 12 Multiple Standard Error of Estimate The multiple standard error of estimate is a measure of the effectiveness of the regression equation. • It is measured in the same units as the dependent variable. • It is difficult to determine what is a large value and what is a small value of the standard error. • The formula is: $s_{Y \cdot 12 \ldots k} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - (k + 1)}}$
  • 196. 14 Multiple Regression and Correlation Assumptions • The independent variables and the dependent variable have a linear relationship. • The dependent variable must be continuous and at least interval-scale. • The variation in the residuals must be the same for all values of Y. When this is the case, we say the residuals exhibit homoscedasticity. • The residuals should follow the normal distribution with mean 0. • Successive values of the dependent variable must be uncorrelated.
  • 197. 15 The ANOVA Table The ANOVA table reports the variation in the dependent variable. The variation is divided into two components. • The Explained Variation is that accounted for by the set of independent variables. • The Unexplained or Random Variation is not accounted for by the independent variables.
  • 198. 16 Minitab – the ANOVA Table
  • 199. 17 Coefficient of Multiple Determination (R²) Characteristics of the coefficient of multiple determination: 1. It is symbolized by a capital R squared. In other words, it is written as R² because it behaves like the square of a correlation coefficient. 2. It can range from 0 to 1. A value near 0 indicates little association between the set of independent variables and the dependent variable. A value near 1 means a strong association. 3. It cannot assume negative values. Any number that is squared or raised to the second power cannot be negative. 4. It is easy to interpret. Because R² is a value between 0 and 1 it is easy to interpret, compare, and understand.
  • 200. 18 Minitab – the ANOVA Table $R^2 = \frac{SSR}{SS\ \text{total}} = \frac{171{,}220}{212{,}916} = 0.804$
  • 201. 19 Adjusted Coefficient of Determination • The number of independent variables in a multiple regression equation makes the coefficient of determination larger. Each new independent variable causes the predictions to be more accurate. • If the number of variables, k, and the sample size, n, are equal, the coefficient of determination is 1.0. In practice, this situation is rare and would also be ethically questionable. • To balance the effect that the number of independent variables has on the coefficient of multiple determination, statistical software packages use an adjusted coefficient of multiple determination.
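The coefficient of multiple determination can be computed directly from the ANOVA sums of squares quoted above. The adjusted version is not written out in this transcript; the sketch assumes the commonly used definition, 1 − [SSE/(n − (k + 1))] / [SS total/(n − 1)], with n = 20 homes and k = 3 independent variables from the heating cost example.

```python
def r_squared(ssr: float, ss_total: float) -> float:
    """Proportion of total variation explained by the regression."""
    return ssr / ss_total

def adjusted_r_squared(ssr: float, ss_total: float, n: int, k: int) -> float:
    """Adjusted R^2 (assumed standard definition), penalising extra variables."""
    sse = ss_total - ssr
    return 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))

r2 = r_squared(171_220, 212_916)
adj_r2 = adjusted_r_squared(171_220, 212_916, n=20, k=3)
print(round(r2, 3), round(adj_r2, 3))
```

The adjusted value is smaller than R² because each sum of squares is divided by its degrees of freedom, so adding a variable that explains little can lower it.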
  • 203. 21 Correlation Matrix A correlation matrix is used to show all possible simple correlation coefficients among the variables. • The matrix is useful for locating correlated independent variables. • It shows how strongly each independent variable is correlated with the dependent variable.
  • 204. 22 Global Test: Testing the Multiple Regression Model The global test is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ and $H_1$: Not all $\beta$s equal 0
  • 205. 23 Global Test continued • The test statistic follows the F distribution with k (number of independent variables) and n-(k+1) degrees of freedom, where n is the sample size. • Decision Rule: Reject H0 if $F > F_{\alpha,\,k,\,n-k-1}$
  • 208. 26 Interpretation • The computed value of F is 21.90, which is in the rejection region. • The null hypothesis that all the multiple regression coefficients are zero is therefore rejected. • Interpretation: some of the independent variables (amount of insulation, etc.) do have the ability to explain the variation in the dependent variable (heating cost). • Logical question: which ones?
  • 209. 27 Evaluating Individual Regression Coefficients (βi = 0) • This test is used to determine which independent variables have nonzero regression coefficients. • The variables that have zero regression coefficients are usually dropped from the analysis. • The test statistic follows the t distribution with n-(k+1) degrees of freedom. • The hypothesis test is as follows: H0: βi = 0 H1: βi ≠ 0 Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
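The computed F of 21.90 follows from the ANOVA sums of squares quoted earlier (SSR = 171,220 and SS total = 212,916) with n = 20 and k = 3. The sketch assumes the usual ANOVA definitions, F = MSR/MSE, and also returns sqrt(MSE), which is the multiple standard error of estimate; the function name `global_f` is illustrative.

```python
import math

def global_f(ssr: float, ss_total: float, n: int, k: int):
    """Return (F statistic, multiple standard error) from ANOVA sums of squares."""
    sse = ss_total - ssr             # unexplained variation
    msr = ssr / k                    # mean square regression
    mse = sse / (n - (k + 1))        # mean square error
    return msr / mse, math.sqrt(mse)

f_stat, s_multiple = global_f(171_220, 212_916, n=20, k=3)
print(round(f_stat, 2), round(s_multiple, 2))
```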
  • 210. 28 Critical t-stat for the Slopes -2.120 2.120
  • 211. 29 Computed t-stat for the Slopes
  • 213. 31 New Regression Model without Variable “Age” – Minitab
  • 214. 32 New Regression Model without Variable “Age” – Minitab
  • 215. 33 Testing the New Model for Significance
  • 216. 34 Critical t-stat for the New Slopes Reject $H_0$ if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$, where $t = \frac{b_i - 0}{s_{b_i}}$. With $\alpha = 0.05$, $n = 20$ and $k = 2$: $t_{0.05/2,\,20-2-1} = t_{0.025,\,17} = 2.110$, so reject $H_0$ if $\frac{b_i - 0}{s_{b_i}} > 2.110$ or $\frac{b_i - 0}{s_{b_i}} < -2.110$. Critical values: -2.110 and 2.110
  • 218. 36 Evaluating the Assumptions of Multiple Regression 1. There is a linear relationship. That is, there is a straight-line relationship between the dependent variable and the set of independent variables. 2. The variation in the residuals is the same for both large and small values of the estimated Y. To put it another way, the size of the residual is unrelated to whether the estimated Y is large or small. 3. The residuals follow the normal probability distribution. 4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated. 5. The residuals are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.
  • 219. 37 Analysis of Residuals A residual is the difference between the actual value of Y and the predicted value of Y. • Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement. • A plot of the residuals and their corresponding Y’ values is used for showing that there are no trends or patterns in the residuals.
  • 222. 40 Distribution of Residuals Both MINITAB and Excel offer another graph that helps to evaluate the assumption of normally distributed residuals. It is called a normal probability plot and is shown to the right of the histogram.
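The decision rule for an individual slope can be sketched as follows. The critical value 2.110 is the one on the slide (17 degrees of freedom, α = 0.05, two-tailed); the coefficient and standard error passed in below are hypothetical, since the Minitab output is not reproduced in this transcript.

```python
def coefficient_t_test(b_i: float, s_b: float, t_crit: float):
    """Return (t, reject) for H0: beta_i = 0 against a two-tailed alternative."""
    t = (b_i - 0) / s_b
    return t, abs(t) > t_crit

T_CRIT = 2.110   # t(0.025, 17), from the slide

# Hypothetical coefficient and standard error, for illustration only
t_value, reject = coefficient_t_test(b_i=-5.0, s_b=1.0, t_crit=T_CRIT)
print(t_value, reject)
```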
  • 223. 41 Multicollinearity • Multicollinearity exists when independent variables (X’s) are correlated. • Correlated independent variables make it difficult to make inferences about the individual regression coefficients (slopes) and their individual effects on the dependent variable (Y). • However, correlated independent variables do not affect a multiple regression equation’s ability to predict the dependent variable (Y).
  • 224. 42 Variance Inflation Factor • A general rule is if the correlation between two independent variables is between -0.70 and 0.70 there likely is not a problem using both of the independent variables. • A more precise test is to use the variance inflation factor (VIF). • The value of VIF is found as follows: $VIF = \frac{1}{1 - R_j^2}$ •The term $R_j^2$ refers to the coefficient of determination, where the selected independent variable is used as a dependent variable and the remaining independent variables are used as independent variables. •A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis.
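The variance inflation factor is a one-line calculation once R²ⱼ is known. This sketch uses the standard definition VIF = 1 / (1 − R²ⱼ); the R²ⱼ values are hypothetical and chosen only to show how the threshold of 10 corresponds to R²ⱼ = 0.90.

```python
def vif(r2_j: float) -> float:
    """Variance inflation factor for one independent variable.

    r2_j is the coefficient of determination from regressing that variable
    on the remaining independent variables.
    """
    return 1.0 / (1.0 - r2_j)

# Hypothetical R^2_j values
for r2_j in (0.0, 0.5, 0.9):
    print(r2_j, vif(r2_j))
```

A VIF of 1 means the variable is uncorrelated with the others; the value grows without bound as R²ⱼ approaches 1.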
  • 225. 43 Multicollinearity – Example Refer to the data in the table, which relates the heating cost to the independent variables outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem with multicollinearity? Find and interpret the variance inflation factor for each of the independent variables.
  • 227. 45 VIF – Minitab Example The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables. Coefficient of Determination
  • 228. 46 Independence Assumption • The fifth assumption about regression and correlation analysis is that successive residuals should be independent. • When successive residuals are correlated we refer to this condition as autocorrelation. Autocorrelation frequently occurs when the data are collected over a period of time.
  • 229. 47 Residual Plot versus Fitted Values • The graph below shows the residuals plotted on the vertical axis and the fitted values on the horizontal axis. • Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.
  • 230. 48 Qualitative Independent Variables • Frequently we wish to use nominal-scale variables in our analysis, such as gender, whether the home has a swimming pool, or whether the sports team was the home or the visiting team. These are called qualitative variables. • To use a qualitative variable in regression analysis, we use a scheme of dummy variables in which one of the two possible conditions is coded 0 and the other 1.
  • 231. 49 Qualitative Variable - Example Suppose in the Salsberry Realty example that the independent variable “garage” is added. For those homes without an attached garage, 0 is used; for homes with an attached garage, a 1 is used. We will refer to this as the “garage” variable. The data from Table 14–2 are entered into the MINITAB system.
  • 233. 51 Using the Model for Estimation What is the effect of the garage variable? Suppose we have two houses exactly alike next to each other in Buffalo, New York; one has an attached garage, and the other does not. Both homes have 3 inches of insulation, and the mean January temperature in Buffalo is 20 degrees. For the house without an attached garage, a 0 is substituted for the garage variable in the regression equation. The estimated heating cost is $280.90, found by: For the house with an attached garage, a 1 is substituted for the garage variable in the regression equation. The estimated heating cost is $358.30, found by: Without garage With garage
  • 234. 52 Testing the Model for Significance • We have shown the difference between the two types of homes to be $77.40, but is the difference significant? • We conduct the following test of hypothesis. H0: βi = 0 H1: βi ≠ 0 Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
  • 235. 53 Evaluating Individual Regression Coefficients (βi = 0) • This test is used to determine which independent variables have nonzero regression coefficients. • The variables that have zero regression coefficients are usually dropped from the analysis. • The test statistic follows the t distribution with n-(k+1) or n-k-1 degrees of freedom. • The hypothesis test is as follows: H0: βi = 0 H1: βi ≠ 0 Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
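The 0/1 coding of a qualitative variable can be sketched in a few lines. The list of homes is hypothetical (the data table is not reproduced in this transcript), and `encode_garage` is an illustrative helper name.

```python
def encode_garage(has_attached_garage: bool) -> int:
    """Dummy-code the garage variable: 1 = attached garage, 0 = no attached garage."""
    return 1 if has_attached_garage else 0

# Hypothetical sample of three homes
homes_have_garage = [True, False, True]
garage = [encode_garage(h) for h in homes_have_garage]
print(garage)
```

With this coding, the regression coefficient of the dummy variable is the estimated difference in heating cost between homes with and without an attached garage, other variables held constant.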
  • 237. 55 Stepwise Regression The advantages to the stepwise method are: 1. Only independent variables with significant regression coefficients are entered into the equation. 2. The steps involved in building the regression equation are clear. 3. It is efficient in finding the regression equation with only significant regression coefficients. 4. The changes in the multiple standard error of estimate and the coefficient of determination are shown.
  • 238. 56 The stepwise MINITAB output for the heating cost problem follows. Temperature is selected first. This variable explains more of the variation in heating cost than any of the other three proposed independent variables. Garage is selected next, followed by Insulation. Stepwise Regression – Minitab Example
  • 239. 57 Regression Models with Interaction • In Chapter 12 we discussed interaction among independent variables. To explain, suppose we are studying weight loss and assume, as the current literature suggests, that diet and exercise are related. So the dependent variable is amount of change in weight and the independent variables are: diet (yes or no) and exercise (none, moderate, significant). We are interested in whether there is interaction among the independent variables. That is, if those studied maintain their diet and exercise significantly, will that increase the mean amount of weight lost? Is total weight loss more than the sum of the loss due to the diet effect and the loss due to the exercise effect? • In regression analysis, interaction can be examined as a separate independent variable. An interaction prediction variable can be developed by multiplying the data values in one independent variable by the values in another independent variable, thereby creating a new independent variable. A two-variable model that includes an interaction term is: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
  • 240. 58 Refer to the heating cost example. Is there an interaction between the outside temperature and the amount of insulation? If both variables are increased, is the effect on heating cost greater than the sum of savings from warmer temperature and the savings from increased insulation separately? Regression Models with Interaction - Example
  • 241. 59 Creating the Interaction Variable – Using the information from the table in the previous slide, an interaction variable is created by multiplying the temperature variable by the insulation variable. For the first sampled home the temperature is 35 degrees and the insulation is 3 inches, so the value of the interaction variable is 35 × 3 = 105. The values of the other interaction products are found in a similar fashion. Regression Models with Interaction - Example
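Building the interaction variable is an element-by-element product of the two columns. In this sketch only the first home (35 degrees, 3 inches) comes from the slide; the remaining values are hypothetical placeholders.

```python
# First entry is from the slide; the other entries are hypothetical
temperature = [35, 29, 36]
insulation = [3, 4, 7]

# Interaction variable: temperature multiplied by insulation, home by home
interaction = [t * i for t, i in zip(temperature, insulation)]
print(interaction[0])
```

The new column is then entered into the regression alongside temperature and insulation, and its coefficient is tested like any other slope.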
  • 242. 60 Regression Models with Interaction - Example
  • 243. 61 The regression equation is: Is the interaction variable significant at 0.05 significance level? Regression Models with Interaction - Example
  • 244. 62 There are other situations that can occur when studying interaction among independent variables. 1. It is possible to have a three-way interaction among the independent variables. In the heating example, we might have considered the three-way interaction between temperature, insulation, and age of the furnace. 2. It is possible to have an interaction where one of the independent variables is nominal scale. In our heating cost example, we could have studied the interaction between temperature and garage.