1. Tyler Anton
1
Spring 2014
Problem Set #3
Hypothesis Testing
1. University of Maryland University College is concerned that out of state students may be
receiving lower grades than Maryland students. Two independent random samples have been
selected: 165 observations from population 1 (Out of state students) and 177 from population 2
(Maryland students). The sample means obtained are X1(bar)=86 and X2(bar)=87. It is known
from previous studies that the population variances are 8.1 and 7.3 respectively. Using a level of
significance of .01, is there evidence that the out of state students may be receiving lower
grades? Fully explain your answer.
H0: μ1 ≥ μ2
H1: μ1 < μ2 [Rejection Region in lower (left) tail]
Level of Significance = 0.01 @ one-tailed test (Appendix B.5)
*Critical Value (infinite df) = (-) 2.326; less than = (-) Critical Value via one-tail; Rejection
Region in lower (left) tail
Thus, reject H0 if z < - 2.326
Population variances (known): σ1^2 = 8.1, σ2^2 = 7.3
Z = (86-87) / SQRT [(8.1/165) + (7.3/177)]
Z = (-1/0.3005558965)
Z = -3.327168129
Explanation
The Z test statistic (-3.327) is less than the critical value (-2.326), so it falls in the one-tail
rejection region in the left (lower) tail. We therefore reject H0 in favor of H1.
Thus, at the 0.01 level, there is evidence that out-of-state students receive lower grades than Maryland students.
Reject H0 if P-value < Level of significance (0.01)
*P-value = [0.5 – 0.4996] = 0.0004; Thus, reject H0; small likelihood H0 is true
*0.4996 derived from Appendix 3.B; Area under the curve corresponding to z = 3.33 is 0.4996
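As a cross-check on the arithmetic above, the z statistic and the lower-tail P-value can be recomputed with Python's standard library. This is a minimal sketch using the figures from the problem statement; the variable names are illustrative and not part of the original problem set.

```python
import math

# Sample summaries taken from the problem statement.
n1, n2 = 165, 177          # out-of-state, Maryland
mean1, mean2 = 86.0, 87.0  # sample means
var1, var2 = 8.1, 7.3      # known population variances
alpha = 0.01               # one-tailed level of significance

# Two-sample z statistic with known variances.
std_error = math.sqrt(var1 / n1 + var2 / n2)
z = (mean1 - mean2) / std_error

# Lower-tail probability P(Z <= z) from the standard normal CDF,
# written with the complementary error function.
p_value = 0.5 * math.erfc(-z / math.sqrt(2))

reject_h0 = p_value < alpha
print(f"z = {z:.4f}, p-value = {p_value:.5f}, reject H0: {reject_h0}")
```

The exact P-value is slightly more precise than the table lookup, but the decision is the same.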
Simple Regression
2. A CEO of a large pharmaceutical company would like to determine if the company should
be placing more money allotted in the budget next year for television advertising of a new drug
marketed for controlling diabetes. He wonders whether there is a strong relationship between the
amount of money spent on television advertising for this new drug called DIB and the number of
orders received. The manufacturing process of this drug is very difficult and requires stability so
the CEO would prefer to generate a stable number of orders. The cost of advertising is always an
important consideration in the phase I roll-out of a new drug. Data that have been collected over
the past 20 months indicate the amount of money spent on television advertising and the number
of orders received.
The use of linear regression is a critical tool for a manager's decision-making ability.
Please carefully read the example below and try to answer the questions in terms of the problem
context. The results are as follows:
NOTE: If you do not have the Data Analysis option under Tools you must install it. You need
to go to Tools select Add-ins and then choose the 2 data toolpak options. It should take about a
minute.
Month Advertising Cost ($) Number of Orders
1 74,430 2,856,000
2 62,620 1,800,000
3 67,580 1,299,000
4 53,680 1,510,000
5 69,180 1,367,000
6 73,140 2,611,000
7 85,370 3,788,000
8 76,880 2,935,000
9 66,990 1,955,000
10 77,230 3,634,000
11 61,380 1,598,000
12 62,750 1,867,000
13 63,270 1,899,000
14 86,190 3,245,000
15 60,030 1,934,000
16 79,210 2,761,000
17 67,770 1,625,000
18 84,530 3,778,000
19 79,760 2,979,000
20 84,640 3,814,000
a. Set up a scatter diagram and calculate the associated correlation
coefficient. Discuss how strong you think the relationship is between the
amount of money spent on television advertising and the number of orders
received.
Please use the Correlation procedures within Excel under Tools > Data Analysis.
Implication: The number of orders received is related to the advertising costs/budget.
Dependent Variable = [Number of Orders]
Independent Variable = [Advertising Costs]
[Scatter plot: Advertising Cost & Orders Received Comparison. x-axis: Orders Received (x); y-axis: Advertising Costs (y), $0 to $100,000. Trendline: y = 0.0097x + 47895, R² = 0.776]
Correlation Coefficient (r) 0.880931435
The scatter plot and the correlation coefficient (r) of 0.8809 indicate a strong positive
correlation. A value of (r) near 1 indicates a direct, positive linear relationship between the two
variables, advertising costs and number of orders: months with higher advertising costs tend to
have more orders received. So far, this suggests the CEO should consider increasing the
advertising budget for the new drug, DIB.
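The correlation coefficient reported by Excel can be reproduced directly from the 20 monthly observations. The sketch below computes Pearson's r from its definition; the helper name `pearson_r` and the `data` list layout are illustrative choices, not part of the original assignment.

```python
import math

# (advertising cost in $, number of orders) for months 1-20, from the data table.
data = [
    (74430, 2856000), (62620, 1800000), (67580, 1299000), (53680, 1510000),
    (69180, 1367000), (73140, 2611000), (85370, 3788000), (76880, 2935000),
    (66990, 1955000), (77230, 3634000), (61380, 1598000), (62750, 1867000),
    (63270, 1899000), (86190, 3245000), (60030, 1934000), (79210, 2761000),
    (67770, 1625000), (84530, 3778000), (79760, 2979000), (84640, 3814000),
]

def pearson_r(x, y):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

ad_cost = [cost for cost, _ in data]
orders = [count for _, count in data]

r = pearson_r(orders, ad_cost)
print(f"r = {r:.4f}")
```

Because correlation is symmetric, swapping the two arguments gives the same value.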
b. Assuming there is a statistically significant relationship, use the least squares method to
find the regression equation to predict the advertising costs based on the number of orders
received. Please use the regression procedure within Excel under Tools > Data Analysis to
construct this equation.
Least Squares Regression Equation: y = 0.00971950x + 47895
R² = 0.776
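For readers without Excel, the least squares coefficients can be obtained from the textbook formulas b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x). The following is a minimal sketch with orders as x and advertising cost as y, as in the answer above; `fit_line` is a hypothetical helper name.

```python
# (advertising cost in $, number of orders) for months 1-20, from the data table.
data = [
    (74430, 2856000), (62620, 1800000), (67580, 1299000), (53680, 1510000),
    (69180, 1367000), (73140, 2611000), (85370, 3788000), (76880, 2935000),
    (66990, 1955000), (77230, 3634000), (61380, 1598000), (62750, 1867000),
    (63270, 1899000), (86190, 3245000), (60030, 1934000), (79210, 2761000),
    (67770, 1625000), (84530, 3778000), (79760, 2979000), (84640, 3814000),
]

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b1, b0)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

y = [cost for cost, _ in data]     # dependent: advertising cost
x = [count for _, count in data]   # independent: number of orders

slope, intercept = fit_line(x, y)
print(f"y = {slope:.8f}x + {intercept:.2f}")
```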
c. Interpret the meaning of the slope, b1, in the regression equation.
The coefficient for the ‘Number of Orders Received’ (x) is 0.00971950. For each additional
order received, the predicted ‘Advertising Costs’ increase by about $0.0097 (just under
1 cent), or roughly $9.72 per 1,000 additional orders.
B. Regression
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.880931435
R Square 0.776040194
Adjusted R Square 0.763597982
Standard Error 4704.512237
Observations 20
ANOVA
df SS MS F Significance F
Regression 1 1380434618 1380434618 62.3715644 2.943E-07
Residual 18 398383837 22132435.39
Total 19 1778818455
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 47894.77763 3208.26531 14.92855891 1.3962E-11 41154.4623 54635.0929 41154.4623 54635.0929
X Variable 1 0.00971951 0.0012307 7.897566989 2.943E-07 0.00713391 0.01230511 0.00713391 0.01230511
Note that R Squared here is the same (.776) as we got on the chart.
Also the equation coefficients are identical (47895 and .00971)
d. Predict the monthly advertising cost when the number of orders is 2,300,000. (Hint: Be very
careful with assigning the dependent variable for this problem)
y = dependent variable being estimated. In part d, Advertising Costs are forecasted; hence,
Advertising Costs are the dependent variable.
y = 0.00971950x + 47895
y (Advertising Costs) = 0.00971950(2300000) + 47895
Monthly Advertising Cost (When x = 2,300,000 orders): $70,250
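The substitution above can be checked in a couple of lines. This sketch uses the rounded coefficients quoted in the answer; the function name `predict_ad_cost` is illustrative.

```python
def predict_ad_cost(orders, slope=0.00971950, intercept=47895.0):
    """Predicted monthly advertising cost ($) for a given number of orders."""
    return slope * orders + intercept

predicted = predict_ad_cost(2_300_000)
print(f"Predicted advertising cost: ${predicted:,.2f}")
```

Note that 2,300,000 orders lies inside the observed range of orders (about 1.3 to 3.8 million), so this is an interpolation rather than an extrapolation.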
e. Compute the coefficient of determination, r², and interpret its meaning.
R² = 0.776 = % of Total variation (SS Total) explained by the regression equation (SSR)
77.6% of the total variation in Advertising Costs (y) is explained by the number of orders
received (x). The data points are scattered around the least squares regression line, so there
will be error in the predictions (actual vs. predicted y's).
The remaining 22.4% of the total variation in the dependent variable is error/residual
(unexplained) variation: the dispersion of the actual (y)’s around the predicted (y)’s on the
linear regression line.
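The 77.6% / 22.4% split follows directly from the sums of squares in the ANOVA table of the regression output. A short sketch, with the three values copied from that table:

```python
# Sums of squares from the ANOVA section of the regression output.
ss_regression = 1380434618   # explained variation (SSR)
ss_residual = 398383837      # unexplained variation (SSE)
ss_total = 1778818455        # total variation (SS Total)

r_squared = ss_regression / ss_total
unexplained_share = ss_residual / ss_total

print(f"r^2 = {r_squared:.4f}, unexplained = {unexplained_share:.4f}")
```

The two shares sum to 1 because SS Total = SSR + SSE.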
f. Compute the standard error of estimate, and interpret its meaning.
Sy.x = standard error of estimate for y (advertising costs, the dependent variable) for a given value of x (number of orders).
Sy.x OR STEYX = 4704.51, i.e. about $4,704.51; or [4704.51/1000] = 4.70451 when costs are expressed in thousands of dollars.
This implies that actual monthly advertising costs typically deviate from the costs
predicted by the regression line by about $4,705.
The predicted value of the dependent variable lies on the regression line at the given
x-value; however, an actual data point may be above or below that line.
Standard error of estimate (SEE): A measure of how inaccurate an estimate might be. It is
the standard deviation of the errors (or residuals), where each error or residual is the
difference between the actual (y) and the predicted (y) on the linear regression line. It is
therefore a measure of how well the regression line represents the scattered data: the
scatter/dispersion of the observed values around the line of regression for a given value of (x).
The greater the dispersion, the larger the SEE. A larger sample size could be used to
estimate the SEE more reliably.
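The reported standard error of estimate can be recovered from the residual sum of squares, using the usual simple-regression formula Sy.x = sqrt(SSE / (n - 2)). A minimal sketch, with SSE and n taken from the regression output:

```python
import math

ss_residual = 398383837  # SSE from the ANOVA table of the regression output
n = 20                   # number of monthly observations

# Two parameters (intercept and slope) are estimated, leaving n - 2 degrees of freedom.
degrees_of_freedom = n - 2
see = math.sqrt(ss_residual / degrees_of_freedom)

print(f"Standard error of estimate = {see:.2f}")
```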
g. Do you think that the company should use these results from the regression to base any
corporate decisions on? Explain fully.
Yes.
SEE & r² are the best measures to evaluate the predictive ability of the regression equation.
The scatter plot and the correlation coefficient (r) of 0.8809 indicate a strong positive
correlation. A value of (r) near 1 indicates a direct, positive linear relationship between the two
variables, advertising costs and number of orders. This (r) suggests that the model has strong
predictive ability.
As for r², 77.6% of the variation in Advertising Costs (y) is explained by the number of orders
received (x). However, 22.4% of the total variation in the dependent variable is error/residual
(unexplained) variation: the dispersion of the actual (y)’s around the predicted (y)’s
on the linear regression line.
The standard error of estimate is $4,704.51 (4.70451 in thousands of dollars). Our
forecasted monthly advertising costs are therefore typically off by about $4,705, which is
quite small relative to monthly costs of roughly $54,000 to $86,000, considering the following:
The correlation coefficient is large (0.8809) since the scattered points tend to be close to the
linear regression line. The correlation coefficient and SEE are inversely related. Thus, as
the strength of the linear relationship between the 2 variables increases, the SEE decreases.
Due to the high correlation between the independent and dependent variables, there is less
erratic scatter/dispersion, indicating the regression equation is sufficient and accounts for
over three quarters (77.6%) of total variation. A larger sample, however, such as 3 or 4 years
of data, could be used to estimate the SEE more reliably.
This regression model can be used to predict future values with reasonable confidence
within the range of the observed data; the relationship is highly statistically significant
(Significance F = 2.943E-07).
Hypothesis Testing on Multiple Populations
3. Dr. Michaella Evans, a statistics professor at the University of Maryland University College,
drives from her home to the school every weekday. She has three options to drive there. She can
take the Beltway, or she can take a main highway with some traffic lights, or she can take the
back road, which has no traffic lights but is a longer distance. Being as data-oriented as she is,
she is interested to know if there is a difference in the time it takes to drive each route.
As an experiment she randomly selected the route on 21 different days and wrote down the time
it took her for the round trip, getting to work in the morning and back home in the evening.
At the .01 significance level, can she conclude that there is a difference between the driving
times using the different routes?
Time (in minutes) it took to get to work and back using:
Beltway   Main highway   Back road
88        79             86
94        86             78
91        75             79
88        83             96
98        74             97
84        72             73
90        -              68
77        -              -
You can check your critical value with the following table:
http://www.statsoft.com/textbook/distribution-tables
Pg 391 & 751
H0: μ1 = μ2 = μ3
H1: The mean driving times are not all equal
Level of Significance = 0.01
Test Statistic = F distribution
df in numerator = (k-1) or 3-1 = 2
df in denominator = (n-k) or 21-3 = 18
Appendix B.6 @ 0.01: critical F (2 and 18 df) = 6.013 (intersection value)
Reject H0 if computed F > 6.013 (6.0129)
Reject H0 if P-value < Level of significance (0.01)
According to the Anova data analysis below, F (3.068) < 6.013 and P-value (0.071) > Level of
significance (0.01). Thus, we fail to reject H0: there is not enough evidence to conclude that
there is a difference between the driving times using the different routes. The P-value means
that, if the mean driving times were truly equal, an F this large would occur about 7.1% of the
time; rejecting H0 would therefore carry a higher risk of a type 1 error than the 1% we allowed.
Since 3.0683 < 6.0129, the null hypothesis H0 should not be rejected. There is not enough
evidence to conclude that the driving times differ between the three routes.
Anova: Single Factor (Single Driver,
not multiple like in Two-Factor W/O
Replication on pg 402)
SUMMARY
Groups Count Sum Average Variance
Beltway 8 710 88.75 40.21429
Main highway 6 469 78.16667 30.16667
Back road 7 577 82.42857 122.9524
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 398.9047619 2 199.4524 3.068373 0.071341785 6.012905
Within Groups 1170.047619 18 65.00265
Total 1568.952381 20
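The single-factor ANOVA table can be reproduced from the raw driving times without Excel. The sketch below computes the between-group and within-group sums of squares and the F statistic by hand; the critical value 6.0129 is taken from the output above rather than computed, and the function name `one_way_anova` is illustrative.

```python
# Round-trip driving times (minutes) by route, from the data table.
routes = {
    "Beltway": [88, 94, 91, 88, 98, 84, 90, 77],
    "Main highway": [79, 86, 75, 83, 74, 72],
    "Back road": [86, 78, 79, 96, 97, 73, 68],
}

def one_way_anova(groups):
    """Return (SS between, SS within, F) for a list of samples."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)   # numerator df = k - 1
    ms_within = ss_within / (n - k)     # denominator df = n - k
    return ss_between, ss_within, ms_between / ms_within

F_CRIT = 6.0129  # critical F at alpha = 0.01 with (2, 18) df, from the table above

ss_between, ss_within, f_stat = one_way_anova(list(routes.values()))
reject_h0 = f_stat > F_CRIT
print(f"F = {f_stat:.4f}, reject H0: {reject_h0}")
```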