Analysis of the Boston Housing Data
from the 1970 census:
Diverse Tests and Model Selection Processes regarding the
Variables in Boston Housing Data
Shuai Yuan (ysstats@bu.edu, U37074009)
December 8, 2016
Abstract
In this project, we study the Boston Housing data presented by Harrison and Rubinfeld (1978). The
data contain a variety of variables related to housing for 506 census tracts of Boston from the 1970
census and are included in the R package mlbench. Using the data and the R software, we first study
the scatterplot matrix and the correlations of different variables to get a brief view of their
relations. Then we test several null hypotheses to examine the properties of different models.
Finally, we perform model selection using several methods, including the forward algorithm, the
backward algorithm, and the AIC and BIC criteria, to find and analyze the best-fitting model for our
data. At the same time, we also compute the SSPE for a subset of the data.
Contents
1 Introduction
2 Analysis
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
2.1.2 Explanation of Correlation between two variables
2.2 The statistical tests for the Null Hypotheses of the fitted model
2.3 Model selection by using the forward algorithm
2.4 Model selection by using the backward algorithm
2.5 Model selection by using the AIC and BIC criterion
2.6 Analysis of the related statistics
2.6.1 Fit the model by using the subset of the data
2.6.2 Compute and analyze the SSPE for subset of the data
3 Conclusion
4 Appendix
1 Introduction
The Boston Housing data from the 1970 census are used in this project. The dataset contains
14 variables with 506 observations and is included in the R package mlbench.
In this project we used various tools to analyze the Boston Housing data; the most frequently
used method is linear regression. We also used hypothesis testing, t-tests, F-tests, and model
selection to analyze the properties of the data. Using the data and the R software, we first study
the scatterplot matrix and the correlations of different variables to get a brief view of their
relations. Then we test several null hypotheses to examine the properties of different models.
Finally, we perform model selection using several methods, including the forward algorithm, the
backward algorithm, and the AIC and BIC criteria, to find and analyze the best-fitting model for
our data. At the same time, we also compute the SSPE for a subset of the data.
The outline for the remainder of the paper is as follows. In Section 2, we provide the main results
and analysis of the multiple aspects of our topics. Section 3 concludes. In the Appendix, we
provide our R code together with the related output, as well as the references used in this project.
To be specific, Section 2.1.1 addresses Question 1, Section 2.1.2 Question 2, Section 2.2
Question 3, Section 2.3 Question 4, Section 2.4 Question 5, Section 2.5 Question 6, and
Section 2.6 Question 7.
2 Analysis
To get a brief understanding of the relationships between the variables at the very beginning, we
obtain the scatterplot matrix of several variables and find non-linearity between them; therefore,
the correlation of these variables may not be appropriate for describing their relationships. At
the same time, we compute different test statistics and test several hypotheses for the general
model. Moreover, we perform variable selection using the forward algorithm, the backward
algorithm, and the AIC and BIC criteria. We find that both criteria select the same model, and we
explain why the selected model is the one we need. Finally, we fit the selected model on a subset
of the data and compute and analyze the SSPE.
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
First, according to the description of the R package “mlbench”, the four variables shown in the
scatterplot matrix have the following meanings:
𝒏𝒐𝒙: Nitric oxides concentration (parts per 10 million).
𝒊𝒏𝒅𝒖𝒔: Proportion of non-retail business acres per town.
𝒅𝒊𝒔: Weighted distances to five Boston employment centers.
𝒕𝒂𝒙: Full-value property-tax rate per USD 10,000.
Plot 1 Scatterplot matrix for the variables nox, indus, dis, tax
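Plot 1 can be reproduced with base R's pairs(); a minimal sketch, assuming the mlbench package is installed (the full transcript is in the Appendix):

data("BostonHousing", package = "mlbench")
pairs(~ nox + indus + dis + tax, data = BostonHousing,
      main = "Scatterplot for nox,indus,dis,tax")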
According to the scatterplot matrix, we can find that these four variables are all related in some
patterns. For instance, generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable
𝑑𝑖𝑠, and the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠. The relationships
between the other pairs of variables appear positive at low values, while they become vague and
weaker at high values.
We can also find possible explanations from the meanings of these variables. The variable 𝑛𝑜𝑥 is
the nitric oxides concentration (parts per 10 million), which represents the degree of air
pollution in an area. The variable 𝑑𝑖𝑠 is the weighted distance to five Boston employment centers,
which represents how far an area lies from downtown. The variable 𝑖𝑛𝑑𝑢𝑠 is the proportion of
non-retail business acres per town, which represents how industrial an area is. As we all know,
the air in areas far away from downtown is better, since there are more trees and less traffic and
industry, so the level of pollution there is lower; it is therefore reasonable to see a negative
relationship between 𝑛𝑜𝑥 and 𝑑𝑖𝑠. On the other hand, business and industry concentrate near the
employment centers, so the proportion of non-retail business acres is smaller in areas far away
from downtown; it is therefore reasonable to see a negative relationship between 𝑖𝑛𝑑𝑢𝑠 and 𝑑𝑖𝑠.
2.1.2 Explanation of Correlation between two variables
We know that the correlation coefficient between two variables X and Y is defined as

ρ(X, Y) = Cov(X, Y) / ( √D(X) · √D(Y) ),

where D(·) denotes the variance.
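The coefficient can be computed directly from this definition in base R; a minimal sketch, in which both lines give the same value:

data("BostonHousing", package = "mlbench")
with(BostonHousing, cov(nox, dis) / (sd(nox) * sd(dis)))  # -0.7692301
cor(BostonHousing$nox, BostonHousing$dis)                 # identical result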
The correlation between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 computed this way is about
-0.7692301, which suggests that these two variables are negatively correlated.
However, we should not forget that the correlation coefficient measures only the strength of the
linear relationship between two variables. From the scatterplot we can see that the relationship
between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 looks closer to an exponentially decaying
relationship, so it is not reasonable to use the correlation coefficient between these two
variables to summarize their relationship.
On the other hand, we can also examine their relationship by fitting a model. Assuming a nonlinear
(inverse) relation between them, as in the code in the Appendix, the fitted model yields a highly
significant p-value. Therefore, according to the discussion above, we can safely draw the
conclusion that the correlation between these two variables should not be used to quantify the
strength of the relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠.
2.2 The statistical tests for the Null Hypotheses of the fitted model
For this question, the given full model contains only five variables and an intercept: β₀ is the
intercept; β₁ measures the change in the variable 𝑛𝑜𝑥 when the variable 𝑑𝑖𝑠 increases by one unit;
β₂ measures the change in 𝑛𝑜𝑥 when 𝑙𝑜𝑔(𝑑𝑖𝑠) increases by one unit; β₃ measures the change in 𝑛𝑜𝑥
when 𝑑𝑖𝑠² increases by one unit; β₄ measures the change in 𝑛𝑜𝑥 when 𝑖𝑛𝑑𝑢𝑠 increases by one unit;
and β₅ measures the change in 𝑛𝑜𝑥 when 𝑡𝑎𝑥 increases by one unit. Since three of these variables
are already given in the data set, we just need to transform and add the remaining two, log(dis)
and dis². We therefore create two new variables, named logdis and dissquare, to represent
log(dis) and dis².
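A minimal sketch of these two transformations, matching the variable names used in the Appendix:

data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
                           logdis    = log(dis),  # the log(dis) term
                           dissquare = dis^2)     # the dis^2 term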
For this section, since we want to decide whether specified parameters are equal to 0 or equal to
each other, we use the F-test for all three sub-questions. The F statistic for comparing a reduced
model with the full model is

F = [ (RSS_reduced − RSS_full) / (df_reduced − df_full) ] / ( RSS_full / df_full ),

where RSS denotes the residual sum of squares and df the residual degrees of freedom of the
corresponding model.
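In R, such a nested-model comparison is carried out by anova(); a minimal sketch for Question a (the full transcript is in the Appendix), assuming logdis and dissquare were created as above:

full    <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)
reduced <- lm(nox ~ dis + dissquare + indus + tax,          data = BostonHousing)
anova(full, reduced)   # F = 5.911, p = 0.0154 for dropping log(dis)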
The F statistics and corresponding p-values for the three sub-questions are summarized below:

              question a   question b   question c
F statistic   5.911        6.0524       42.80353
p-value       0.0154       0.002528     <0.0001

Table 1 The F statistics and p-values of questions a, b and c

We will use these values to evaluate each sub-question.
Question a:
According to the definition of the null hypothesis, the main idea is to test whether the
coefficient of the variable log(dis) is equal to 0. Since the variable log(dis) is the only
target here, we can build a new regression model that does not contain log(dis) and compare it
with the original regression model. To compare the two regression models, we perform an F-test to
see whether they are significantly different from each other. From the results of the R code, the
F-value is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis
depends on the significance level α. Here we set α = 0.05; since the p-value is smaller than 0.05,
we reject the null hypothesis and conclude that β₂ is not equal to 0 at the 95% confidence level.
However, if we want to be 99% confident about the result, α changes to 0.01, and since the p-value
is larger than 0.01, we cannot reject the null hypothesis at the 99% confidence level.
Question b:
For part b, we want to determine whether the coefficients of the variable dis and the variable
dis² are both equal to 0. Since this concerns only two coefficients, we can run a test similar to
part a. For this question, we build another regression model that contains the intercept and the
three remaining variables, excluding dis and dis². Then we compare the new regression model with
the original full model to see whether they are significantly different, again using an F-test.
Here the null hypothesis is β₁ = β₃ = 0, and the alternative hypothesis is that at least one of
them is not equal to 0. The F value is 6.0524 and the corresponding p-value is 0.002528. Similarly,
we set α = 0.05; since the p-value is smaller than 0.05, we reject the null hypothesis and conclude
that at least one of β₁ and β₃ is not equal to 0.
Question c:
The situation for part c is rather different: the question asks whether β₂ = β₃ = 0 and whether
β₄ = β₅. We will not use the approach above, but instead use a matrix formulation. We split the
first condition (β₂ = β₃ = 0) into β₂ = 0 and β₃ = 0. The first row of the matrix A therefore has
a “1” in the position of β₂ and “0” elsewhere, and the second row has a “1” in the position of β₃
and “0” elsewhere; when we multiply A by the coefficient vector, the first two entries of the
product are exactly β₂ and β₃. To test whether β₄ = β₅, the third row of A has a “1” in the
position of β₄ and a “-1” in the position of β₅, so its entry in the product is β₄ − β₅. To test
whether each of these quantities equals 0, we perform an F-test. The F value is 42.80353 and the
corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than
α, so we reject the null hypothesis and conclude that at least one of β₂ and β₃ is not equal to 0,
or β₄ is not equal to β₅.
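An algebraically equivalent sketch of this test uses solve() for the general linear hypothesis F = (Aβ̂)ᵀ(A·V̂·Aᵀ)⁻¹(Aβ̂)/q with q = 3; the Appendix reaches the same value through an eigen-decomposition:

A   <- matrix(c(0,0,1,0,0,0,
                0,0,0,1,0,0,
                0,0,0,0,1,-1), nrow = 3, byrow = TRUE)
fit <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)
Ab  <- A %*% coef(fit)                                   # the three tested quantities
Fst <- drop(t(Ab) %*% solve(A %*% vcov(fit) %*% t(A)) %*% Ab) / 3
Fst                                                      # 42.80353
pf(Fst, 3, df.residual(fit), lower.tail = FALSE)         # p-value < 0.0001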
2.3 Model selection by using the forward algorithm
In this section, we will use the forward selection algorithm to analyze the relationship between
the response variable and the potential explanatory variables below. According to the question's
requirements, we transformed the original variables into the forms presented below.
Response variable:
𝐥𝐨𝐠(𝐦𝐞𝐝𝐯), the natural logarithm of the median value of owner-occupied homes in $1000's.
Potential explanatory variables:
𝐫𝐦^𝟐, the square of the average number of rooms per dwelling.
𝐥𝐨𝐠(𝐝𝐢𝐬), the natural logarithm of the weighted distances to five Boston employment centers.
𝐚𝐠𝐞, the proportion of owner-occupied units built prior to 1940.
We performed variable selection using a forward algorithm with a significance level of 5%. For the
forward algorithm, we first regressed the response on each candidate variable separately. We named
these models “forward11” to “forward14”; the details can be found in the Appendix. The results of
the regressions are summarized below:
name model variable t-value Pr(>|t|)
forward11 log(medv) ~ 1 intercept 167 <2e-16
forward12 log(medv) ~ rm^2 - 1 rm^2 130 <2e-16
forward13 log(medv) ~ age - 1 age 44.84 <2e-16
forward14 log(medv) ~ log(dis) - 1 log(dis) 54.66 <2e-16
Table 2 The summary of different models from forward11 to forward14
We can observe from the table that, while all the variables are significant, the intercept has the
largest t-value. Hence, we chose the intercept for our model. Next, we regressed the response on
the intercept together with each of the remaining three variables, in the models named “forward21”
to “forward23”. The summarized results are shown in the table below:
name model variable t-value Pr(>|t|)
forward21 log(medv) ~ rm^2 rm^2 18.8 <2e-16
forward22 log(medv) ~ age age 11.42 <2e-16
forward23 log(medv) ~ log(dis) log(dis) 9.965 <2e-16
Table 3 The summary of different models from forward21 to forward23
As shown in the table, the p-values of all the variables are significant. However, compared with
the other variables, rm^2 has the largest t-value, so we added rm^2 to the model. Then we tested
rm^2 combined with each of the variables log(dis) and age (together with the intercept), in the
models named “forward31” and “forward32”. We obtained the following table:
name model variable t-value Pr(>|t|)
forward31 log(medv) ~ rm^2 + log(dis) log(dis) 8.269 1.21e-15
forward32 log(medv) ~ rm^2 + age age -10.23 <2e-16
Table 4 The summary of the models forward31 and forward32
From the results above, the p-values of both variables are significant, but the variable age has a
smaller p-value (larger |t|) than the variable log(dis). Therefore, we added the variable age to
our model. Finally, we regressed the response variable log(medv) on all of the variables, in the
model “forward41”.
name model variable t-value Pr(>|t|)
forward41 log(medv) ~ rm^2 + age + log(dis) log(dis) 1.068 0.286
Table 5 The summary of the model forward41
Based on the table above, the variable log(dis) is not significant in the model and thus we
removed it from our model. Therefore, after the forward selection, our final model is

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term.
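A compact alternative sketch of this forward search uses base R's add1() with F tests; it reproduces the same final model here, although the text above proceeds by comparing t-values. Medv, Rm, and Dis are the transformed variables created in the Appendix (Question 4), with BostonHousing attached:

base <- lm(Medv ~ 1)                                # start from the intercept-only model
add1(base, scope = ~ Rm + age + Dis, test = "F")    # Rm has the largest F: add it
step1 <- update(base, . ~ . + Rm)
add1(step1, scope = ~ Rm + age + Dis, test = "F")   # age has the larger F: add it next
step2 <- update(step1, . ~ . + age)
add1(step2, scope = ~ Rm + age + Dis, test = "F")   # Dis is no longer significant: stop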
2.4 Model selection by using the backward algorithm
In this section, we will use the backward elimination algorithm to analyze the relationship
between the response variable and the potential explanatory variables, using the transformed
variables defined in the previous section. We performed variable selection using a backward
algorithm with a significance level of 5%. For the backward algorithm, we first regressed the
response on all of the variables, in the model named “backward11”; the details can be found in the
Appendix. The results of the regression are summarized below:
name model variable t-value Pr(>|t|)
backward11 log(medv) ~ rm^2 + age + log(dis) intercept 21.224 <2e-16
rm^2 17.676 <2e-16
age -5.758 1.48e-08
log(dis) 1.068 0.286
Table 6 The summary of the model backward11
Based on the results, we can find that all the explanatory variables are significant except the
variable log(dis), whose t-value is 1.068 and p-value is 0.286. Thus, we removed the variable
log(dis) and built a new model, called “backward21”, with the remaining variables. Here are the
results:
name model variable t-value Pr(>|t|)
backward21 log(medv) ~ rm^2 + age intercept 32.12 <2e-16
rm^2 17.85 <2e-16
age -10.23 <2e-16
Table 7 The summary of the model backward21
After deleting the variable log(dis) from the model, the remaining variables are all significant,
so we ended up with the model “backward21”. This is the same model obtained by the forward
algorithm:

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term.
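A minimal sketch of this backward pass using base R's drop1() with F tests (same variables as in the previous sketch):

full <- lm(Medv ~ Rm + age + Dis)     # start from the model with all three variables
drop1(full, test = "F")               # Dis has the largest p-value (0.286): drop it
reduced <- update(full, . ~ . - Dis)
drop1(reduced, test = "F")            # all remaining terms are significant: stop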
2.5 Model selection by using the AIC and BIC criterion
First of all, we can do a preliminary analysis of the full model of interest. In the full linear
regression model, the t-values and p-values are used to determine whether each variable is
significant. Setting α = 0.05, we can easily see that rm^2, age, and the intercept all have
p-values below 0.05, which means they are significant. However, the variable log(dis) has a
p-value of 0.286, which is not significant at all.
In this section, we perform variable selection using the AIC and BIC criteria. AIC measures the
relative quality of statistical models for a given set of data: given a collection of models, AIC
estimates the quality of each model relative to the others, and hence provides a means for model
selection. BIC is a criterion for model selection among a finite set of models; the model with the
lowest BIC score is preferred. The formulas for AIC and BIC are shown below:
AIC(m) = n · log( RSS(m) / n ) + 2 · |m|
BIC(m) = n · log( RSS(m) / n ) + log(n) · |m|
where m is the regression model, n is the sample size, RSS(m) is the residual sum of squares of
model m, and |m| denotes the number of estimated coefficients in m. In this project the sample
size is 506, and all we need to do is fit every candidate regression model in R and compute the
corresponding AIC and BIC scores. The candidate models are summarized below:
Candidate Models AIC Score BIC Score
log(medv) ~ 1 -904.371 -900.145
log(medv) ~ rm^2 - 1 -659.289 -655.063
log(medv) ~ age - 1 321.927 326.154
log(medv) ~ log(dis) - 1 155.969 160.195
log(medv) ~ rm^2 + log(dis) - 1 -750.189 -741.736
log(medv) ~ log(dis) + age - 1 -533.471 -525.018
log(medv) ~ rm^2 + age - 1 -702.556 -694.102
log(medv) ~ age -1018.83 -1010.378
log(medv) ~ rm^2 -1171.36 -1162.907
log(medv) ~ log(dis) -993.379 -984.926
log(medv) ~ rm^2 + log(dis) -1233.86 -1221.175
log(medv) ~ rm^2 + age -1265.07 -1252.394
log(medv) ~ log(dis) + age -1021.36 -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1 -940.149 -929.47
log(medv) ~ rm^2 + log(dis) + age -1264.22 -1247.315
log(medv) ~ -1 1132.453 1132.453
Table 8 The AIC and BIC scores of all possible models
From the table above, we can find that the regression model with the smallest AIC score contains
the variables rm^2 and age as well as the intercept. The regression model with the smallest BIC
score is the same model. When we check this model, all of its variables are significant at
α = 0.05. So we select the model containing the variables rm^2, age, and the intercept under both
the AIC and the BIC criteria.
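A minimal sketch of how the scores in the table are computed for the winning candidate m12, following the formulas above (logmedv and rmsq are the transformed variables created in the Appendix, Question 6):

n   <- 506
m12 <- lm(logmedv ~ rmsq + age)                # rm^2 + age + intercept: 3 coefficients
rss <- sum(residuals(m12)^2)
n * log(rss / n) + 2 * 3                       # AIC score: -1265.073
n * log(rss / n) + log(n) * 3                  # BIC score: -1252.394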
2.6 Analysis of the related statistics
2.6.1 Fit the model by using the subset of the data
According to the results above, we finally choose the model “m12”, which has the minimum BIC (and
AIC), as our fitted model. From Question 6, the fitted model can be written as

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term. We can now use the data from Group1 to fit this model. From the R
output, the fitted model is

log(medv) = 2.3360 + 0.0256·rm² − 0.0048·age.

Moreover, the p-values of all the explanatory variables are significant at any conventional level.
2.6.2 Compute and analyze the SSPE for subset of the data
On the other hand, we can also apply another method, called cross-validation, to further examine
the model selection. This method proceeds as follows. First, we split the data into two subsets
according to a user-defined criterion, Group1 and Group2, also called the training data and the
validation data. Second, we fit the model using the data from Group1. Third, based on the data
from Group2, we predict the response variable; we denote the observed values by log(medv)ᵢ and the
predicted values by ŷᵢ. Finally, we compute the SSPE, the “Sum of Squared Prediction Errors”.
Therefore, according to the question, we first divided the original data set “BostonHousing” into
two groups, Group1 and Group2. We then computed the SSPE over Group2 from its definition:

SSPE = Σᵢ ( log(medv)ᵢ − ŷᵢ )²,

where the sum runs over the observations in Group2.
Here log(medv)ᵢ denotes the observed response in Group2 and ŷᵢ the predicted value, computed with
R's predict function. The resulting SSPE for Group2 is 0.02835043.
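A condensed sketch of the whole procedure (the split on zn == 55.0 is the user-defined criterion used in the Appendix, Question 7; logmedv and rmsq are created there):

Group1 <- subset(BostonHousing, zn != 55.0)    # training data
Group2 <- subset(BostonHousing, zn == 55.0)    # validation data
fit  <- lm(logmedv ~ rmsq + age, data = Group1)
pred <- predict(fit, newdata = Group2)
sum((Group2$logmedv - pred)^2)                 # SSPE = 0.02835043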
At the same time, the model we obtained in Section 2.4 (Question 5) is

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term, and it is the same model obtained in Section 2.5 (Question 6).
Therefore, the two selection approaches give the same results for the same model.
3 Conclusion
In this project, we first obtained the scatterplot matrix of four variables, 𝑛𝑜𝑥, 𝑖𝑛𝑑𝑢𝑠, 𝑑𝑖𝑠 and
𝑡𝑎𝑥. According to the scatterplot matrix, we found that these four variables are all related in
some patterns. Generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable 𝑑𝑖𝑠, and
the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠; the relationships between the
other pairs appear positive at low values and become vague and weaker at high values. We also
found possible explanations from the meanings of these variables. Moreover, we found nonlinearity
between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠. Therefore, we cannot use the correlation between
these two variables to quantify the strength of the relationship between 𝑛𝑜𝑥 and 𝑑𝑖𝑠.
Second, we tested several null hypotheses about the fitted model. Using F-tests and the
corresponding p-values, we found that the p-values for the null hypotheses β₂ = 0; β₁ = β₃ = 0;
and β₂ = β₃ = 0 together with β₄ = β₅ are all smaller than 0.05, which means we reject all of the
null hypotheses at that level.
Third, we used the forward algorithm to find the best model for the regression problem. We first
regressed the response on each candidate variable separately and used the p-values to test whether
each variable is significant in the model; the final model includes the variable rm², the variable
age, and the intercept. At the same time, we also used the backward algorithm for model selection:
we started from the model containing all the variables and removed insignificant variables one by
one according to their p-values. The model found by the backward algorithm is the same as the one
found by the forward algorithm.
At the same time, we also used both the AIC and the BIC criteria for model selection. After doing
so, we found that the regression model with the smallest AIC score contains the variables rm^2
and age as well as the intercept, and the regression model with the smallest BIC score is the same
model. With α = 0.05, all the variables in this model are significant. So we select the model
containing rm^2, age, and the intercept under both the AIC and the BIC criteria.
Finally, we applied cross-validation to further examine the model selection, and we computed the
sum of squared prediction errors (SSPE) over Group2. The model obtained in Section 2.4
(Question 5) is the same as the one obtained in Section 2.5 (Question 6), so the two approaches
give the same results for the same model.
4 Appendix
The following are the R codes used for this project, shown as console transcripts: lines beginning
with “>” denote the original commands, and the remaining lines are the corresponding R output.
R codes:
# Question 1:
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax,main="Scatterplot for nox,indus,dis,tax")
# Question 2:
> cor(nox,dis)
[1] -0.7692301
> # note: inside an R formula, 1/dis is not interpreted as the arithmetic
> # inverse; the transformed term must be protected with I()
> model <- lm(nox ~ I(1/dis))
> summary(model)
# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis+logdis+dissquare +indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis+dissquare +indus + tax, BostonHousing)
> anova(u1,u2)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 500 1.6897
2 501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(b)
> u3 <- lm(nox ~ logdis +indus + tax, BostonHousing)
> anova(u1,u3)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 500 1.6897
2 502 1.7306 -2 -0.040907 6.0524 0.002528 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(c)
> A = matrix(c(0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-1),nrow=3,byrow=TRUE)
> Model <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> variance <- (A %*% vcov(Model) %*% t(A))
> E <- eigen(variance, TRUE)
> Evalues <- E$values
> Evectors <-E$vectors
> sqrtvariance <- Evectors %*% diag(1/sqrt(Evalues)) %*% t(Evectors)
> Z <- sqrtvariance %*% A %*% coef(Model)
> F <- sum(Z^2)/3
> F
[1] 42.80353
# Question 4:
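> # (the commands below assume BostonHousing has been attached,
> # e.g. attach(BostonHousing), so that medv, rm, dis and age are visible)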
> Medv<-log(medv)
> Rm<-(rm)^2
> Dis<-log(dis)
> forward11<-lm(Medv~1)
> summary(forward11)
Call:
lm(formula = Medv ~ 1)
Residuals:
Min 1Q Median 3Q Max
-1.42507 -0.19983 0.01949 0.18436 0.87751
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.03451 0.01817 167 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4088 on 505 degrees of freedom
> forward12<-lm(Medv~Rm-1)
> summary(forward12)
Call:
lm(formula = Medv ~ Rm - 1)
Residuals:
Min 1Q Median 3Q Max
-2.5860 -0.1694 0.1560 0.4042 2.3811
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Rm 0.0735845 0.0005646 130.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5208 on 505 degrees of freedom
Multiple R-squared: 0.9711, Adjusted R-squared: 0.9711
F-statistic: 1.699e+04 on 1 and 505 DF, p-value: < 2.2e-16
> forward13<-lm(Medv~age-1)
> summary(forward13)
Call:
lm(formula = Medv ~ age - 1)
Residuals:
Min 1Q Median 3Q Max
-2.0839 -0.5927 0.3357 1.5142 3.4463
Coefficients:
Estimate Std. Error t value Pr(>|t|)
age 0.0369330 0.0008236 44.84 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 505 degrees of freedom
Multiple R-squared: 0.7993, Adjusted R-squared: 0.7989
F-statistic: 2011 on 1 and 505 DF, p-value: < 2.2e-16
> forward14<-lm(Medv~Dis-1)
> summary(forward14)
Call:
lm(formula = Medv ~ Dis - 1)
Residuals:
Min 1Q Median 3Q Max
-2.2240 -0.4238 0.6628 1.2085 3.6475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Dis 2.17068 0.03972 54.66 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.165 on 505 degrees of freedom
Multiple R-squared: 0.8554, Adjusted R-squared: 0.8551
F-statistic: 2987 on 1 and 505 DF, p-value: < 2.2e-16
> forward21<-lm(Medv~Rm)
> summary(forward21)
Call:
lm(formula = Medv ~ Rm)
Residuals:
Min 1Q Median 3Q Max
-1.20269 -0.10530 0.06992 0.17255 1.31948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.878478 0.063036 29.8 <2e-16 ***
Rm 0.028909 0.001537 18.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3137 on 504 degrees of freedom
Multiple R-squared: 0.4123, Adjusted R-squared: 0.4112
F-statistic: 353.6 on 1 and 504 DF, p-value: < 2.2e-16
> forward22<-lm(Medv~age)
> summary(forward22)
Call:
lm(formula = Medv ~ age)
Residuals:
Min 1Q Median 3Q Max
-1.21816 -0.20280 -0.01733 0.16722 1.08442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4860274 0.0427295 81.58 <2e-16 ***
age -0.0065843 0.0005765 -11.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3647 on 504 degrees of freedom
Multiple R-squared: 0.2056, Adjusted R-squared: 0.204
F-statistic: 130.4 on 1 and 504 DF, p-value: < 2.2e-16
> forward23<-lm(Medv~Dis)
> summary(forward23)
Call:
lm(formula = Medv ~ Dis)
Residuals:
Min 1Q Median 3Q Max
-1.18240 -0.21227 -0.02365 0.16558 1.20522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.66935 0.04024 66.338 <2e-16 ***
Dis 0.30737 0.03084 9.965 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.374 on 504 degrees of freedom
Multiple R-squared: 0.1646, Adjusted R-squared: 0.163
F-statistic: 99.31 on 1 and 504 DF, p-value: < 2.2e-16
> forward31<-lm(Medv~Rm+Dis)
> summary(forward31)
Call:
lm(formula = Medv ~ Rm + Dis)
Residuals:
Min 1Q Median 3Q Max
-1.05461 -0.12689 0.03383 0.16131 1.46235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746011 0.061332 28.468 < 2e-16 ***
Rm 0.026088 0.001484 17.585 < 2e-16 ***
Dis 0.206437 0.024965 8.269 1.21e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2946 on 503 degrees of freedom
Multiple R-squared: 0.4827, Adjusted R-squared: 0.4806
F-statistic: 234.6 on 2 and 503 DF, p-value: < 2.2e-16
> forward32<-lm(Medv~Rm+age)
> summary(forward32)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
Min 1Q Median 3Q Max
-1.0789 -0.1094 0.0335 0.1300 1.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3346303 0.0726764 32.12 <2e-16 ***
Rm 0.0256312 0.0014361 17.85 <2e-16 ***
age -0.0047407 0.0004632 -10.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16
> forward41<-lm(Medv~Rm+Dis+age)
> summary(forward41)
Call:
lm(formula = Medv ~ Rm + Dis + age)
Residuals:
Min 1Q Median 3Q Max
-1.06502 -0.11534 0.02519 0.13058 1.43388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2520854 0.1061091 21.224 < 2e-16 ***
Rm 0.0254895 0.0014420 17.676 < 2e-16 ***
Dis 0.0402145 0.0376701 1.068 0.286
age -0.0041510 0.0007209 -5.758 1.48e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16
> forward<-lm(Medv~Rm+age)
> summary(forward)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
Min 1Q Median 3Q Max
-1.0789 -0.1094 0.0335 0.1300 1.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3346303 0.0726764 32.12 <2e-16 ***
Rm 0.0256312 0.0014361 17.85 <2e-16 ***
age -0.0047407 0.0004632 -10.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16
# Question 5:
> backward11<-lm(Medv ~Rm+ age+ Dis)
> summary(backward11)
Call:
lm(formula = Medv ~ Rm + age + Dis)
Residuals:
Min 1Q Median 3Q Max
-1.06502 -0.11534 0.02519 0.13058 1.43388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2520854 0.1061091 21.224 < 2e-16 ***
Rm 0.0254895 0.0014420 17.676 < 2e-16 ***
age -0.0041510 0.0007209 -5.758 1.48e-08 ***
Dis 0.0402145 0.0376701 1.068 0.286
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16
> backward21<-lm(Medv ~Rm+ age)
> summary(backward21)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
Min 1Q Median 3Q Max
-1.0789 -0.1094 0.0335 0.1300 1.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3346303 0.0726764 32.12 <2e-16 ***
Rm 0.0256312 0.0014361 17.85 <2e-16 ***
age -0.0047407 0.0004632 -10.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16
# Question 6:
> data("BostonHousing",package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
> n <- 506
> m1<-lm(logmedv~1)
> m2<-lm(logmedv~rmsq-1)
> m3<-lm(logmedv~age-1)
> m4<-lm(logmedv~logdis-1)
> m5<-lm(logmedv~rmsq+logdis-1)
> m6<-lm(logmedv~logdis+age-1)
> m7<-lm(logmedv~rmsq+age-1)
> m8<-lm(logmedv~age)
> m9<-lm(logmedv~rmsq)
> m10<-lm(logmedv~logdis)
> m11<-lm(logmedv~rmsq+logdis)
> m12<-lm(logmedv~rmsq+age)
> m13<-lm(logmedv~logdis+age)
> m14<-lm(logmedv~rmsq+logdis+age-1)
> m15<-lm(logmedv~rmsq+logdis+age)
> m16<-lm(logmedv~-1)
>
> AIC1=n*log(sum(m1$residuals^2)/n)+2*1
> AIC2=n*log(sum(m2$residuals^2)/n)+2*1
> AIC3=n*log(sum(m3$residuals^2)/n)+2*1
> AIC4=n*log(sum(m4$residuals^2)/n)+2*1
> AIC5=n*log(sum(m5$residuals^2)/n)+2*2
> AIC6=n*log(sum(m6$residuals^2)/n)+2*2
> AIC7=n*log(sum(m7$residuals^2)/n)+2*2
> AIC8=n*log(sum(m8$residuals^2)/n)+2*2
> AIC9=n*log(sum(m9$residuals^2)/n)+2*2
> AIC10=n*log(sum(m10$residuals^2)/n)+2*2
> AIC11=n*log(sum(m11$residuals^2)/n)+2*3
> AIC12=n*log(sum(m12$residuals^2)/n)+2*3
> AIC13=n*log(sum(m13$residuals^2)/n)+2*3
> AIC14=n*log(sum(m14$residuals^2)/n)+2*3
> AIC15=n*log(sum(m15$residuals^2)/n)+2*4
> AIC16=n*log(sum(m16$residuals^2)/n)+2*0
>
>
> BIC1=n*log(sum(m1$residuals^2)/n)+log(n)*1
> BIC2=n*log(sum(m2$residuals^2)/n)+log(n)*1
> BIC3=n*log(sum(m3$residuals^2)/n)+log(n)*1
> BIC4=n*log(sum(m4$residuals^2)/n)+log(n)*1
> BIC5=n*log(sum(m5$residuals^2)/n)+log(n)*2
> BIC6=n*log(sum(m6$residuals^2)/n)+log(n)*2
> BIC7=n*log(sum(m7$residuals^2)/n)+log(n)*2
> BIC8=n*log(sum(m8$residuals^2)/n)+log(n)*2
> BIC9=n*log(sum(m9$residuals^2)/n)+log(n)*2
> BIC10=n*log(sum(m10$residuals^2)/n)+log(n)*2
> BIC11=n*log(sum(m11$residuals^2)/n)+log(n)*3
> BIC12=n*log(sum(m12$residuals^2)/n)+log(n)*3
> BIC13=n*log(sum(m13$residuals^2)/n)+log(n)*3
> BIC14=n*log(sum(m14$residuals^2)/n)+log(n)*3
> BIC15=n*log(sum(m15$residuals^2)/n)+log(n)*4
> BIC16=n*log(sum(m16$residuals^2)/n)+log(n)*0
> min(AIC1,AIC2,AIC3,AIC4,AIC5,AIC6,AIC7,AIC8,AIC9,AIC10,AIC11,AIC12,AIC13,AIC14,AIC15,AIC16)
[1] -1265.073
> AIC12
[1] -1265.073
> min(BIC1,BIC2,BIC3,BIC4,BIC5,BIC6,BIC7,BIC8,BIC9,BIC10,BIC11,BIC12,BIC13,BIC14,BIC15,BIC16)
[1] -1252.394
> BIC12
[1] -1252.394
# Question 7:
> data("BostonHousing",package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
>
> Group1 <- subset(BostonHousing,BostonHousing$zn!=55.0)
> Group2 <- subset(BostonHousing,BostonHousing$zn==55.0)
> fitmodel <- lm(logmedv~rmsq+age,data = Group1)
> summary(fitmodel)
Call:
lm(formula = logmedv ~ rmsq + age, data = Group1)
Residuals:
Min 1Q Median 3Q Max
-1.07887 -0.10964 0.03389 0.13020 1.41838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3360096 0.0729281 32.03 <2e-16 ***
rmsq 0.0256286 0.0014406 17.79 <2e-16 ***
age -0.0047542 0.0004661 -10.20 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2864 on 500 degrees of freedom
Multiple R-squared: 0.5129, Adjusted R-squared: 0.5109
F-statistic: 263.2 on 2 and 500 DF, p-value: < 2.2e-16
> p <- predict(fitmodel,newdata=Group2)
> SSPE <- sum((Group2$logmedv-p)^2)
> SSPE
[1] 0.02835043

Mais conteúdo relacionado

Destaque

Masters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copyMasters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copyAlemu Tadesse
 
ramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelanramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelanYograj Tiwari
 
COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016Earle Olson
 
CV- Allam Abuhuson updated
CV- Allam Abuhuson updatedCV- Allam Abuhuson updated
CV- Allam Abuhuson updatedAllam Hasan
 
E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)Emily Karlichek
 
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajoRevolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajoAlfredo Vela Zancada
 
La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015Alfredo Vela Zancada
 
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIAMEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIAALEXANDRE PANTOJA
 
Conegliano capitale del prosecco superiore
Conegliano capitale del prosecco superioreConegliano capitale del prosecco superiore
Conegliano capitale del prosecco superioreMichael Mazzer
 
웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrap웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrapYu Yongwoo
 
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드Yu Yongwoo
 

Destaque (16)

Masters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copyMasters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copy
 
MONGOL EMPIRE
MONGOL EMPIRE MONGOL EMPIRE
MONGOL EMPIRE
 
Image076.jpg
Image076.jpgImage076.jpg
Image076.jpg
 
ramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelanramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelan
 
Ineternet
IneternetIneternet
Ineternet
 
COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016
 
CV- Allam Abuhuson updated
CV- Allam Abuhuson updatedCV- Allam Abuhuson updated
CV- Allam Abuhuson updated
 
hawaii123
hawaii123hawaii123
hawaii123
 
Gsdfgsdgs
GsdfgsdgsGsdfgsdgs
Gsdfgsdgs
 
E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)
 
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajoRevolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
 
La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015
 
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIAMEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
 
Conegliano capitale del prosecco superiore
Conegliano capitale del prosecco superioreConegliano capitale del prosecco superiore
Conegliano capitale del prosecco superiore
 
웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrap웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrap
 
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
 

Semelhante a Analysis of Boston Housing Data Models

Evaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayEvaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayCrystal Alvarez
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricinginventionjournals
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricinginventionjournals
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptxmesfin69
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technologypeertechzpublication
 
REGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREREGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREShriramKargaonkar
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)Jeff Lail
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxhyacinthshackley2629
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxnovabroom
 
Energy efficiency dataset
Energy efficiency datasetEnergy efficiency dataset
Energy efficiency datasetAnkit Ghosalkar
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data scienceNikhil Jaiswal
 

Semelhante a Analysis of Boston Housing Data Models (20)

Evaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayEvaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis Essay
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricing
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricing
 
Correlation
CorrelationCorrelation
Correlation
 
Ch14 multiple regression
Ch14 multiple regressionCh14 multiple regression
Ch14 multiple regression
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptx
 
Characteristics and simulation analysis of nonlinear correlation coefficient ...
Characteristics and simulation analysis of nonlinear correlation coefficient ...Characteristics and simulation analysis of nonlinear correlation coefficient ...
Characteristics and simulation analysis of nonlinear correlation coefficient ...
 
Chap04 01
Chap04 01Chap04 01
Chap04 01
 
Math n Statistic
Math n StatisticMath n Statistic
Math n Statistic
 
Regression -Linear.pptx
Regression -Linear.pptxRegression -Linear.pptx
Regression -Linear.pptx
 
Group5
Group5Group5
Group5
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technology
 
200994363
200994363200994363
200994363
 
REGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREREGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HERE
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
 
Energy efficiency dataset
Energy efficiency datasetEnergy efficiency dataset
Energy efficiency dataset
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data science
 

Analysis of Boston Housing Data Models

  • 1. 1 ysstats@bu.edu U37074009 Analysis of the Boston Housing Data from the 1970 census: Diverse Tests and Model Selection Processes regarding the Variables in Boston Housing Data Shuai Yuan1 December 8, 2016 Abstract In this project, we study the Boston Housing Data that was offered by Harrison and Rubinfeld, 1978. The data contained many different variables that related to Boston Housing for 506 tracts of Boston from the 1970 census. The data is included in the R package mlbench. Using the data and R software, we first study the scatterplot matrix and the correlation of different variables to find their relations briefly. Then, we make various tests for many null hypotheses to examine the properties of different models. Finally, we perform the model selections by using different methods such as the forward algorithm, backward algorithm as well as the AIC and BIC criterion to find and analyze the most fitted model for our data sets. And the same time, we also compute the SSPE for our subset of data.
  • 2. 1 Contents 1 Introduction 2 2 Analysis 3 2.1 Analysis of the linearity between variables 3 2.1.1 Scatterplot matrix for variables 3 2.1.2 Explanation of Correlation between two variables 4 2.2 The statistical tests for the Null Hypotheses of the fitted model 6 2.3 Model selection by using the forward algorithm 9 2.4 Model selection by using the backward algorithm 11 2.5 Model selection by using the AIC and BIC criterion 12 2.6 Analysis of the related statistics 14 2.6.1 Fit the model by using the subset of the data 14 2.6.2 Compute and analyze the SSPE for subset of the data 14 3 Conclusion 16 4 Appendix 18
  • 3. 2 1 Introduction The data of the Boston Housing from the 1970 census are used in this project. The dataset contains 14 variables with 506 observations. The data is included in the R package mlbench. In this project, we used various tools to analyze the Boston Housing data and the most frequently used method is the linear regression. At the same time, we also used hypothesis testing, t-test, F- test as well as model selection as our methods to analyze the properties of the related data. Using the data and R software, we first study the scatterplot matrix and the correlation of different variables to find their relations briefly. Then, we make various tests for many null hypotheses to examine the properties of different models. Finally, we perform the model selections by using different methods such as the forward algorithm, backward algorithm as well as the AIC and BIC criterion to find and analyze the most fitted model for our data sets. And the same time, we also compute the SSPE for our subset of data. The outline for the remainder of the paper is as follows. In Section 2, we provide the main results and analysis towards the multiple aspects of our topics. Section 3 concludes. In the Appendix, we provide our R codes as well as the related outputs. Finally, we also provide the references that we use in this project. To be specific, the part 2.1.1 is for the question 1, part 2.1.2 is for the question 2, part 2.2 is for the question 3, part 2.3 is for the question 4, part 2.4 is for the question 5, part 2.5 is for the question 6, part 2.6 is for the question 7.
  • 4. 3 2 Analysis To get a briefly understanding of the relationships between different variables at the very beginning, we get the scatterplot matrix of these different variables and find the non-linearity between these variables. Therefore, the correlation of these variables may not appropriate for describing the relationships within the variables. At the same time, we also compute different test statistics and test many hypotheses for the general model. Moreover, we also perform variable selection using forward algorithm, backward algorithm, AIC and BIC criterion. We find that both criterions select the same model for us and we explain the reason why the selected model is the one that we need. Finally, we also get the fitted model for subset and compute and compare the SSPE of the selected models. 2.1 Analysis of the linearity between variables 2.1.1 Scatterplot matrix for variables First, according to the description of the R Package “mlbench”, we can get the meaning of the following variables as well as the scatterplot matrix for these four variables which are listed below: 𝒏𝒐𝒙: Nitric oxides concentration (parts per 10 million). 𝒊𝒏𝒅𝒖𝒔: Proportion of residential land zoned for lots over 25,000 sq.ft. 𝒅𝒊𝒔: Weighted distances to five Boston employment centers. 𝒕𝒂𝒙: Full-value property-tax rate per USD 10,000. plot 1 Scatterplot matrix for the variables nox, indus, dis, tax
  • 5. 4 According to the scatterplot matrix, we can find that these four variables are all related in some patterns. For instance, generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable 𝑑𝑖𝑠 and the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠. On the other hand, generally speaking, the relationships between other variables are positive at the low volume level, while the relationships may get vague and non-related at the high volume level. On the other hand, we can also find the possible explanations according to the meanings of these variables. Because the variable 𝑛𝑜𝑥 means the “Nitric oxides concentration (parts per 10 million).”, which also represents the degree of air pollution in this area. For the variable 𝑑𝑖𝑠, it means the “Weighted distances to five Boston employment centers.”, which also represents the degree of living away from the downtown. And for the variable 𝑖𝑛𝑑𝑢𝑠 , it means the “Proportion of residential land zoned for lots over 25,000 sq.ft.”, which also represents the level of economy of residents. Because only if when people own enough money, will they use their money to build their own parking lots which are also quite wide. Therefore, we can find the possible explanations for these relationships. As we all know, the air of the area that far away from the downtown is better because there are more trees and therefore, the level of pollution there can be at a low level. So it is reason to see that there is negative relationship between the variable 𝑛𝑜𝑥 and 𝑑𝑖𝑠. On the other hand, the degree of economic development in the areas that far away from the downtown is worse than that of the downtown areas and therefore, the proportion of residential land zoned for large lots is smaller than that of the downtown areas. So it is reason to see that there is negative relationship between the variable 𝑖𝑛𝑑𝑢𝑠 and 𝑑𝑖𝑠. 2.1.2 Explanation of Correlation between two variables We know that the formula of correlation coefficient between two variables is that: ρ23 = Cov(X, Y) D(X) D(Y) Therefore, according to the R codes, we can find that the correlation between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 is about -0.7692301, which may give us the information that these two variables are negatively correlated.
  • 6. 5 However, the thing we should not forget is that the correlation coefficient between two variables is used for examining the relationship for linear regression model, or in other words, the linear relationship between two variables. But we can find from the scatterplot that the relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 is more likely as exponential relationship, which means that there is not reasonable to use the correlation coefficient between these two variables to examine the relationship between them. On the other hand, we can also test their relationship of them by getting the model between them. From the model, we assume that there is an exponential relation between them and we get the significantly p-value for this model. Therefore, according to the discussion above, we can safely draw the conclusion that we cannot use the correlation between these two variables to quantify the strength of relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠.
  • 7. 6 2.2 The statistical tests for the Null Hypotheses of the fitted model For this question, the full model given only contains five variables and intercept. 𝛽?means the intercept, 𝛽@ measures the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑑𝑖𝑠 increased, 𝛽A measures the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑙𝑜𝑔(𝑑𝑖𝑠) increases, 𝛽D measure the change of the variable 𝑛𝑜𝑥 if one unit of variable 𝑑𝑖𝑠^2 increases, 𝛽H measure the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑖𝑛𝑑𝑢𝑠 increases, 𝛽I measure the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑡𝑎𝑥 increases. Since three of them have already be given in the data set, we just need to transform and add the left two variables which are log(dis) and dis^2. So we create the new variables whose names are log(dis) and dissquare and will use them to refer the variable log(dis) and the variable dis^2. For this section, since we want to decide whether or not specified parameters are equal to 0 or each other, we will do the F-test for all of three sub-questions. At the beginning, we have the formula for F-test as below: 𝐹 = ( 𝑅𝑆𝑆YZ − 𝑅𝑆𝑆Z 𝑑𝑓YZ − 𝑑𝑓Z )/( 𝑅𝑆𝑆Z 𝑑𝑓Z ) The table for the summary of F-test value and corresponding p-value for the question are summarized below: question a question b question c F-test(value) 5.911 6.0524 42.80353 p-value 0.0154 0.002528 0.0001 Table 1 The F-test(value) and p-value of question a, b, c We will use it to evaluate the question. Question a: According to the definition of the null hypothesis test, the main idea is to test whether the coefficient of the variable log(dis) is equal to 0 or not. Since the variable log(dis) is the only target we want to focus here, we can just build a new regression model which does not contain the variable log(dis) to compare with the original regression model. When we compare two regression models, we will do the F-test to see if they are significantly different with each other. Form the
From the results of the R code, the F statistic is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis depends on the significance level α. Here we set α = 0.05; since the p-value is smaller than 0.05, we reject the null hypothesis and conclude that β₂ is not equal to 0 at the 95% confidence level. However, if we want to be 99% confident about the result, α becomes 0.01, and since the p-value is larger than 0.01, we cannot reject the null hypothesis at the 99% confidence level.

Question b: For part b, we want to test whether the coefficients of the variable dis and the variable dis² are both equal to 0. Since the question focuses on two terms at once, we proceed as in part a: we fit another regression model that contains the intercept and the three remaining terms, excluding dis and dis², and compare it with the original full model using an F-test. Here the null hypothesis is β₁ = β₃ = 0, and the alternative hypothesis is that at least one of them is not equal to 0. The F statistic is 6.0524 and the corresponding p-value is 0.002528. Setting α = 0.05 again, the p-value is smaller than 0.05, so we reject the null hypothesis and conclude that at least one of β₁ and β₃ is not equal to 0.

Question c: The situation in part c is different, since the question asks whether β₂ = β₃ = 0 and, at the same time, whether β₄ = β₅. Instead of the nested-model approach used above, we express the hypothesis in matrix form. We split the first condition (β₂ = β₃ = 0) into β₂ = 0 and β₃ = 0. The first row of the matrix A therefore has a single "1" in the position of β₂ and "0" everywhere else, and the second row has a single "1" in the position of β₃ and "0" everywhere else, so the first two entries of A times the coefficient vector are exactly β₂ and β₃. To test whether β₄ = β₅, the third row of A has a "1" in the position of β₄ and a "-1" in the position of β₅, so its entry in the product is β₄ − β₅. We then test whether all three entries equal 0 with an F-test. The F statistic is 42.80353 and the corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than α, so we reject
the null hypothesis and conclude that at least one of β₂ and β₃ is not equal to 0, or that β₄ is not equal to β₅.
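The Appendix computes this statistic through an eigendecomposition of A·V·Aᵀ; an equivalent sketch (our illustration), using the standard general-linear-hypothesis form F = (Ab)ᵀ (A V Aᵀ)⁻¹ (Ab) / q with q = 3 constraints, is:

# general linear hypothesis F statistic, assuming the logdis and dissquare
# columns were created as above
A <- matrix(c(0, 0, 1, 0, 0,  0,
              0, 0, 0, 1, 0,  0,
              0, 0, 0, 0, 1, -1), nrow = 3, byrow = TRUE)
fit <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)
b  <- coef(fit)     # order: intercept, dis, logdis, dissquare, indus, tax
V  <- vcov(fit)
Ab <- A %*% b
Fstat <- drop(t(Ab) %*% solve(A %*% V %*% t(A)) %*% Ab) / nrow(A)
Fstat                                                      # about 42.80353
pf(Fstat, nrow(A), df.residual(fit), lower.tail = FALSE)   # well below 0.0001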
2.3 Model selection by using the forward algorithm

In this section, we use a forward algorithm to analyze the relationship between the response variable and the potential explanatory variables below. According to the question's requirements, we transformed the original variables as follows.

Response variable:
log(medv), the natural logarithm of the median value of owner-occupied homes in $1000's.

Potential explanatory variables:
rm², the square of the average number of rooms per dwelling.
log(dis), the natural logarithm of the weighted distances to five Boston employment centers.
age, the proportion of owner-occupied units built prior to 1940.

We performed variable selection using a forward algorithm with a significance level of 5%. In the first step, we fitted a separate model for each candidate term. We named these models "forward11" to "forward14"; the details can be found in the Appendix. The results of the regressions are summarized as follows:

name        model                      variable    t value    Pr(>|t|)
forward11   log(medv) ~ 1              intercept   167        <2e-16
forward12   log(medv) ~ rm^2 - 1       rm^2        130        <2e-16
forward13   log(medv) ~ age - 1        age         44.84      <2e-16
forward14   log(medv) ~ log(dis) - 1   log(dis)    54.66      <2e-16

Table 2  Summary of the models forward11 to forward14

We can observe from the table that, while all the terms are significant, the intercept has the largest t-value, so we added the intercept to the model first. Next, we regressed the response on the intercept together with each of the remaining three variables in the models "forward21" to "forward23". The summarized results are shown in the table below:
name        model                  variable    t value    Pr(>|t|)
forward21   log(medv) ~ rm^2       rm^2        18.8       <2e-16
forward22   log(medv) ~ age        age         -11.42     <2e-16
forward23   log(medv) ~ log(dis)   log(dis)    9.965      <2e-16

Table 3  Summary of the models forward21 to forward23

As shown in the table, the p-values of all the variables are significant, but the variable rm^2 has the largest absolute t-value, so we added rm^2 to the model. Then we added each of the two remaining variables, log(dis) and age, to the model containing the intercept and rm^2, giving the models "forward31" and "forward32":

name        model                         variable    t value    Pr(>|t|)
forward31   log(medv) ~ rm^2 + log(dis)   log(dis)    8.269      1.21e-15
forward32   log(medv) ~ rm^2 + age        age         -10.23     <2e-16

Table 4  Summary of the models forward31 and forward32

From the results above, both variables are significant, but the variable age has the larger absolute t-value (and the smaller p-value) of the two, so we added the variable age to our model. Finally, we regressed the response variable log(medv) on all of the variables in the model "forward41":

name        model                               variable    t value    Pr(>|t|)
forward41   log(medv) ~ rm^2 + age + log(dis)   log(dis)    1.068      0.286

Table 5  Summary of the model forward41

Based on the table above, the variable log(dis) is not significant in this model, so we removed it. Therefore, after the forward selection, our final model is

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term.
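As a cross-check of this manual search, R's built-in step() function can run a forward selection automatically. The sketch below is our illustration; note that step() selects by AIC rather than by individual p-values, so it is a complementary check rather than the same rule:

# forward selection by AIC with step(); the working variables mirror the
# Appendix code for Question 4
Medv <- log(BostonHousing$medv)
Rm   <- BostonHousing$rm^2
Dis  <- log(BostonHousing$dis)
age  <- BostonHousing$age
step(lm(Medv ~ 1), scope = ~ Rm + age + Dis,
     direction = "forward")   # also stops at Medv ~ Rm + age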
2.4 Model selection by using the backward algorithm

In this section, we use a backward algorithm to analyze the relationship between the response variable and the potential explanatory variables, using the same transformed variables defined in the previous section. We performed variable selection with a significance level of 5%. For the backward algorithm, we first regressed the response on all the variables at once. We named this model "backward11"; the details can be found in the Appendix. The results are summarized as follows:

name         model                               variable    t value    Pr(>|t|)
backward11   log(medv) ~ rm^2 + age + log(dis)   intercept   21.224     <2e-16
                                                 rm^2        17.676     <2e-16
                                                 age         -5.758     1.48e-08
                                                 log(dis)    1.068      0.286

Table 6  Summary of the model backward11

Based on these results, all the explanatory variables are significant except the variable log(dis), whose t-value is 1.068 and p-value is 0.286. We therefore removed the variable log(dis) and fitted a new model, "backward21", with the remaining variables:

name         model                    variable    t value    Pr(>|t|)
backward21   log(medv) ~ rm^2 + age   intercept   32.12      <2e-16
                                      rm^2        17.85      <2e-16
                                      age         -10.23     <2e-16

Table 7  Summary of the model backward21

After deleting the variable log(dis), the remaining variables are all significant, so we stopped with the model "backward21". This is the same model obtained by the forward algorithm:

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term.
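A compact sketch of the same backward pass, using drop1() and the working variables defined in the previous sketch, reports the F-test for removing each term from the current model:

# backward pass with drop1(); for a single-df term the F-test equals the
# square of its t-test, so the p-values match Tables 6 and 7
fit_full <- lm(Medv ~ Rm + age + Dis)
drop1(fit_full, test = "F")        # only Dis is not significant (p = 0.286)
fit_reduced <- update(fit_full, . ~ . - Dis)
drop1(fit_reduced, test = "F")     # both remaining terms are significant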
2.5 Model selection by using the AIC and BIC criterion

First, we can do a preliminary analysis of the full model we are interested in. In the full linear regression model, the t-values and p-values indicate whether each term is individually significant. Setting α = 0.05, the terms rm², age, and the intercept all have p-values well below 0.05, so they are significant, while the variable log(dis) has a p-value of 0.286 and is not significant.

In this section, we perform variable selection using the AIC and BIC criteria. AIC is a measure of the relative quality of statistical models for a given set of data: given a collection of candidate models, AIC estimates the quality of each model relative to the others, and thus provides a means for model selection. BIC is likewise a criterion for selecting among a finite set of models, and the model with the lowest BIC score is preferred. The formulas for AIC and BIC are

    AIC(m) = n · log( RSS(m) / n ) + 2 · p(m),
    BIC(m) = n · log( RSS(m) / n ) + log(n) · p(m),

where m is the regression model, n is the sample size, RSS(m) is the residual sum of squares of m, and p(m) denotes the number of estimated coefficients in m. In this project the sample size is 506, and all we need to do is compute the corresponding AIC and BIC scores for every candidate regression model in R. The candidate models are summarized below:
Candidate model                          AIC score    BIC score
log(medv) ~ 1                            -904.371     -900.145
log(medv) ~ rm^2 - 1                     -659.289     -655.063
log(medv) ~ age - 1                      321.927      326.154
log(medv) ~ log(dis) - 1                 155.969      160.195
log(medv) ~ rm^2 + log(dis) - 1          -750.189     -741.736
log(medv) ~ log(dis) + age - 1           -533.471     -525.018
log(medv) ~ rm^2 + age - 1               -702.556     -694.102
log(medv) ~ age                          -1018.83     -1010.378
log(medv) ~ rm^2                         -1171.36     -1162.907
log(medv) ~ log(dis)                     -993.379     -984.926
log(medv) ~ rm^2 + log(dis)              -1233.86     -1221.175
log(medv) ~ rm^2 + age                   -1265.07     -1252.394
log(medv) ~ log(dis) + age               -1021.36     -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1    -940.149     -929.47
log(medv) ~ rm^2 + log(dis) + age        -1264.22     -1247.315
log(medv) ~ -1                           1132.453     1132.453

Table 8  The AIC and BIC scores of all candidate models

From the table above, the regression model with the smallest AIC score contains the variables rm² and age as well as the intercept, and the model with the smallest BIC score is the same one. Checking this regression model, it contains only the variable rm² and the variable age together with the intercept, and with α = 0.05 all of its terms are significant. We therefore select the model containing rm², age, and the intercept under both the AIC and the BIC criterion.
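The scores in Table 8 can be reproduced with a small helper implementing the two formulas directly; as a check, base R's extractAIC() uses the same n·log(RSS/n) + k·p form, with k = 2 for AIC and k = log(n) for BIC. A minimal sketch:

# AIC/BIC scores from the formulas above, checked against extractAIC()
n <- nrow(BostonHousing)   # 506
score <- function(fit, k) {
  rss <- sum(residuals(fit)^2)
  p   <- length(coef(fit))          # number of estimated coefficients
  n * log(rss / n) + k * p
}
m12 <- lm(log(medv) ~ I(rm^2) + age, data = BostonHousing)
score(m12, 2)                       # AIC, about -1265.07
score(m12, log(n))                  # BIC, about -1252.39
extractAIC(m12, k = log(n))[2]      # same BIC value from base R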
2.6 Analysis of the related statistics

2.6.1 Fit the model by using the subset of the data

According to the results above, we finally choose the model "m12", which has the minimum BIC, as our fitted model. From question 6, the fitted model can be written as

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term. We now use the data from Group1 to fit this model. From the results generated by R, the fitted model is

    log(medv) = 2.3360 + 0.0256 · rm² − 0.0048 · age,

and the p-values of all the explanatory variables are highly significant.

2.6.2 Compute and analyze the SSPE for the subset of the data

We can also apply another method, cross-validation, to further examine the model selection process. This method proceeds as follows. First, we split the data into two subsets according to a user-defined criterion (here, the tracts with zn = 55.0 form the second subset): Group1 and Group2, also called the training data and the validation data. Second, we fit the model using the data from Group1. Third, using the explanatory variables in Group2, we predict the response for each observation; we denote the observed values by yᵢ = log(medv)ᵢ and the corresponding predicted values by ŷᵢ. Finally, we compute the SSPE, the "Sum of Squared Prediction Errors".

Therefore, according to the question, we first divided the original data set "BostonHousing" into Group1 and Group2, and then computed the SSPE over Group2 according to its definition:

    SSPE = Σᵢ₌₁ᵏ ( yᵢ − ŷᵢ )²,

where the sum runs over the k observations in Group2, yᵢ denotes the observed response in Group2, and ŷᵢ denotes the corresponding predicted value computed by the prediction function in R. The resulting SSPE for Group2 is 0.02835043.
We can also note that the model obtained in Section 2.4 (question 5),

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term, is the same as the model obtained in Section 2.5 (question 6). Therefore, the different selection procedures agree on the same final model.
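Condensing the Appendix code for question 7, a minimal sketch of the whole cross-validation step is:

# train/validate split and SSPE, mirroring the Appendix code for Question 7
BostonHousing <- transform(BostonHousing, logmedv = log(medv), rmsq = rm^2)
Group1 <- subset(BostonHousing, zn != 55.0)   # training data
Group2 <- subset(BostonHousing, zn == 55.0)   # validation data
fit  <- lm(logmedv ~ rmsq + age, data = Group1)
pred <- predict(fit, newdata = Group2)
sum((Group2$logmedv - pred)^2)                # SSPE, about 0.02835043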
3 Conclusion

In this project, we first obtained the scatterplot matrix of four variables: nox, indus, dis, and tax. According to the scatterplot matrix, these four variables are all related in some pattern. Generally speaking, the variable nox is negatively related to the variable dis, and the variable indus is also negatively related to the variable dis; the relationships among the other pairs tend to be positive at low values and become vague at high values. Possible explanations follow from the meanings of these variables. We also found non-linearity between the variable nox and the variable dis; therefore, we cannot use the correlation between these two variables to quantify the strength of the relationship between nox and dis.

Second, we made several tests of null hypotheses about the fitted model. Using F-tests and the associated p-values, we found that the p-values for the null hypotheses β₂ = 0; β₁ = β₃ = 0; and the joint hypothesis β₂ = β₃ = 0 together with β₄ = β₅ are all smaller than 0.05, so we reject all of these null hypotheses at the 5% significance level.

Third, we used the forward algorithm to find the best model for the regression problem. We added variables to the model one at a time, using the p-values to test whether each variable is significant, and found that the final model includes the variable rm², the variable age, and the intercept. We also used the backward algorithm for model selection: we first fitted the model containing all the variables, and then removed the insignificant variables one by one according to their p-values. The model found by the backward algorithm is the same as the one found by the forward algorithm.
At the same time, we used both the AIC and the BIC criterion for model selection. We found that the regression model with the smallest AIC score contains the variables rm² and age as well as the intercept, and the model with the smallest BIC score is the same one; with α = 0.05, all the variables in this model are significant. So we select the model containing rm², age, and the intercept under both the AIC and the BIC criterion.

Finally, we applied cross-validation to further examine the model selection process, and computed the sum of squared prediction errors (SSPE) over Group2. The model obtained in Section 2.4 (question 5) is the same as the one obtained in Section 2.5 (question 6), so the different selection procedures agree on the same final model.
4 Appendix

The following material is the R code used for this project. Lines beginning with ">" are the commands we entered; the remaining lines are the corresponding R output.

R code:

# Question 1:
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax, main="Scatterplot for nox,indus,dis,tax")

# Question 2:
> cor(nox,dis)
[1] -0.7692301
> # Note: in formula syntax, "/" is the nesting operator, so nox ~ 1/dis does
> # not fit the reciprocal; I() is needed to use 1/dis as a predictor.
> model <- lm(nox ~ I(1/dis))
> summary(model)

# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis + dissquare + indus + tax, BostonHousing)
> anova(u1,u2)
Analysis of Variance Table

Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1    500 1.6897
2    501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(b)
> u3 <- lm(nox ~ logdis + indus + tax, BostonHousing)
> anova(u1,u3)
Analysis of Variance Table

Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1    500 1.6897
2    502 1.7306 -2 -0.040907 6.0524 0.002528 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(c)
> A = matrix(c(0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-1), nrow=3, byrow=TRUE)
> Model <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> variance <- (A %*% vcov(Model) %*% t(A))
> E <- eigen(variance, TRUE)
> Evalues <- E$values
> Evectors <- E$vectors
> sqrtvariance <- Evectors %*% diag(1/sqrt(Evalues)) %*% t(Evectors)
> Z <- sqrtvariance %*% A %*% coef(Model)
> F <- sum(Z^2)/3
> F
[1] 42.80353

# Question 4:
> Medv <- log(medv)
> Rm <- (rm)^2
> Dis <- log(dis)
> forward11 <- lm(Medv~1)
> summary(forward11)

Call:
lm(formula = Medv ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.42507 -0.19983  0.01949  0.18436  0.87751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.03451    0.01817     167   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4088 on 505 degrees of freedom

> forward12 <- lm(Medv~Rm-1)
> summary(forward12)

Call:
lm(formula = Medv ~ Rm - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.5860 -0.1694  0.1560  0.4042  2.3811

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Rm 0.0735845  0.0005646   130.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5208 on 505 degrees of freedom
Multiple R-squared:  0.9711,    Adjusted R-squared:  0.9711
F-statistic: 1.699e+04 on 1 and 505 DF,  p-value: < 2.2e-16

> forward13 <- lm(Medv~age-1)
> summary(forward13)

Call:
lm(formula = Medv ~ age - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.0839 -0.5927  0.3357  1.5142  3.4463

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
age 0.0369330  0.0008236   44.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.373 on 505 degrees of freedom
Multiple R-squared:  0.7993,    Adjusted R-squared:  0.7989
F-statistic: 2011 on 1 and 505 DF,  p-value: < 2.2e-16

> forward14 <- lm(Medv~Dis-1)
> summary(forward14)

Call:
lm(formula = Medv ~ Dis - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2240 -0.4238  0.6628  1.2085  3.6475

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Dis  2.17068    0.03972   54.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.165 on 505 degrees of freedom
Multiple R-squared:  0.8554,    Adjusted R-squared:  0.8551
F-statistic: 2987 on 1 and 505 DF,  p-value: < 2.2e-16

> forward21 <- lm(Medv~Rm)
> summary(forward21)

Call:
lm(formula = Medv ~ Rm)

Residuals:
     Min       1Q   Median       3Q      Max
-1.20269 -0.10530  0.06992  0.17255  1.31948

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.878478   0.063036    29.8   <2e-16 ***
Rm          0.028909   0.001537    18.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3137 on 504 degrees of freedom
Multiple R-squared:  0.4123,    Adjusted R-squared:  0.4112
F-statistic: 353.6 on 1 and 504 DF,  p-value: < 2.2e-16

> forward22 <- lm(Medv~age)
> summary(forward22)

Call:
lm(formula = Medv ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-1.21816 -0.20280 -0.01733  0.16722  1.08442

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.4860274  0.0427295   81.58   <2e-16 ***
age         -0.0065843  0.0005765  -11.42   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3647 on 504 degrees of freedom
Multiple R-squared:  0.2056,    Adjusted R-squared:  0.204
F-statistic: 130.4 on 1 and 504 DF,  p-value: < 2.2e-16

> forward23 <- lm(Medv~Dis)
> summary(forward23)

Call:
lm(formula = Medv ~ Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.18240 -0.21227 -0.02365  0.16558  1.20522

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.66935    0.04024  66.338   <2e-16 ***
Dis          0.30737    0.03084   9.965   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.374 on 504 degrees of freedom
Multiple R-squared:  0.1646,    Adjusted R-squared:  0.163
F-statistic: 99.31 on 1 and 504 DF,  p-value: < 2.2e-16

> forward31 <- lm(Medv~Rm+Dis)
> summary(forward31)

Call:
lm(formula = Medv ~ Rm + Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.05461 -0.12689  0.03383  0.16131  1.46235

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746011   0.061332  28.468  < 2e-16 ***
Rm          0.026088   0.001484  17.585  < 2e-16 ***
Dis         0.206437   0.024965   8.269 1.21e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2946 on 503 degrees of freedom
Multiple R-squared:  0.4827,    Adjusted R-squared:  0.4806
F-statistic: 234.6 on 2 and 503 DF,  p-value: < 2.2e-16

> forward32 <- lm(Medv~Rm+age)
> summary(forward32)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

> forward41 <- lm(Medv~Rm+Dis+age)
> summary(forward41)

Call:
lm(formula = Medv ~ Rm + Dis + age)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
Dis          0.0402145  0.0376701   1.068    0.286
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared:  0.5147,    Adjusted R-squared:  0.5118
F-statistic: 177.5 on 3 and 502 DF,  p-value: < 2.2e-16

> forward <- lm(Medv~Rm+age)
> summary(forward)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

# Question 5:
> backward11 <- lm(Medv ~ Rm + age + Dis)
> summary(backward11)

Call:
lm(formula = Medv ~ Rm + age + Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
Dis          0.0402145  0.0376701   1.068    0.286
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared:  0.5147,    Adjusted R-squared:  0.5118
F-statistic: 177.5 on 3 and 502 DF,  p-value: < 2.2e-16

> backward21 <- lm(Medv ~ Rm + age)
> summary(backward21)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

# Question 6:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
> n <- 506
> m1 <- lm(logmedv~1)
> m2 <- lm(logmedv~rmsq-1)
> m3 <- lm(logmedv~age-1)
> m4 <- lm(logmedv~logdis-1)
> m5 <- lm(logmedv~rmsq+logdis-1)
> m6 <- lm(logmedv~logdis+age-1)
> m7 <- lm(logmedv~rmsq+age-1)
> m8 <- lm(logmedv~age)
> m9 <- lm(logmedv~rmsq)
> m10 <- lm(logmedv~logdis)
> m11 <- lm(logmedv~rmsq+logdis)
> m12 <- lm(logmedv~rmsq+age)
> m13 <- lm(logmedv~logdis+age)
> m14 <- lm(logmedv~rmsq+logdis+age-1)
> m15 <- lm(logmedv~rmsq+logdis+age)
> m16 <- lm(logmedv~-1)

> AIC1=n*log(sum(m1$residuals^2)/n)+2*1
> AIC2=n*log(sum(m2$residuals^2)/n)+2*1
> AIC3=n*log(sum(m3$residuals^2)/n)+2*1
> AIC4=n*log(sum(m4$residuals^2)/n)+2*1
> AIC5=n*log(sum(m5$residuals^2)/n)+2*2
> AIC6=n*log(sum(m6$residuals^2)/n)+2*2
> AIC7=n*log(sum(m7$residuals^2)/n)+2*2
> AIC8=n*log(sum(m8$residuals^2)/n)+2*2
> AIC9=n*log(sum(m9$residuals^2)/n)+2*2
> AIC10=n*log(sum(m10$residuals^2)/n)+2*2
> AIC11=n*log(sum(m11$residuals^2)/n)+2*3
> AIC12=n*log(sum(m12$residuals^2)/n)+2*3
> AIC13=n*log(sum(m13$residuals^2)/n)+2*3
> AIC14=n*log(sum(m14$residuals^2)/n)+2*3
> AIC15=n*log(sum(m15$residuals^2)/n)+2*4
> AIC16=n*log(sum(m16$residuals^2)/n)+2*0

> BIC1=n*log(sum(m1$residuals^2)/n)+log(n)*1
> BIC2=n*log(sum(m2$residuals^2)/n)+log(n)*1
> BIC3=n*log(sum(m3$residuals^2)/n)+log(n)*1
> BIC4=n*log(sum(m4$residuals^2)/n)+log(n)*1
> BIC5=n*log(sum(m5$residuals^2)/n)+log(n)*2
> BIC6=n*log(sum(m6$residuals^2)/n)+log(n)*2
> BIC7=n*log(sum(m7$residuals^2)/n)+log(n)*2
> BIC8=n*log(sum(m8$residuals^2)/n)+log(n)*2
> BIC9=n*log(sum(m9$residuals^2)/n)+log(n)*2
> BIC10=n*log(sum(m10$residuals^2)/n)+log(n)*2
> BIC11=n*log(sum(m11$residuals^2)/n)+log(n)*3
> BIC12=n*log(sum(m12$residuals^2)/n)+log(n)*3
> BIC13=n*log(sum(m13$residuals^2)/n)+log(n)*3
> BIC14=n*log(sum(m14$residuals^2)/n)+log(n)*3
> BIC15=n*log(sum(m15$residuals^2)/n)+log(n)*4
> BIC16=n*log(sum(m16$residuals^2)/n)+log(n)*0

> min(AIC1,AIC2,AIC3,AIC4,AIC5,AIC6,AIC7,AIC8,AIC9,AIC10,AIC11,AIC12,AIC13,AIC14,AIC15,AIC16)
[1] -1265.073
> AIC12
[1] -1265.073
> min(BIC1,BIC2,BIC3,BIC4,BIC5,BIC6,BIC7,BIC8,BIC9,BIC10,BIC11,BIC12,BIC13,BIC14,BIC15,BIC16)
[1] -1252.394
> BIC12
[1] -1252.394

# Question 7:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)

> Group1 <- subset(BostonHousing, BostonHousing$zn != 55.0)
> Group2 <- subset(BostonHousing, BostonHousing$zn == 55.0)
> fitmodel <- lm(logmedv~rmsq+age, data = Group1)
> summary(fitmodel)

Call:
lm(formula = logmedv ~ rmsq + age, data = Group1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.07887 -0.10964  0.03389  0.13020  1.41838

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3360096  0.0729281   32.03   <2e-16 ***
rmsq         0.0256286  0.0014406   17.79   <2e-16 ***
age         -0.0047542  0.0004661  -10.20   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2864 on 500 degrees of freedom
Multiple R-squared:  0.5129,    Adjusted R-squared:  0.5109
F-statistic: 263.2 on 2 and 500 DF,  p-value: < 2.2e-16

> p <- predict(fitmodel, newdata=Group2)
> SSPE <- sum((Group2$logmedv - p)^2)
> SSPE
[1] 0.02835043