Analysis of the Boston Housing Data
from the 1970 census:
Diverse Tests and Model Selection Processes regarding the
Variables in Boston Housing Data
Shuai Yuan (ysstats@bu.edu, U37074009)
December 8, 2016
Abstract
In this project, we study the Boston Housing data presented by Harrison and Rubinfeld (1978). The
data contain a variety of variables related to housing for 506 census tracts of Boston from the 1970
census and are included in the R package mlbench. Using the data and the R software, we first study
the scatterplot matrix and the correlations of different variables to get a brief view of their
relations. Then we test several null hypotheses to examine the properties of different models.
Finally, we perform model selection using several methods, including the forward algorithm, the
backward algorithm, and the AIC and BIC criteria, to find and analyze the best-fitting model for our
data. At the same time, we also compute the SSPE for a subset of the data.
Contents
1 Introduction
2 Analysis
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
2.1.2 Explanation of Correlation between two variables
2.2 The statistical tests for the Null Hypotheses of the fitted model
2.3 Model selection by using the forward algorithm
2.4 Model selection by using the backward algorithm
2.5 Model selection by using the AIC and BIC criterion
2.6 Analysis of the related statistics
2.6.1 Fit the model by using the subset of the data
2.6.2 Compute and analyze the SSPE for subset of the data
3 Conclusion
4 Appendix
1 Introduction
The Boston Housing data from the 1970 census are used in this project. The dataset contains
14 variables with 506 observations and is included in the R package mlbench.
In this project we used various tools to analyze the Boston Housing data; the most frequently
used method is linear regression. We also used hypothesis testing, t-tests, F-tests, and model
selection to analyze the properties of the data. Using the data and the R software, we first study
the scatterplot matrix and the correlations of different variables to get a brief view of their
relations. Then we test several null hypotheses to examine the properties of different models.
Finally, we perform model selection using several methods, including the forward algorithm, the
backward algorithm, and the AIC and BIC criteria, to find and analyze the best-fitting model for
our data. At the same time, we also compute the SSPE for a subset of the data.
The outline for the remainder of the paper is as follows. In Section 2, we provide the main results
and analysis of the multiple aspects of our topics. Section 3 concludes. In the Appendix, we
provide our R code together with the related output, as well as the references used in this project.
To be specific, Section 2.1.1 addresses Question 1, Section 2.1.2 Question 2, Section 2.2
Question 3, Section 2.3 Question 4, Section 2.4 Question 5, Section 2.5 Question 6, and
Section 2.6 Question 7.
2 Analysis
To get a brief understanding of the relationships between the variables at the very beginning, we
obtain the scatterplot matrix of several variables and find non-linearity between them; therefore,
the correlation of these variables may not be appropriate for describing their relationships. At
the same time, we compute different test statistics and test several hypotheses for the general
model. Moreover, we perform variable selection using the forward algorithm, the backward
algorithm, and the AIC and BIC criteria. We find that both criteria select the same model, and we
explain why the selected model is the one we need. Finally, we fit the selected model on a subset
of the data and compute and analyze the SSPE.
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
First, according to the description of the R package “mlbench”, the four variables shown in the
scatterplot matrix have the following meanings:
𝒏𝒐𝒙: Nitric oxides concentration (parts per 10 million).
𝒊𝒏𝒅𝒖𝒔: Proportion of non-retail business acres per town.
𝒅𝒊𝒔: Weighted distances to five Boston employment centers.
𝒕𝒂𝒙: Full-value property-tax rate per USD 10,000.
Plot 1 Scatterplot matrix for the variables nox, indus, dis, tax
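Plot 1 can be reproduced with base R's pairs(); a minimal sketch, assuming the mlbench package is installed (the full transcript is in the Appendix):

data("BostonHousing", package = "mlbench")
pairs(~ nox + indus + dis + tax, data = BostonHousing,
      main = "Scatterplot for nox,indus,dis,tax")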
According to the scatterplot matrix, we can find that these four variables are all related in some
patterns. For instance, generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable
𝑑𝑖𝑠, and the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠. The relationships
between the other pairs of variables appear positive at low values, while they become vague and
weaker at high values.
We can also find possible explanations from the meanings of these variables. The variable 𝑛𝑜𝑥 is
the nitric oxides concentration (parts per 10 million), which represents the degree of air
pollution in an area. The variable 𝑑𝑖𝑠 is the weighted distance to five Boston employment centers,
which represents how far an area lies from downtown. The variable 𝑖𝑛𝑑𝑢𝑠 is the proportion of
non-retail business acres per town, which represents how industrial an area is. As we all know,
the air in areas far away from downtown is better, since there are more trees and less traffic and
industry, so the level of pollution there is lower; it is therefore reasonable to see a negative
relationship between 𝑛𝑜𝑥 and 𝑑𝑖𝑠. On the other hand, business and industry concentrate near the
employment centers, so the proportion of non-retail business acres is smaller in areas far away
from downtown; it is therefore reasonable to see a negative relationship between 𝑖𝑛𝑑𝑢𝑠 and 𝑑𝑖𝑠.
2.1.2 Explanation of Correlation between two variables
We know that the correlation coefficient between two variables X and Y is defined as

ρ(X, Y) = Cov(X, Y) / ( √D(X) · √D(Y) ),

where D(·) denotes the variance.
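The coefficient can be computed directly from this definition in base R; a minimal sketch, in which both lines give the same value:

data("BostonHousing", package = "mlbench")
with(BostonHousing, cov(nox, dis) / (sd(nox) * sd(dis)))  # -0.7692301
cor(BostonHousing$nox, BostonHousing$dis)                 # identical result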
The correlation between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 computed this way is about
-0.7692301, which suggests that these two variables are negatively correlated.
However, we should not forget that the correlation coefficient measures only the strength of the
linear relationship between two variables. From the scatterplot we can see that the relationship
between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 looks closer to an exponentially decaying
relationship, so it is not reasonable to use the correlation coefficient between these two
variables to summarize their relationship.
On the other hand, we can also examine their relationship by fitting a model. Assuming a nonlinear
(inverse) relation between them, as in the code in the Appendix, the fitted model yields a highly
significant p-value. Therefore, according to the discussion above, we can safely draw the
conclusion that the correlation between these two variables should not be used to quantify the
strength of the relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠.
2.2 The statistical tests for the Null Hypotheses of the fitted model
For this question, the given full model contains only five variables and an intercept: β₀ is the
intercept; β₁ measures the change in the variable 𝑛𝑜𝑥 when the variable 𝑑𝑖𝑠 increases by one unit;
β₂ measures the change in 𝑛𝑜𝑥 when 𝑙𝑜𝑔(𝑑𝑖𝑠) increases by one unit; β₃ measures the change in 𝑛𝑜𝑥
when 𝑑𝑖𝑠² increases by one unit; β₄ measures the change in 𝑛𝑜𝑥 when 𝑖𝑛𝑑𝑢𝑠 increases by one unit;
and β₅ measures the change in 𝑛𝑜𝑥 when 𝑡𝑎𝑥 increases by one unit. Since three of these variables
are already given in the data set, we just need to transform and add the remaining two, log(dis)
and dis². We therefore create two new variables, named logdis and dissquare, to represent
log(dis) and dis².
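A minimal sketch of these two transformations, matching the variable names used in the Appendix:

data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
                           logdis    = log(dis),  # the log(dis) term
                           dissquare = dis^2)     # the dis^2 term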
For this section, since we want to decide whether specified parameters are equal to 0 or equal to
each other, we use the F-test for all three sub-questions. The F statistic for comparing a reduced
model with the full model is

F = [ (RSS_reduced − RSS_full) / (df_reduced − df_full) ] / ( RSS_full / df_full ),

where RSS denotes the residual sum of squares and df the residual degrees of freedom of the
corresponding model.
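In R, such a nested-model comparison is carried out by anova(); a minimal sketch for Question a (the full transcript is in the Appendix), assuming logdis and dissquare were created as above:

full    <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)
reduced <- lm(nox ~ dis + dissquare + indus + tax,          data = BostonHousing)
anova(full, reduced)   # F = 5.911, p = 0.0154 for dropping log(dis)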
The F statistics and corresponding p-values for the three sub-questions are summarized below:

              question a   question b   question c
F statistic   5.911        6.0524       42.80353
p-value       0.0154       0.002528     <0.0001

Table 1 The F statistics and p-values of questions a, b and c

We will use these values to evaluate each sub-question.
Question a:
According to the definition of the null hypothesis, the main idea is to test whether the
coefficient of the variable log(dis) is equal to 0. Since the variable log(dis) is the only
target here, we can build a new regression model that does not contain log(dis) and compare it
with the original regression model. To compare the two regression models, we perform an F-test to
see whether they are significantly different from each other. From the results of the R code, the
F-value is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis
depends on the significance level α. Here we set α = 0.05; since the p-value is smaller than 0.05,
we reject the null hypothesis and conclude that β₂ is not equal to 0 at the 95% confidence level.
However, if we want to be 99% confident about the result, α changes to 0.01, and since the p-value
is larger than 0.01, we cannot reject the null hypothesis at the 99% confidence level.
Question b:
For part b, we want to determine whether the coefficients of the variable dis and the variable
dis² are both equal to 0. Since this concerns only two coefficients, we can run a test similar to
part a. For this question, we build another regression model that contains the intercept and the
three remaining variables, excluding dis and dis². Then we compare the new regression model with
the original full model to see whether they are significantly different, again using an F-test.
Here the null hypothesis is β₁ = β₃ = 0, and the alternative hypothesis is that at least one of
them is not equal to 0. The F value is 6.0524 and the corresponding p-value is 0.002528. Similarly,
we set α = 0.05; since the p-value is smaller than 0.05, we reject the null hypothesis and conclude
that at least one of β₁ and β₃ is not equal to 0.
Question c:
The situation for part c is rather different: the question asks whether β₂ = β₃ = 0 and whether
β₄ = β₅. We will not use the approach above, but instead use a matrix formulation. We split the
first condition (β₂ = β₃ = 0) into β₂ = 0 and β₃ = 0. The first row of the matrix A therefore has
a “1” in the position of β₂ and “0” elsewhere, and the second row has a “1” in the position of β₃
and “0” elsewhere; when we multiply A by the coefficient vector, the first two entries of the
product are exactly β₂ and β₃. To test whether β₄ = β₅, the third row of A has a “1” in the
position of β₄ and a “-1” in the position of β₅, so its entry in the product is β₄ − β₅. To test
whether each of these quantities equals 0, we perform an F-test. The F value is 42.80353 and the
corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than
α, so we reject the null hypothesis and conclude that at least one of β₂ and β₃ is not equal to 0,
or β₄ is not equal to β₅.
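An algebraically equivalent sketch of this test uses solve() for the general linear hypothesis F = (Aβ̂)ᵀ(A·V̂·Aᵀ)⁻¹(Aβ̂)/q with q = 3; the Appendix reaches the same value through an eigen-decomposition:

A   <- matrix(c(0,0,1,0,0,0,
                0,0,0,1,0,0,
                0,0,0,0,1,-1), nrow = 3, byrow = TRUE)
fit <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)
Ab  <- A %*% coef(fit)                                   # the three tested quantities
Fst <- drop(t(Ab) %*% solve(A %*% vcov(fit) %*% t(A)) %*% Ab) / 3
Fst                                                      # 42.80353
pf(Fst, 3, df.residual(fit), lower.tail = FALSE)         # p-value < 0.0001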
2.3 Model selection by using the forward algorithm
In this section, we will use the forward selection algorithm to analyze the relationship between
the response variable and the potential explanatory variables below. According to the question's
requirements, we transformed the original variables into the forms presented below.
Response variable:
𝐥𝐨𝐠(𝐦𝐞𝐝𝐯), the natural logarithm of the median value of owner-occupied homes in $1000's.
Potential explanatory variables:
𝐫𝐦^𝟐, the square of the average number of rooms per dwelling.
𝐥𝐨𝐠(𝐝𝐢𝐬), the natural logarithm of the weighted distances to five Boston employment centers.
𝐚𝐠𝐞, the proportion of owner-occupied units built prior to 1940.
We performed variable selection using a forward algorithm with a significance level of 5%. For the
forward algorithm, we first regressed the response on each candidate variable separately. We named
these models “forward11” to “forward14”; the details can be found in the Appendix. The results of
the regressions are summarized below:
name model variable t-value Pr(>|t|)
forward11 log(medv) ~ 1 intercept 167 <2e-16
forward12 log(medv) ~ rm^2 - 1 rm^2 130 <2e-16
forward13 log(medv) ~ age - 1 age 44.84 <2e-16
forward14 log(medv) ~ log(dis) - 1 log(dis) 54.66 <2e-16
Table 2 The summary of different models from forward11 to forward14
We can observe from the table that, while all the variables are significant, the intercept has the
largest t-value. Hence, we chose the intercept for our model. Next, we regressed the response on
the intercept together with each of the remaining three variables, in the models named “forward21”
to “forward23”. The summarized results are shown in the table below:
name model variable t-value Pr(>|t|)
forward21 log(medv) ~ rm^2 rm^2 18.8 <2e-16
forward22 log(medv) ~ age age 11.42 <2e-16
forward23 log(medv) ~ log(dis) log(dis) 9.965 <2e-16
Table 3 The summary of different models from forward21 to forward23
As shown in the table, the p-values of all the variables are significant. However, compared with
the other variables, rm^2 has the largest t-value, so we added rm^2 to the model. Then we tested
rm^2 combined with each of the variables log(dis) and age (together with the intercept), in the
models named “forward31” and “forward32”. We obtained the following table:
name model variable t-value Pr(>|t|)
forward31 log(medv) ~ rm^2 + log(dis) log(dis) 8.269 1.21e-15
forward32 log(medv) ~ rm^2 + age age -10.23 <2e-16
Table 4 The summary of the models forward31 and forward32
From the results above, the p-values of both variables are significant, but the variable age has a
smaller p-value (larger |t|) than the variable log(dis). Therefore, we added the variable age to
our model. Finally, we regressed the response variable log(medv) on all of the variables, in the
model “forward41”.
name model variable t-value Pr(>|t|)
forward41 log(medv) ~ rm^2 + age + log(dis) log(dis) 1.068 0.286
Table 5 The summary of the model forward41
Based on the table above, the variable log(dis) is not significant in the model and thus we
removed it from our model. Therefore, after the forward selection, our final model is

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term.
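A compact alternative sketch of this forward search uses base R's add1() with F tests; it reproduces the same final model here, although the text above proceeds by comparing t-values. Medv, Rm, and Dis are the transformed variables created in the Appendix (Question 4), with BostonHousing attached:

base <- lm(Medv ~ 1)                                # start from the intercept-only model
add1(base, scope = ~ Rm + age + Dis, test = "F")    # Rm has the largest F: add it
step1 <- update(base, . ~ . + Rm)
add1(step1, scope = ~ Rm + age + Dis, test = "F")   # age has the larger F: add it next
step2 <- update(step1, . ~ . + age)
add1(step2, scope = ~ Rm + age + Dis, test = "F")   # Dis is no longer significant: stop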
2.4 Model selection by using the backward algorithm
In this section, we will use the backward elimination algorithm to analyze the relationship
between the response variable and the potential explanatory variables, using the transformed
variables defined in the previous section. We performed variable selection using a backward
algorithm with a significance level of 5%. For the backward algorithm, we first regressed the
response on all of the variables, in the model named “backward11”; the details can be found in the
Appendix. The results of the regression are summarized below:
name model variable t-value Pr(>|t|)
backward11 log(medv) ~ rm^2 + age + log(dis) intercept 21.224 <2e-16
rm^2 17.676 <2e-16
age -5.758 1.48e-08
log(dis) 1.068 0.286
Table 6 The summary of the model backward11
Based on the results, we can find that all the explanatory variables are significant except the
variable log(dis), whose t-value is 1.068 and p-value is 0.286. Thus, we removed the variable
log(dis) and built a new model, called “backward21”, with the remaining variables. Here are the
results:
name model variable t-value Pr(>|t|)
backward21 log(medv) ~ rm^2 + age intercept 32.12 <2e-16
rm^2 17.85 <2e-16
age -10.23 <2e-16
Table 7 The summary of the model backward21
After deleting the variable log(dis) from the model, the remaining variables are all significant,
so we ended up with the model “backward21”. This is the same model obtained by the forward
algorithm:

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term.
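A minimal sketch of this backward pass using base R's drop1() with F tests (same variables as in the previous sketch):

full <- lm(Medv ~ Rm + age + Dis)     # start from the model with all three variables
drop1(full, test = "F")               # Dis has the largest p-value (0.286): drop it
reduced <- update(full, . ~ . - Dis)
drop1(reduced, test = "F")            # all remaining terms are significant: stop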
2.5 Model selection by using the AIC and BIC criterion
First of all, we can do a preliminary analysis of the full model of interest. In the full linear
regression model, the t-values and p-values are used to determine whether each variable is
significant. Setting α = 0.05, we can easily see that rm^2, age, and the intercept all have
p-values below 0.05, which means they are significant. However, the variable log(dis) has a
p-value of 0.286, which is not significant at all.
In this section, we perform variable selection using the AIC and BIC criteria. AIC measures the
relative quality of statistical models for a given set of data: given a collection of models, AIC
estimates the quality of each model relative to the others, and hence provides a means for model
selection. BIC is a criterion for model selection among a finite set of models; the model with the
lowest BIC score is preferred. The formulas for AIC and BIC are shown below:
AIC(m) = n · log( RSS(m) / n ) + 2 · |m|
BIC(m) = n · log( RSS(m) / n ) + log(n) · |m|
where m is the regression model, n is the sample size, RSS(m) is the residual sum of squares of
model m, and |m| denotes the number of estimated coefficients in m. In this project the sample
size is 506, and all we need to do is fit every candidate regression model in R and compute the
corresponding AIC and BIC scores. The candidate models are summarized below:
Candidate Models AIC Score BIC Score
log(medv) ~ 1 -904.371 -900.145
log(medv) ~ rm^2 - 1 -659.289 -655.063
log(medv) ~ age - 1 321.927 326.154
log(medv) ~ log(dis) - 1 155.969 160.195
log(medv) ~ rm^2 + log(dis) - 1 -750.189 -741.736
log(medv) ~ log(dis) + age - 1 -533.471 -525.018
log(medv) ~ rm^2 + age - 1 -702.556 -694.102
log(medv) ~ age -1018.83 -1010.378
log(medv) ~ rm^2 -1171.36 -1162.907
log(medv) ~ log(dis) -993.379 -984.926
log(medv) ~ rm^2 + log(dis) -1233.86 -1221.175
log(medv) ~ rm^2 + age -1265.07 -1252.394
log(medv) ~ log(dis) + age -1021.36 -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1 -940.149 -929.47
log(medv) ~ rm^2 + log(dis) + age -1264.22 -1247.315
log(medv) ~ -1 1132.453 1132.453
Table 8 The AIC and BIC scores of all possible models
From the table above, we can find that the regression model with the smallest AIC score contains
the variables rm^2 and age as well as the intercept. The regression model with the smallest BIC
score is the same model. When we check this model, all of its variables are significant at
α = 0.05. So we select the model containing the variables rm^2, age, and the intercept under both
the AIC and the BIC criteria.
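A minimal sketch of how the scores in the table are computed for the winning candidate m12, following the formulas above (logmedv and rmsq are the transformed variables created in the Appendix, Question 6):

n   <- 506
m12 <- lm(logmedv ~ rmsq + age)                # rm^2 + age + intercept: 3 coefficients
rss <- sum(residuals(m12)^2)
n * log(rss / n) + 2 * 3                       # AIC score: -1265.073
n * log(rss / n) + log(n) * 3                  # BIC score: -1252.394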
2.6 Analysis of the related statistics
2.6.1 Fit the model by using the subset of the data
According to the results above, we finally choose the model “m12”, which has the minimum BIC (and
AIC), as our fitted model. From Question 6, the fitted model can be written as

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term. We can now use the data from Group1 to fit this model. From the R
output, the fitted model is

log(medv) = 2.3360 + 0.0256·rm² − 0.0048·age.

Moreover, the p-values of all the explanatory variables are significant at any conventional level.
2.6.2 Compute and analyze the SSPE for subset of the data
On the other hand, we can also apply another method, called cross-validation, to further examine
the model selection. This method proceeds as follows. First, we split the data into two subsets
according to a user-defined criterion, Group1 and Group2, also called the training data and the
validation data. Second, we fit the model using the data from Group1. Third, based on the data
from Group2, we predict the response variable; we denote the observed values by log(medv)ᵢ and the
predicted values by ŷᵢ. Finally, we compute the SSPE, the “Sum of Squared Prediction Errors”.
Therefore, according to the question, we first divided the original data set “BostonHousing” into
two groups, Group1 and Group2. We then computed the SSPE over Group2 from its definition:

SSPE = Σᵢ ( log(medv)ᵢ − ŷᵢ )²,

where the sum runs over the observations in Group2.
Here log(medv)ᵢ denotes the observed response in Group2 and ŷᵢ the predicted value, computed with
R's predict function. The resulting SSPE for Group2 is 0.02835043.
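A condensed sketch of the whole procedure (the split on zn == 55.0 is the user-defined criterion used in the Appendix, Question 7; logmedv and rmsq are created there):

Group1 <- subset(BostonHousing, zn != 55.0)    # training data
Group2 <- subset(BostonHousing, zn == 55.0)    # validation data
fit  <- lm(logmedv ~ rmsq + age, data = Group1)
pred <- predict(fit, newdata = Group2)
sum((Group2$logmedv - pred)^2)                 # SSPE = 0.02835043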
At the same time, the model we obtained in Section 2.4 (Question 5) is

log(medv) = β₀ + β₁·rm² + β₂·age + ε,

where ε is the error term, and it is the same model obtained in Section 2.5 (Question 6).
Therefore, the two selection approaches give the same results for the same model.
3 Conclusion
In this project, we first obtained the scatterplot matrix of four variables, 𝑛𝑜𝑥, 𝑖𝑛𝑑𝑢𝑠, 𝑑𝑖𝑠 and
𝑡𝑎𝑥. According to the scatterplot matrix, we found that these four variables are all related in
some patterns. Generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable 𝑑𝑖𝑠, and
the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠; the relationships between the
other pairs appear positive at low values and become vague and weaker at high values. We also
found possible explanations from the meanings of these variables. Moreover, we found nonlinearity
between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠. Therefore, we cannot use the correlation between
these two variables to quantify the strength of the relationship between 𝑛𝑜𝑥 and 𝑑𝑖𝑠.
Second, we tested several null hypotheses about the fitted model. Using F-tests and the
corresponding p-values, we found that the p-values for the null hypotheses β₂ = 0; β₁ = β₃ = 0;
and β₂ = β₃ = 0 together with β₄ = β₅ are all smaller than 0.05, which means we reject all of the
null hypotheses at that level.
Third, we used the forward algorithm to find the best model for the regression problem. We first
regressed the response on each candidate variable separately and used the p-values to test whether
each variable is significant in the model; the final model includes the variable rm², the variable
age, and the intercept. At the same time, we also used the backward algorithm for model selection:
we started from the model containing all the variables and removed insignificant variables one by
one according to their p-values. The model found by the backward algorithm is the same as the one
found by the forward algorithm.
At the same time, we also used both the AIC and the BIC criteria for model selection. After doing
so, we found that the regression model with the smallest AIC score contains the variables rm^2
and age as well as the intercept, and the regression model with the smallest BIC score is the same
model. With α = 0.05, all the variables in this model are significant. So we select the model
containing rm^2, age, and the intercept under both the AIC and the BIC criteria.
Finally, we applied cross-validation to further examine the model selection, and we computed the
sum of squared prediction errors (SSPE) over Group2. The model obtained in Section 2.4
(Question 5) is the same as the one obtained in Section 2.5 (Question 6), so the two approaches
give the same results for the same model.
4 Appendix
The following are the R codes used for this project, shown as console transcripts: lines beginning
with “>” denote the original commands, and the remaining lines are the corresponding R output.
R codes:
# Question 1:
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax,main="Scatterplot for nox,indus,dis,tax")
# Question 2:
> cor(nox,dis)
[1] -0.7692301
> # note: inside an R formula, 1/dis is not interpreted as the arithmetic
> # inverse; the transformed term must be protected with I()
> model <- lm(nox ~ I(1/dis))
> summary(model)
# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis+logdis+dissquare +indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis+dissquare +indus + tax, BostonHousing)
> anova(u1,u2)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 500 1.6897
2 501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(b)
> u3 <- lm(nox ~ logdis +indus + tax, BostonHousing)
> anova(u1,u3)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 500 1.6897
2 502 1.7306 -2 -0.040907 6.0524 0.002528 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(c)
> A = matrix(c(0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-1),nrow=3,byrow=TRUE)
> Model <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> variance <- (A %*% vcov(Model) %*% t(A))
> E <- eigen(variance, TRUE)
> Evalues <- E$values
> Evectors <-E$vectors
> sqrtvariance <- Evectors %*% diag(1/sqrt(Evalues)) %*% t(Evectors)
> Z <- sqrtvariance %*% A %*% coef(Model)
> F <- sum(Z^2)/3
> F
[1] 42.80353
# Question 4:
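> # (the commands below assume BostonHousing has been attached,
> # e.g. attach(BostonHousing), so that medv, rm, dis and age are visible)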
> Medv<-log(medv)
> Rm<-(rm)^2
> Dis<-log(dis)
> forward11<-lm(Medv~1)
> summary(forward11)
Call:
lm(formula = Medv ~ 1)
Residuals:
Min 1Q Median 3Q Max
-1.42507 -0.19983 0.01949 0.18436 0.87751
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.03451 0.01817 167 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4088 on 505 degrees of freedom
> forward12<-lm(Medv~Rm-1)
> summary(forward12)
Call:
lm(formula = Medv ~ Rm - 1)
Residuals:
Min 1Q Median 3Q Max
-2.5860 -0.1694 0.1560 0.4042 2.3811
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Rm 0.0735845 0.0005646 130.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5208 on 505 degrees of freedom
Multiple R-squared: 0.9711, Adjusted R-squared: 0.9711
F-statistic: 1.699e+04 on 1 and 505 DF, p-value: < 2.2e-16
> forward13<-lm(Medv~age-1)
> summary(forward13)
Call:
lm(formula = Medv ~ age - 1)
Residuals:
Min 1Q Median 3Q Max
-2.0839 -0.5927 0.3357 1.5142 3.4463
Coefficients:
Estimate Std. Error t value Pr(>|t|)
age 0.0369330 0.0008236 44.84 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 505 degrees of freedom
Multiple R-squared: 0.7993, Adjusted R-squared: 0.7989
F-statistic: 2011 on 1 and 505 DF, p-value: < 2.2e-16
> forward14<-lm(Medv~Dis-1)
> summary(forward14)
Call:
lm(formula = Medv ~ Dis - 1)
Residuals:
Min 1Q Median 3Q Max
-2.2240 -0.4238 0.6628 1.2085 3.6475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Dis 2.17068 0.03972 54.66 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.165 on 505 degrees of freedom
Multiple R-squared: 0.8554, Adjusted R-squared: 0.8551
F-statistic: 2987 on 1 and 505 DF, p-value: < 2.2e-16
> forward21<-lm(Medv~Rm)
> summary(forward21)
Call:
lm(formula = Medv ~ Rm)
Residuals:
Min 1Q Median 3Q Max
-1.20269 -0.10530 0.06992 0.17255 1.31948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.878478 0.063036 29.8 <2e-16 ***
Rm 0.028909 0.001537 18.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3137 on 504 degrees of freedom
Multiple R-squared: 0.4123, Adjusted R-squared: 0.4112
F-statistic: 353.6 on 1 and 504 DF, p-value: < 2.2e-16
> forward22<-lm(Medv~age)
> summary(forward22)
Call:
lm(formula = Medv ~ age)
Residuals:
Min 1Q Median 3Q Max
-1.21816 -0.20280 -0.01733 0.16722 1.08442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4860274 0.0427295 81.58 <2e-16 ***
age -0.0065843 0.0005765 -11.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3647 on 504 degrees of freedom
Multiple R-squared: 0.2056, Adjusted R-squared: 0.204
F-statistic: 130.4 on 1 and 504 DF, p-value: < 2.2e-16
> forward23<-lm(Medv~Dis)
> summary(forward23)
Call:
lm(formula = Medv ~ Dis)
Residuals:
Min 1Q Median 3Q Max
-1.18240 -0.21227 -0.02365 0.16558 1.20522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.66935 0.04024 66.338 <2e-16 ***
Dis 0.30737 0.03084 9.965 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.374 on 504 degrees of freedom
Multiple R-squared: 0.1646, Adjusted R-squared: 0.163
F-statistic: 99.31 on 1 and 504 DF, p-value: < 2.2e-16
> forward31<-lm(Medv~Rm+Dis)
> summary(forward31)
Call:
lm(formula = Medv ~ Rm + Dis)
Residuals:
Min 1Q Median 3Q Max
-1.05461 -0.12689 0.03383 0.16131 1.46235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746011 0.061332 28.468 < 2e-16 ***
Rm 0.026088 0.001484 17.585 < 2e-16 ***
Dis 0.206437 0.024965 8.269 1.21e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2946 on 503 degrees of freedom
Multiple R-squared: 0.4827, Adjusted R-squared: 0.4806
F-statistic: 234.6 on 2 and 503 DF, p-value: < 2.2e-16
> forward32<-lm(Medv~Rm+age)
> summary(forward32)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
Min 1Q Median 3Q Max
-1.0789 -0.1094 0.0335 0.1300 1.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3346303 0.0726764 32.12 <2e-16 ***
Rm 0.0256312 0.0014361 17.85 <2e-16 ***
age -0.0047407 0.0004632 -10.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16
> forward41<-lm(Medv~Rm+Dis+age)
> summary(forward41)
Call:
lm(formula = Medv ~ Rm + Dis + age)
Residuals:
Min 1Q Median 3Q Max
-1.06502 -0.11534 0.02519 0.13058 1.43388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2520854 0.1061091 21.224 < 2e-16 ***
Rm 0.0254895 0.0014420 17.676 < 2e-16 ***
Dis 0.0402145 0.0376701 1.068 0.286
age -0.0041510 0.0007209 -5.758 1.48e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16
> forward<-lm(Medv~Rm+age)
> summary(forward)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
Min 1Q Median 3Q Max
-1.0789 -0.1094 0.0335 0.1300 1.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3346303 0.0726764 32.12 <2e-16 ***
Rm 0.0256312 0.0014361 17.85 <2e-16 ***
age -0.0047407 0.0004632 -10.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16
# Question 5:
> backward11<-lm(Medv ~Rm+ age+ Dis)
> summary(backward11)
Call:
lm(formula = Medv ~ Rm + age + Dis)
Residuals:
Min 1Q Median 3Q Max
-1.06502 -0.11534 0.02519 0.13058 1.43388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2520854 0.1061091 21.224 < 2e-16 ***
Rm 0.0254895 0.0014420 17.676 < 2e-16 ***
age -0.0041510 0.0007209 -5.758 1.48e-08 ***
Dis 0.0402145 0.0376701 1.068 0.286
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16
> backward21<-lm(Medv ~Rm+ age)
> summary(backward21)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
Min 1Q Median 3Q Max
-1.0789 -0.1094 0.0335 0.1300 1.4183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3346303 0.0726764 32.12 <2e-16 ***
Rm 0.0256312 0.0014361 17.85 <2e-16 ***
age -0.0047407 0.0004632 -10.23 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16
# Question 6:
> data("BostonHousing",package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
> n <- 506
> m1<-lm(logmedv~1)
> m2<-lm(logmedv~rmsq-1)
> m3<-lm(logmedv~age-1)
> m4<-lm(logmedv~logdis-1)
> m5<-lm(logmedv~rmsq+logdis-1)
> m6<-lm(logmedv~logdis+age-1)
> m7<-lm(logmedv~rmsq+age-1)
> m8<-lm(logmedv~age)
> m9<-lm(logmedv~rmsq)
> m10<-lm(logmedv~logdis)
> m11<-lm(logmedv~rmsq+logdis)
> m12<-lm(logmedv~rmsq+age)
> m13<-lm(logmedv~logdis+age)
> m14<-lm(logmedv~rmsq+logdis+age-1)
> m15<-lm(logmedv~rmsq+logdis+age)
> m16<-lm(logmedv~-1)
>
> AIC1=n*log(sum(m1$residuals^2)/n)+2*1
> AIC2=n*log(sum(m2$residuals^2)/n)+2*1
> AIC3=n*log(sum(m3$residuals^2)/n)+2*1
> AIC4=n*log(sum(m4$residuals^2)/n)+2*1
> AIC5=n*log(sum(m5$residuals^2)/n)+2*2
> AIC6=n*log(sum(m6$residuals^2)/n)+2*2
> AIC7=n*log(sum(m7$residuals^2)/n)+2*2
> AIC8=n*log(sum(m8$residuals^2)/n)+2*2
> AIC9=n*log(sum(m9$residuals^2)/n)+2*2
> AIC10=n*log(sum(m10$residuals^2)/n)+2*2
> AIC11=n*log(sum(m11$residuals^2)/n)+2*3
> AIC12=n*log(sum(m12$residuals^2)/n)+2*3
> AIC13=n*log(sum(m13$residuals^2)/n)+2*3
> AIC14=n*log(sum(m14$residuals^2)/n)+2*3
> AIC15=n*log(sum(m15$residuals^2)/n)+2*4
> AIC16=n*log(sum(m16$residuals^2)/n)+2*0
>
>
> BIC1=n*log(sum(m1$residuals^2)/n)+log(n)*1
> BIC2=n*log(sum(m2$residuals^2)/n)+log(n)*1
> BIC3=n*log(sum(m3$residuals^2)/n)+log(n)*1
> BIC4=n*log(sum(m4$residuals^2)/n)+log(n)*1
> BIC5=n*log(sum(m5$residuals^2)/n)+log(n)*2
> BIC6=n*log(sum(m6$residuals^2)/n)+log(n)*2
> BIC7=n*log(sum(m7$residuals^2)/n)+log(n)*2
> BIC8=n*log(sum(m8$residuals^2)/n)+log(n)*2
> BIC9=n*log(sum(m9$residuals^2)/n)+log(n)*2
> BIC10=n*log(sum(m10$residuals^2)/n)+log(n)*2
> BIC11=n*log(sum(m11$residuals^2)/n)+log(n)*3
> BIC12=n*log(sum(m12$residuals^2)/n)+log(n)*3
> BIC13=n*log(sum(m13$residuals^2)/n)+log(n)*3
> BIC14=n*log(sum(m14$residuals^2)/n)+log(n)*3
> BIC15=n*log(sum(m15$residuals^2)/n)+log(n)*4
> BIC16=n*log(sum(m16$residuals^2)/n)+log(n)*0
> min(AIC1,AIC2,AIC3,AIC4,AIC5,AIC6,AIC7,AIC8,AIC9,AIC10,AIC11,AIC12,AIC13,AIC14,AIC15,AIC16)
[1] -1265.073
> AIC12
[1] -1265.073
> min(BIC1,BIC2,BIC3,BIC4,BIC5,BIC6,BIC7,BIC8,BIC9,BIC10,BIC11,BIC12,BIC13,BIC14,BIC15,BIC16)
[1] -1252.394
> BIC12
[1] -1252.394
# Question 7:
> data("BostonHousing",package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
>
> Group1 <- subset(BostonHousing,BostonHousing$zn!=55.0)
> Group2 <- subset(BostonHousing,BostonHousing$zn==55.0)
> fitmodel <- lm(logmedv~rmsq+age,data = Group1)
> summary(fitmodel)
Call:
lm(formula = logmedv ~ rmsq + age, data = Group1)
Residuals:
Min 1Q Median 3Q Max
-1.07887 -0.10964 0.03389 0.13020 1.41838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3360096 0.0729281 32.03 <2e-16 ***
rmsq 0.0256286 0.0014406 17.79 <2e-16 ***
age -0.0047542 0.0004661 -10.20 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2864 on 500 degrees of freedom
Multiple R-squared: 0.5129, Adjusted R-squared: 0.5109
F-statistic: 263.2 on 2 and 500 DF, p-value: < 2.2e-16
> p <- predict(fitmodel,newdata=Group2)
> SSPE <- sum((Group2$logmedv-p)^2)
> SSPE
[1] 0.02835043

Mais conteúdo relacionado

Destaque

Masters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copyMasters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copyAlemu Tadesse
 
ramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelanramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelanYograj Tiwari
 
COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016Earle Olson
 
CV- Allam Abuhuson updated
CV- Allam Abuhuson updatedCV- Allam Abuhuson updated
CV- Allam Abuhuson updatedAllam Hasan
 
E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)Emily Karlichek
 
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajoRevolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajoAlfredo Vela Zancada
 
La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015Alfredo Vela Zancada
 
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIAMEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIAALEXANDRE PANTOJA
 
Conegliano capitale del prosecco superiore
Conegliano capitale del prosecco superioreConegliano capitale del prosecco superiore
Conegliano capitale del prosecco superioreMichael Mazzer
 
웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrap웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrapYu Yongwoo
 
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드Yu Yongwoo
 

Destaque (16)

Masters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copyMasters Doc-student copy and temporary for true copy
Masters Doc-student copy and temporary for true copy
 
MONGOL EMPIRE
MONGOL EMPIRE MONGOL EMPIRE
MONGOL EMPIRE
 
Image076.jpg
Image076.jpgImage076.jpg
Image076.jpg
 
ramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelanramashankar mishra rewa doar adiwasi sammelan
ramashankar mishra rewa doar adiwasi sammelan
 
Ineternet
IneternetIneternet
Ineternet
 
COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016COTSEmbeddedSystems.ADT-Oct2016
COTSEmbeddedSystems.ADT-Oct2016
 
CV- Allam Abuhuson updated
CV- Allam Abuhuson updatedCV- Allam Abuhuson updated
CV- Allam Abuhuson updated
 
hawaii123
hawaii123hawaii123
hawaii123
 
Gsdfgsdgs
GsdfgsdgsGsdfgsdgs
Gsdfgsdgs
 
E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)E. Karlichek - Letter of Rec 5 (KC)
E. Karlichek - Letter of Rec 5 (KC)
 
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajoRevolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
Revolución Digital: o te adaptas o te puedes quedar sin empresa o sin trabajo
 
La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015La Sociedad en Red. Informa anual 2015
La Sociedad en Red. Informa anual 2015
 
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIAMEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
MEDIDA PROVISÓRIA Nº 766/2017. PROGRAMA DE REGULARIZAÇÃO TRIBUTÁRIA
 
Conegliano capitale del prosecco superiore
Conegliano capitale del prosecco superioreConegliano capitale del prosecco superiore
Conegliano capitale del prosecco superiore
 
웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrap웹 개발 스터디 02 - javascript, bootstrap
웹 개발 스터디 02 - javascript, bootstrap
 
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
웹 개발 스터디 01 - PHP 파일 업로드, 다운로드
 

Semelhante a Analysis of Boston Housing Data Models

Evaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayEvaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayCrystal Alvarez
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricinginventionjournals
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricinginventionjournals
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptxmesfin69
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technologypeertechzpublication
 
REGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREREGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREShriramKargaonkar
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)Jeff Lail
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxhyacinthshackley2629
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxnovabroom
 
Energy efficiency dataset
Energy efficiency datasetEnergy efficiency dataset
Energy efficiency datasetAnkit Ghosalkar
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data scienceNikhil Jaiswal
 

Semelhante a Analysis of Boston Housing Data Models (20)

Evaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayEvaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis Essay
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricing
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricing
 
Correlation
CorrelationCorrelation
Correlation
 
Ch14 multiple regression
Ch14 multiple regressionCh14 multiple regression
Ch14 multiple regression
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptx
 
Characteristics and simulation analysis of nonlinear correlation coefficient ...
Characteristics and simulation analysis of nonlinear correlation coefficient ...Characteristics and simulation analysis of nonlinear correlation coefficient ...
Characteristics and simulation analysis of nonlinear correlation coefficient ...
 
Chap04 01
Chap04 01Chap04 01
Chap04 01
 
Math n Statistic
Math n StatisticMath n Statistic
Math n Statistic
 
Regression -Linear.pptx
Regression -Linear.pptxRegression -Linear.pptx
Regression -Linear.pptx
 
Group5
Group5Group5
Group5
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technology
 
200994363
200994363200994363
200994363
 
REGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREREGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HERE
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
 
Energy efficiency dataset
Energy efficiency datasetEnergy efficiency dataset
Energy efficiency dataset
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data science
 

Analysis of Boston Housing Data Models

  • 1. 1 ysstats@bu.edu U37074009 Analysis of the Boston Housing Data from the 1970 census: Diverse Tests and Model Selection Processes regarding the Variables in Boston Housing Data Shuai Yuan1 December 8, 2016 Abstract In this project, we study the Boston Housing Data that was offered by Harrison and Rubinfeld, 1978. The data contained many different variables that related to Boston Housing for 506 tracts of Boston from the 1970 census. The data is included in the R package mlbench. Using the data and R software, we first study the scatterplot matrix and the correlation of different variables to find their relations briefly. Then, we make various tests for many null hypotheses to examine the properties of different models. Finally, we perform the model selections by using different methods such as the forward algorithm, backward algorithm as well as the AIC and BIC criterion to find and analyze the most fitted model for our data sets. And the same time, we also compute the SSPE for our subset of data.
  • 2. 1 Contents 1 Introduction 2 2 Analysis 3 2.1 Analysis of the linearity between variables 3 2.1.1 Scatterplot matrix for variables 3 2.1.2 Explanation of Correlation between two variables 4 2.2 The statistical tests for the Null Hypotheses of the fitted model 6 2.3 Model selection by using the forward algorithm 9 2.4 Model selection by using the backward algorithm 11 2.5 Model selection by using the AIC and BIC criterion 12 2.6 Analysis of the related statistics 14 2.6.1 Fit the model by using the subset of the data 14 2.6.2 Compute and analyze the SSPE for subset of the data 14 3 Conclusion 16 4 Appendix 18
  • 3. 2 1 Introduction The data of the Boston Housing from the 1970 census are used in this project. The dataset contains 14 variables with 506 observations. The data is included in the R package mlbench. In this project, we used various tools to analyze the Boston Housing data and the most frequently used method is the linear regression. At the same time, we also used hypothesis testing, t-test, F- test as well as model selection as our methods to analyze the properties of the related data. Using the data and R software, we first study the scatterplot matrix and the correlation of different variables to find their relations briefly. Then, we make various tests for many null hypotheses to examine the properties of different models. Finally, we perform the model selections by using different methods such as the forward algorithm, backward algorithm as well as the AIC and BIC criterion to find and analyze the most fitted model for our data sets. And the same time, we also compute the SSPE for our subset of data. The outline for the remainder of the paper is as follows. In Section 2, we provide the main results and analysis towards the multiple aspects of our topics. Section 3 concludes. In the Appendix, we provide our R codes as well as the related outputs. Finally, we also provide the references that we use in this project. To be specific, the part 2.1.1 is for the question 1, part 2.1.2 is for the question 2, part 2.2 is for the question 3, part 2.3 is for the question 4, part 2.4 is for the question 5, part 2.5 is for the question 6, part 2.6 is for the question 7.
  • 4. 3 2 Analysis To get a briefly understanding of the relationships between different variables at the very beginning, we get the scatterplot matrix of these different variables and find the non-linearity between these variables. Therefore, the correlation of these variables may not appropriate for describing the relationships within the variables. At the same time, we also compute different test statistics and test many hypotheses for the general model. Moreover, we also perform variable selection using forward algorithm, backward algorithm, AIC and BIC criterion. We find that both criterions select the same model for us and we explain the reason why the selected model is the one that we need. Finally, we also get the fitted model for subset and compute and compare the SSPE of the selected models. 2.1 Analysis of the linearity between variables 2.1.1 Scatterplot matrix for variables First, according to the description of the R Package “mlbench”, we can get the meaning of the following variables as well as the scatterplot matrix for these four variables which are listed below: 𝒏𝒐𝒙: Nitric oxides concentration (parts per 10 million). 𝒊𝒏𝒅𝒖𝒔: Proportion of residential land zoned for lots over 25,000 sq.ft. 𝒅𝒊𝒔: Weighted distances to five Boston employment centers. 𝒕𝒂𝒙: Full-value property-tax rate per USD 10,000. plot 1 Scatterplot matrix for the variables nox, indus, dis, tax
  • 5. 4 According to the scatterplot matrix, we can find that these four variables are all related in some patterns. For instance, generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable 𝑑𝑖𝑠 and the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠. On the other hand, generally speaking, the relationships between other variables are positive at the low volume level, while the relationships may get vague and non-related at the high volume level. On the other hand, we can also find the possible explanations according to the meanings of these variables. Because the variable 𝑛𝑜𝑥 means the “Nitric oxides concentration (parts per 10 million).”, which also represents the degree of air pollution in this area. For the variable 𝑑𝑖𝑠, it means the “Weighted distances to five Boston employment centers.”, which also represents the degree of living away from the downtown. And for the variable 𝑖𝑛𝑑𝑢𝑠 , it means the “Proportion of residential land zoned for lots over 25,000 sq.ft.”, which also represents the level of economy of residents. Because only if when people own enough money, will they use their money to build their own parking lots which are also quite wide. Therefore, we can find the possible explanations for these relationships. As we all know, the air of the area that far away from the downtown is better because there are more trees and therefore, the level of pollution there can be at a low level. So it is reason to see that there is negative relationship between the variable 𝑛𝑜𝑥 and 𝑑𝑖𝑠. On the other hand, the degree of economic development in the areas that far away from the downtown is worse than that of the downtown areas and therefore, the proportion of residential land zoned for large lots is smaller than that of the downtown areas. So it is reason to see that there is negative relationship between the variable 𝑖𝑛𝑑𝑢𝑠 and 𝑑𝑖𝑠. 2.1.2 Explanation of Correlation between two variables We know that the formula of correlation coefficient between two variables is that: ρ23 = Cov(X, Y) D(X) D(Y) Therefore, according to the R codes, we can find that the correlation between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 is about -0.7692301, which may give us the information that these two variables are negatively correlated.
  • 6. 5 However, the thing we should not forget is that the correlation coefficient between two variables is used for examining the relationship for linear regression model, or in other words, the linear relationship between two variables. But we can find from the scatterplot that the relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 is more likely as exponential relationship, which means that there is not reasonable to use the correlation coefficient between these two variables to examine the relationship between them. On the other hand, we can also test their relationship of them by getting the model between them. From the model, we assume that there is an exponential relation between them and we get the significantly p-value for this model. Therefore, according to the discussion above, we can safely draw the conclusion that we cannot use the correlation between these two variables to quantify the strength of relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠.
  • 7. 6 2.2 The statistical tests for the Null Hypotheses of the fitted model For this question, the full model given only contains five variables and intercept. 𝛽?means the intercept, 𝛽@ measures the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑑𝑖𝑠 increased, 𝛽A measures the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑙𝑜𝑔(𝑑𝑖𝑠) increases, 𝛽D measure the change of the variable 𝑛𝑜𝑥 if one unit of variable 𝑑𝑖𝑠^2 increases, 𝛽H measure the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑖𝑛𝑑𝑢𝑠 increases, 𝛽I measure the change of the variable 𝑛𝑜𝑥 if one unit of the variable 𝑡𝑎𝑥 increases. Since three of them have already be given in the data set, we just need to transform and add the left two variables which are log(dis) and dis^2. So we create the new variables whose names are log(dis) and dissquare and will use them to refer the variable log(dis) and the variable dis^2. For this section, since we want to decide whether or not specified parameters are equal to 0 or each other, we will do the F-test for all of three sub-questions. At the beginning, we have the formula for F-test as below: 𝐹 = ( 𝑅𝑆𝑆YZ − 𝑅𝑆𝑆Z 𝑑𝑓YZ − 𝑑𝑓Z )/( 𝑅𝑆𝑆Z 𝑑𝑓Z ) The table for the summary of F-test value and corresponding p-value for the question are summarized below: question a question b question c F-test(value) 5.911 6.0524 42.80353 p-value 0.0154 0.002528 0.0001 Table 1 The F-test(value) and p-value of question a, b, c We will use it to evaluate the question. Question a: According to the definition of the null hypothesis test, the main idea is to test whether the coefficient of the variable log(dis) is equal to 0 or not. Since the variable log(dis) is the only target we want to focus here, we can just build a new regression model which does not contain the variable log(dis) to compare with the original regression model. When we compare two regression models, we will do the F-test to see if they are significantly different with each other. Form the
From the results of the R code, the F statistic is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis depends on the significance level α. Here we set α = 0.05; since the p-value is smaller than 0.05, we reject the null hypothesis and conclude that β₂ is not equal to 0 at the 95% confidence level. However, if we want to be 99% confident about the result, α becomes 0.01, and since the p-value is larger than 0.01, we cannot reject the null hypothesis at the 99% confidence level.

Question b: For part b, we want to test whether the coefficients of the variable dis and the variable dis² are both equal to 0. Since the question focuses on two terms at once, we proceed as in part a: we fit another regression model that contains the intercept and the three remaining terms, excluding dis and dis², and compare it with the original full model using an F-test. Here the null hypothesis is β₁ = β₃ = 0, and the alternative hypothesis is that at least one of them is not equal to 0. The F statistic is 6.0524 and the corresponding p-value is 0.002528. Setting α = 0.05 again, the p-value is smaller than 0.05, so we reject the null hypothesis and conclude that at least one of β₁ and β₃ is not equal to 0.

Question c: The situation in part c is different, since the question asks whether β₂ = β₃ = 0 and, at the same time, whether β₄ = β₅. Instead of the nested-model approach used above, we express the hypothesis in matrix form. We split the first condition (β₂ = β₃ = 0) into β₂ = 0 and β₃ = 0. The first row of the matrix A therefore has a single "1" in the position of β₂ and "0" everywhere else, and the second row has a single "1" in the position of β₃ and "0" everywhere else, so the first two entries of A times the coefficient vector are exactly β₂ and β₃. To test whether β₄ = β₅, the third row of A has a "1" in the position of β₄ and a "-1" in the position of β₅, so its entry in the product is β₄ − β₅. We then test whether all three entries equal 0 with an F-test. The F statistic is 42.80353 and the corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than α, so we reject
the null hypothesis and conclude that at least one of β₂ and β₃ is not equal to 0, or that β₄ is not equal to β₅.
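The Appendix computes this statistic through an eigendecomposition of A·V·Aᵀ; an equivalent sketch (our illustration), using the standard general-linear-hypothesis form F = (Ab)ᵀ (A V Aᵀ)⁻¹ (Ab) / q with q = 3 constraints, is:

# general linear hypothesis F statistic, assuming the logdis and dissquare
# columns were created as above
A <- matrix(c(0, 0, 1, 0, 0,  0,
              0, 0, 0, 1, 0,  0,
              0, 0, 0, 0, 1, -1), nrow = 3, byrow = TRUE)
fit <- lm(nox ~ dis + logdis + dissquare + indus + tax, data = BostonHousing)
b  <- coef(fit)     # order: intercept, dis, logdis, dissquare, indus, tax
V  <- vcov(fit)
Ab <- A %*% b
Fstat <- drop(t(Ab) %*% solve(A %*% V %*% t(A)) %*% Ab) / nrow(A)
Fstat                                                      # about 42.80353
pf(Fstat, nrow(A), df.residual(fit), lower.tail = FALSE)   # well below 0.0001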
2.3 Model selection by using the forward algorithm

In this section, we use a forward algorithm to analyze the relationship between the response variable and the potential explanatory variables below. According to the question's requirements, we transformed the original variables as follows.

Response variable:
log(medv), the natural logarithm of the median value of owner-occupied homes in $1000's.

Potential explanatory variables:
rm², the square of the average number of rooms per dwelling.
log(dis), the natural logarithm of the weighted distances to five Boston employment centers.
age, the proportion of owner-occupied units built prior to 1940.

We performed variable selection using a forward algorithm with a significance level of 5%. In the first step, we fitted a separate model for each candidate term. We named these models "forward11" to "forward14"; the details can be found in the Appendix. The results of the regressions are summarized as follows:

name        model                      variable    t value    Pr(>|t|)
forward11   log(medv) ~ 1              intercept   167        <2e-16
forward12   log(medv) ~ rm^2 - 1       rm^2        130        <2e-16
forward13   log(medv) ~ age - 1        age         44.84      <2e-16
forward14   log(medv) ~ log(dis) - 1   log(dis)    54.66      <2e-16

Table 2  Summary of the models forward11 to forward14

We can observe from the table that, while all the terms are significant, the intercept has the largest t-value, so we added the intercept to the model first. Next, we regressed the response on the intercept together with each of the remaining three variables in the models "forward21" to "forward23". The summarized results are shown in the table below:
name        model                  variable    t value    Pr(>|t|)
forward21   log(medv) ~ rm^2       rm^2        18.8       <2e-16
forward22   log(medv) ~ age        age         -11.42     <2e-16
forward23   log(medv) ~ log(dis)   log(dis)    9.965      <2e-16

Table 3  Summary of the models forward21 to forward23

As shown in the table, the p-values of all the variables are significant, but the variable rm^2 has the largest absolute t-value, so we added rm^2 to the model. Then we added each of the two remaining variables, log(dis) and age, to the model containing the intercept and rm^2, giving the models "forward31" and "forward32":

name        model                         variable    t value    Pr(>|t|)
forward31   log(medv) ~ rm^2 + log(dis)   log(dis)    8.269      1.21e-15
forward32   log(medv) ~ rm^2 + age        age         -10.23     <2e-16

Table 4  Summary of the models forward31 and forward32

From the results above, both variables are significant, but the variable age has the larger absolute t-value (and the smaller p-value) of the two, so we added the variable age to our model. Finally, we regressed the response variable log(medv) on all of the variables in the model "forward41":

name        model                               variable    t value    Pr(>|t|)
forward41   log(medv) ~ rm^2 + age + log(dis)   log(dis)    1.068      0.286

Table 5  Summary of the model forward41

Based on the table above, the variable log(dis) is not significant in this model, so we removed it. Therefore, after the forward selection, our final model is

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term.
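As a cross-check of this manual search, R's built-in step() function can run a forward selection automatically. The sketch below is our illustration; note that step() selects by AIC rather than by individual p-values, so it is a complementary check rather than the same rule:

# forward selection by AIC with step(); the working variables mirror the
# Appendix code for Question 4
Medv <- log(BostonHousing$medv)
Rm   <- BostonHousing$rm^2
Dis  <- log(BostonHousing$dis)
age  <- BostonHousing$age
step(lm(Medv ~ 1), scope = ~ Rm + age + Dis,
     direction = "forward")   # also stops at Medv ~ Rm + age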
2.4 Model selection by using the backward algorithm

In this section, we use a backward algorithm to analyze the relationship between the response variable and the potential explanatory variables, using the same transformed variables defined in the previous section. We performed variable selection with a significance level of 5%. For the backward algorithm, we first regressed the response on all the variables at once. We named this model "backward11"; the details can be found in the Appendix. The results are summarized as follows:

name         model                               variable    t value    Pr(>|t|)
backward11   log(medv) ~ rm^2 + age + log(dis)   intercept   21.224     <2e-16
                                                 rm^2        17.676     <2e-16
                                                 age         -5.758     1.48e-08
                                                 log(dis)    1.068      0.286

Table 6  Summary of the model backward11

Based on these results, all the explanatory variables are significant except the variable log(dis), whose t-value is 1.068 and p-value is 0.286. We therefore removed the variable log(dis) and fitted a new model, "backward21", with the remaining variables:

name         model                    variable    t value    Pr(>|t|)
backward21   log(medv) ~ rm^2 + age   intercept   32.12      <2e-16
                                      rm^2        17.85      <2e-16
                                      age         -10.23     <2e-16

Table 7  Summary of the model backward21

After deleting the variable log(dis), the remaining variables are all significant, so we stopped with the model "backward21". This is the same model obtained by the forward algorithm:

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term.
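A compact sketch of the same backward pass, using drop1() and the working variables defined in the previous sketch, reports the F-test for removing each term from the current model:

# backward pass with drop1(); for a single-df term the F-test equals the
# square of its t-test, so the p-values match Tables 6 and 7
fit_full <- lm(Medv ~ Rm + age + Dis)
drop1(fit_full, test = "F")        # only Dis is not significant (p = 0.286)
fit_reduced <- update(fit_full, . ~ . - Dis)
drop1(fit_reduced, test = "F")     # both remaining terms are significant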
2.5 Model selection by using the AIC and BIC criterion

First, we can do a preliminary analysis of the full model we are interested in. In the full linear regression model, the t-values and p-values indicate whether each term is individually significant. Setting α = 0.05, the terms rm², age, and the intercept all have p-values well below 0.05, so they are significant, while the variable log(dis) has a p-value of 0.286 and is not significant.

In this section, we perform variable selection using the AIC and BIC criteria. AIC is a measure of the relative quality of statistical models for a given set of data: given a collection of candidate models, AIC estimates the quality of each model relative to the others, and thus provides a means for model selection. BIC is likewise a criterion for selecting among a finite set of models, and the model with the lowest BIC score is preferred. The formulas for AIC and BIC are

    AIC(m) = n · log( RSS(m) / n ) + 2 · p(m),
    BIC(m) = n · log( RSS(m) / n ) + log(n) · p(m),

where m is the regression model, n is the sample size, RSS(m) is the residual sum of squares of m, and p(m) denotes the number of estimated coefficients in m. In this project the sample size is 506, and all we need to do is compute the corresponding AIC and BIC scores for every candidate regression model in R. The candidate models are summarized below:
Candidate model                          AIC score    BIC score
log(medv) ~ 1                            -904.371     -900.145
log(medv) ~ rm^2 - 1                     -659.289     -655.063
log(medv) ~ age - 1                      321.927      326.154
log(medv) ~ log(dis) - 1                 155.969      160.195
log(medv) ~ rm^2 + log(dis) - 1          -750.189     -741.736
log(medv) ~ log(dis) + age - 1           -533.471     -525.018
log(medv) ~ rm^2 + age - 1               -702.556     -694.102
log(medv) ~ age                          -1018.83     -1010.378
log(medv) ~ rm^2                         -1171.36     -1162.907
log(medv) ~ log(dis)                     -993.379     -984.926
log(medv) ~ rm^2 + log(dis)              -1233.86     -1221.175
log(medv) ~ rm^2 + age                   -1265.07     -1252.394
log(medv) ~ log(dis) + age               -1021.36     -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1    -940.149     -929.47
log(medv) ~ rm^2 + log(dis) + age        -1264.22     -1247.315
log(medv) ~ -1                           1132.453     1132.453

Table 8  The AIC and BIC scores of all candidate models

From the table above, the regression model with the smallest AIC score contains the variables rm² and age as well as the intercept, and the model with the smallest BIC score is the same one. Checking this regression model, it contains only the variable rm² and the variable age together with the intercept, and with α = 0.05 all of its terms are significant. We therefore select the model containing rm², age, and the intercept under both the AIC and the BIC criterion.
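The scores in Table 8 can be reproduced with a small helper implementing the two formulas directly; as a check, base R's extractAIC() uses the same n·log(RSS/n) + k·p form, with k = 2 for AIC and k = log(n) for BIC. A minimal sketch:

# AIC/BIC scores from the formulas above, checked against extractAIC()
n <- nrow(BostonHousing)   # 506
score <- function(fit, k) {
  rss <- sum(residuals(fit)^2)
  p   <- length(coef(fit))          # number of estimated coefficients
  n * log(rss / n) + k * p
}
m12 <- lm(log(medv) ~ I(rm^2) + age, data = BostonHousing)
score(m12, 2)                       # AIC, about -1265.07
score(m12, log(n))                  # BIC, about -1252.39
extractAIC(m12, k = log(n))[2]      # same BIC value from base R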
2.6 Analysis of the related statistics

2.6.1 Fit the model by using the subset of the data

According to the results above, we finally choose the model "m12", which has the minimum BIC, as our fitted model. From question 6, the fitted model can be written as

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term. We now use the data from Group1 to fit this model. From the results generated by R, the fitted model is

    log(medv) = 2.3360 + 0.0256 · rm² − 0.0048 · age,

and the p-values of all the explanatory variables are highly significant.

2.6.2 Compute and analyze the SSPE for the subset of the data

We can also apply another method, cross-validation, to further examine the model selection process. This method proceeds as follows. First, we split the data into two subsets according to a user-defined criterion (here, the tracts with zn = 55.0 form the second subset): Group1 and Group2, also called the training data and the validation data. Second, we fit the model using the data from Group1. Third, using the explanatory variables in Group2, we predict the response for each observation; we denote the observed values by yᵢ = log(medv)ᵢ and the corresponding predicted values by ŷᵢ. Finally, we compute the SSPE, the "Sum of Squared Prediction Errors".

Therefore, according to the question, we first divided the original data set "BostonHousing" into Group1 and Group2, and then computed the SSPE over Group2 according to its definition:

    SSPE = Σᵢ₌₁ᵏ ( yᵢ − ŷᵢ )²,

where the sum runs over the k observations in Group2, yᵢ denotes the observed response in Group2, and ŷᵢ denotes the corresponding predicted value computed by the prediction function in R. The resulting SSPE for Group2 is 0.02835043.
We can also note that the model obtained in Section 2.4 (question 5),

    log(medv) = β₀ + β₁ · rm² + β₂ · age + ε,

where ε is the error term, is the same as the model obtained in Section 2.5 (question 6). Therefore, the different selection procedures agree on the same final model.
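Condensing the Appendix code for question 7, a minimal sketch of the whole cross-validation step is:

# train/validate split and SSPE, mirroring the Appendix code for Question 7
BostonHousing <- transform(BostonHousing, logmedv = log(medv), rmsq = rm^2)
Group1 <- subset(BostonHousing, zn != 55.0)   # training data
Group2 <- subset(BostonHousing, zn == 55.0)   # validation data
fit  <- lm(logmedv ~ rmsq + age, data = Group1)
pred <- predict(fit, newdata = Group2)
sum((Group2$logmedv - pred)^2)                # SSPE, about 0.02835043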
3 Conclusion

In this project, we first obtained the scatterplot matrix of four variables: nox, indus, dis, and tax. According to the scatterplot matrix, these four variables are all related in some pattern. Generally speaking, the variable nox is negatively related to the variable dis, and the variable indus is also negatively related to the variable dis; the relationships among the other pairs tend to be positive at low values and become vague at high values. Possible explanations follow from the meanings of these variables. We also found non-linearity between the variable nox and the variable dis; therefore, we cannot use the correlation between these two variables to quantify the strength of the relationship between nox and dis.

Second, we made several tests of null hypotheses about the fitted model. Using F-tests and the associated p-values, we found that the p-values for the null hypotheses β₂ = 0; β₁ = β₃ = 0; and the joint hypothesis β₂ = β₃ = 0 together with β₄ = β₅ are all smaller than 0.05, so we reject all of these null hypotheses at the 5% significance level.

Third, we used the forward algorithm to find the best model for the regression problem. We added variables to the model one at a time, using the p-values to test whether each variable is significant, and found that the final model includes the variable rm², the variable age, and the intercept. We also used the backward algorithm for model selection: we first fitted the model containing all the variables, and then removed the insignificant variables one by one according to their p-values. The model found by the backward algorithm is the same as the one found by the forward algorithm.
At the same time, we used both the AIC and the BIC criterion for model selection. We found that the regression model with the smallest AIC score contains the variables rm² and age as well as the intercept, and the model with the smallest BIC score is the same one; with α = 0.05, all the variables in this model are significant. So we select the model containing rm², age, and the intercept under both the AIC and the BIC criterion.

Finally, we applied cross-validation to further examine the model selection process, and computed the sum of squared prediction errors (SSPE) over Group2. The model obtained in Section 2.4 (question 5) is the same as the one obtained in Section 2.5 (question 6), so the different selection procedures agree on the same final model.
4 Appendix

The following material is the R code used for this project. Lines beginning with ">" are the commands we entered; the remaining lines are the corresponding R output.

R code:

# Question 1:
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax, main="Scatterplot for nox,indus,dis,tax")

# Question 2:
> cor(nox,dis)
[1] -0.7692301
> # Note: in formula syntax, "/" is the nesting operator, so nox ~ 1/dis does
> # not fit the reciprocal; I() is needed to use 1/dis as a predictor.
> model <- lm(nox ~ I(1/dis))
> summary(model)

# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis + dissquare + indus + tax, BostonHousing)
> anova(u1,u2)
Analysis of Variance Table

Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1    500 1.6897
2    501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(b)
> u3 <- lm(nox ~ logdis + indus + tax, BostonHousing)
> anova(u1,u3)
Analysis of Variance Table

Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1    500 1.6897
2    502 1.7306 -2 -0.040907 6.0524 0.002528 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(c)
> A = matrix(c(0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,-1), nrow=3, byrow=TRUE)
> Model <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> variance <- (A %*% vcov(Model) %*% t(A))
> E <- eigen(variance, TRUE)
> Evalues <- E$values
> Evectors <- E$vectors
> sqrtvariance <- Evectors %*% diag(1/sqrt(Evalues)) %*% t(Evectors)
> Z <- sqrtvariance %*% A %*% coef(Model)
> F <- sum(Z^2)/3
> F
[1] 42.80353

# Question 4:
> Medv <- log(medv)
> Rm <- (rm)^2
> Dis <- log(dis)
> forward11 <- lm(Medv~1)
> summary(forward11)

Call:
lm(formula = Medv ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.42507 -0.19983  0.01949  0.18436  0.87751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.03451    0.01817     167   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4088 on 505 degrees of freedom

> forward12 <- lm(Medv~Rm-1)
> summary(forward12)

Call:
lm(formula = Medv ~ Rm - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.5860 -0.1694  0.1560  0.4042  2.3811

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Rm 0.0735845  0.0005646   130.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5208 on 505 degrees of freedom
Multiple R-squared:  0.9711,    Adjusted R-squared:  0.9711
F-statistic: 1.699e+04 on 1 and 505 DF,  p-value: < 2.2e-16

> forward13 <- lm(Medv~age-1)
> summary(forward13)

Call:
lm(formula = Medv ~ age - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.0839 -0.5927  0.3357  1.5142  3.4463

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
age 0.0369330  0.0008236   44.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.373 on 505 degrees of freedom
Multiple R-squared:  0.7993,    Adjusted R-squared:  0.7989
F-statistic: 2011 on 1 and 505 DF,  p-value: < 2.2e-16

> forward14 <- lm(Medv~Dis-1)
> summary(forward14)

Call:
lm(formula = Medv ~ Dis - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2240 -0.4238  0.6628  1.2085  3.6475

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Dis  2.17068    0.03972   54.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.165 on 505 degrees of freedom
Multiple R-squared:  0.8554,    Adjusted R-squared:  0.8551
F-statistic: 2987 on 1 and 505 DF,  p-value: < 2.2e-16

> forward21 <- lm(Medv~Rm)
> summary(forward21)

Call:
lm(formula = Medv ~ Rm)

Residuals:
     Min       1Q   Median       3Q      Max
-1.20269 -0.10530  0.06992  0.17255  1.31948

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.878478   0.063036    29.8   <2e-16 ***
Rm          0.028909   0.001537    18.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3137 on 504 degrees of freedom
Multiple R-squared:  0.4123,    Adjusted R-squared:  0.4112
F-statistic: 353.6 on 1 and 504 DF,  p-value: < 2.2e-16

> forward22 <- lm(Medv~age)
> summary(forward22)

Call:
lm(formula = Medv ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-1.21816 -0.20280 -0.01733  0.16722  1.08442

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.4860274  0.0427295   81.58   <2e-16 ***
age         -0.0065843  0.0005765  -11.42   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3647 on 504 degrees of freedom
Multiple R-squared:  0.2056,    Adjusted R-squared:  0.204
F-statistic: 130.4 on 1 and 504 DF,  p-value: < 2.2e-16

> forward23 <- lm(Medv~Dis)
> summary(forward23)

Call:
lm(formula = Medv ~ Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.18240 -0.21227 -0.02365  0.16558  1.20522

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.66935    0.04024  66.338   <2e-16 ***
Dis          0.30737    0.03084   9.965   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.374 on 504 degrees of freedom
Multiple R-squared:  0.1646,    Adjusted R-squared:  0.163
F-statistic: 99.31 on 1 and 504 DF,  p-value: < 2.2e-16

> forward31 <- lm(Medv~Rm+Dis)
> summary(forward31)

Call:
lm(formula = Medv ~ Rm + Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.05461 -0.12689  0.03383  0.16131  1.46235

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746011   0.061332  28.468  < 2e-16 ***
Rm          0.026088   0.001484  17.585  < 2e-16 ***
Dis         0.206437   0.024965   8.269 1.21e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2946 on 503 degrees of freedom
Multiple R-squared:  0.4827,    Adjusted R-squared:  0.4806
F-statistic: 234.6 on 2 and 503 DF,  p-value: < 2.2e-16

> forward32 <- lm(Medv~Rm+age)
> summary(forward32)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

> forward41 <- lm(Medv~Rm+Dis+age)
> summary(forward41)

Call:
lm(formula = Medv ~ Rm + Dis + age)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
Dis          0.0402145  0.0376701   1.068    0.286
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared:  0.5147,    Adjusted R-squared:  0.5118
F-statistic: 177.5 on 3 and 502 DF,  p-value: < 2.2e-16

> forward <- lm(Medv~Rm+age)
> summary(forward)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

# Question 5:
> backward11 <- lm(Medv ~ Rm + age + Dis)
> summary(backward11)

Call:
lm(formula = Medv ~ Rm + age + Dis)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
Dis          0.0402145  0.0376701   1.068    0.286
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared:  0.5147,    Adjusted R-squared:  0.5118
F-statistic: 177.5 on 3 and 502 DF,  p-value: < 2.2e-16

> backward21 <- lm(Medv ~ Rm + age)
> summary(backward21)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

# Question 6:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
> n <- 506
> m1 <- lm(logmedv~1)
> m2 <- lm(logmedv~rmsq-1)
> m3 <- lm(logmedv~age-1)
> m4 <- lm(logmedv~logdis-1)
> m5 <- lm(logmedv~rmsq+logdis-1)
> m6 <- lm(logmedv~logdis+age-1)
> m7 <- lm(logmedv~rmsq+age-1)
> m8 <- lm(logmedv~age)
> m9 <- lm(logmedv~rmsq)
> m10 <- lm(logmedv~logdis)
> m11 <- lm(logmedv~rmsq+logdis)
> m12 <- lm(logmedv~rmsq+age)
> m13 <- lm(logmedv~logdis+age)
> m14 <- lm(logmedv~rmsq+logdis+age-1)
> m15 <- lm(logmedv~rmsq+logdis+age)
> m16 <- lm(logmedv~-1)

> AIC1=n*log(sum(m1$residuals^2)/n)+2*1
> AIC2=n*log(sum(m2$residuals^2)/n)+2*1
> AIC3=n*log(sum(m3$residuals^2)/n)+2*1
> AIC4=n*log(sum(m4$residuals^2)/n)+2*1
> AIC5=n*log(sum(m5$residuals^2)/n)+2*2
> AIC6=n*log(sum(m6$residuals^2)/n)+2*2
> AIC7=n*log(sum(m7$residuals^2)/n)+2*2
> AIC8=n*log(sum(m8$residuals^2)/n)+2*2
> AIC9=n*log(sum(m9$residuals^2)/n)+2*2
> AIC10=n*log(sum(m10$residuals^2)/n)+2*2
> AIC11=n*log(sum(m11$residuals^2)/n)+2*3
> AIC12=n*log(sum(m12$residuals^2)/n)+2*3
> AIC13=n*log(sum(m13$residuals^2)/n)+2*3
> AIC14=n*log(sum(m14$residuals^2)/n)+2*3
> AIC15=n*log(sum(m15$residuals^2)/n)+2*4
> AIC16=n*log(sum(m16$residuals^2)/n)+2*0

> BIC1=n*log(sum(m1$residuals^2)/n)+log(n)*1
> BIC2=n*log(sum(m2$residuals^2)/n)+log(n)*1
> BIC3=n*log(sum(m3$residuals^2)/n)+log(n)*1
> BIC4=n*log(sum(m4$residuals^2)/n)+log(n)*1
> BIC5=n*log(sum(m5$residuals^2)/n)+log(n)*2
> BIC6=n*log(sum(m6$residuals^2)/n)+log(n)*2
> BIC7=n*log(sum(m7$residuals^2)/n)+log(n)*2
> BIC8=n*log(sum(m8$residuals^2)/n)+log(n)*2
> BIC9=n*log(sum(m9$residuals^2)/n)+log(n)*2
> BIC10=n*log(sum(m10$residuals^2)/n)+log(n)*2
> BIC11=n*log(sum(m11$residuals^2)/n)+log(n)*3
> BIC12=n*log(sum(m12$residuals^2)/n)+log(n)*3
> BIC13=n*log(sum(m13$residuals^2)/n)+log(n)*3
> BIC14=n*log(sum(m14$residuals^2)/n)+log(n)*3
> BIC15=n*log(sum(m15$residuals^2)/n)+log(n)*4
> BIC16=n*log(sum(m16$residuals^2)/n)+log(n)*0

> min(AIC1,AIC2,AIC3,AIC4,AIC5,AIC6,AIC7,AIC8,AIC9,AIC10,AIC11,AIC12,AIC13,AIC14,AIC15,AIC16)
[1] -1265.073
> AIC12
[1] -1265.073
> min(BIC1,BIC2,BIC3,BIC4,BIC5,BIC6,BIC7,BIC8,BIC9,BIC10,BIC11,BIC12,BIC13,BIC14,BIC15,BIC16)
[1] -1252.394
> BIC12
[1] -1252.394

# Question 7:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)

> Group1 <- subset(BostonHousing, BostonHousing$zn != 55.0)
> Group2 <- subset(BostonHousing, BostonHousing$zn == 55.0)
> fitmodel <- lm(logmedv~rmsq+age, data = Group1)
> summary(fitmodel)

Call:
lm(formula = logmedv ~ rmsq + age, data = Group1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.07887 -0.10964  0.03389  0.13020  1.41838

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3360096  0.0729281   32.03   <2e-16 ***
rmsq         0.0256286  0.0014406   17.79   <2e-16 ***
age         -0.0047542  0.0004661  -10.20   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2864 on 500 degrees of freedom
Multiple R-squared:  0.5129,    Adjusted R-squared:  0.5109
F-statistic: 263.2 on 2 and 500 DF,  p-value: < 2.2e-16

> p <- predict(fitmodel, newdata=Group2)
> SSPE <- sum((Group2$logmedv - p)^2)
> SSPE
[1] 0.02835043