This document discusses predicting housing prices in Ames, Iowa using a dataset with 81 variables. The author first cleans the data by imputing missing values and removing outliers. Then, a multi-regression model is developed using backward selection to identify significant variables. However, this leads to overfitting due to having many variables. Therefore, the author creates a regularized linear model which improves accuracy by accounting for overfitting.
Housing prices project eeb
1. Predicting Prices in the Iowa Housing Market (Regularized Linear Regression)
Erik Bebernes
2. Introduction
This project asks a common question in the field of predictive analytics: what are houses worth?
Identifying the true price of a home is important in preventing a housing bubble, such as the one
that plagued our country in 2008 and ultimately led to a recession. The data I'm using comes
from Kaggle, and looks specifically at houses in Ames, Iowa. There are 81 variables, with the
response variable being "Sale Price." I worked on a similar problem as an undergraduate student
in an econometrics class, and although I really enjoyed it, I hadn't the slightest clue what I was
doing. Now that I am more knowledgeable when it comes to multi-regression analysis, I should
be able to come up with some fairly accurate predictions. Before I begin, here is a look at the 81
variables I'll be working with.
My plan of attack on this project is as follows:
1.) Identify any missing data (both missing at random and not at random) and impute new
data accordingly.
2.) Remove any outliers to reduce model complexity and avoid overfitting.
3.) Run a multi-regression model, using backward selection until every remaining variable is
significant (p-value below .05).
4.) Try a “regularized” linear model.
Identifying Missing Data and Cleaning It
The first thing I like to do in a lot of my projects is to run a "missmap" on the dataset to see how
much of the data is NA.
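(The `missmap` function comes from R's Amelia package and draws a missingness heatmap. For readers outside R, a comparable tabular check is easy to sketch in pandas; the tiny frame below is an illustrative stand-in, not the actual Ames data.)

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Ames data (column names are illustrative only).
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000],
    "PoolQC":    [np.nan, np.nan, "Gd",  np.nan],
    "Alley":     [np.nan, "Grvl", np.nan, np.nan],
    "LotArea":   [8450, 9600, np.nan, 9550],
})

# Fraction of NA per column, sorted worst-first -- the tabular
# equivalent of eyeballing a missingness map.
na_share = df.isna().mean().sort_values(ascending=False)
print(na_share)
```

Columns like "PoolQC" and "Alley" float to the top of this ranking, which is exactly the pattern the missmap reveals in the real dataset.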
3. A handful of variables are nearly completely missing; let's see what they are and why.
The variables with nearly all of their values missing are "Alley" (type of alley access), "PoolQC"
(pool quality), "FireplaceQu" (fireplace quality), "Fence" (fence quality), "Lot Frontage" (linear
feet of street connected to the property) and "MiscFeature" (miscellaneous feature not covered in other
4. categories). The descriptions of these variables make it obvious that the data is not missing at
random: the values are conditional on whether or not the house has that feature to begin with.
Look at all of the missing values related to garages and basements; these belong to the houses
that don't have garages or basements. It's also worth noting that the number of NAs is equal
across related categories (e.g., all of the garage variables have 81 missing values). There is an
easy fix for this: I'm going to replace all NAs in factor variables with "none" and all NAs in
numeric variables with 0.
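That two-rule imputation ("none" for factors, 0 for numerics) can be sketched compactly; the author works in R, but here is a minimal pandas equivalent on a hypothetical two-column frame:

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the data: one factor-like column, one numeric.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [548.0, np.nan, 460.0],
})

# Factor (object) columns: NA means the house lacks the feature -> "None".
factor_cols = df.select_dtypes(include="object").columns
df[factor_cols] = df[factor_cols].fillna("None")

# Numeric columns: same reasoning -> 0 (no garage means zero garage area).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)

print(df)
```

The key design point is that this is *informed* imputation: because the NAs are not missing at random, a constant fill encodes "feature absent" rather than guessing a typical value.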
7. Removing Outliers
By making scatterplots of the numeric variables against Sale Price, I'll be able to identify any
outliers and remove them from the dataset. This will simplify the model and reduce overfitting
when it comes time to make predictions.
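The author identifies outliers visually from the scatterplots. A programmatic stand-in for that eyeballing is the standard 1.5×IQR fence; this is my assumption for illustration, not the author's exact rule, and the numbers below are toy values:

```python
import pandas as pd

# Toy data: the last row is a deliberately extreme living-area value.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 5642],
    "SalePrice": [208500, 181500, 223500, 160000],
})

# Flag rows outside the 1.5 * IQR fences around the middle 50% of values.
q1, q3 = df["GrLivArea"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["GrLivArea"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[mask]
print(len(cleaned))
```

In practice a rule like this should be checked against the scatterplots rather than applied blindly; a large house sold cheaply may be an error worth dropping, while a genuinely expensive mansion is real signal.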
8. Multi-Regression Model
In developing my linear model, I used a backward selection method: I started by including
all of the independent variables and gradually removed the insignificant ones (those with a
p-value greater than .05).
The adjusted R-squared (which penalizes the fit for the number of predictors) is .8718,
meaning roughly 87% of the variance in sale price is explained by the model. The model as a
whole has a p-value of < 2.2e-16, making it significant. Time to make my prediction and see
how it stands up in the Kaggle rankings.
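For reference, the adjusted R-squared quoted above is the ordinary R-squared penalized for the number of predictors \(p\) relative to the number of observations \(n\):

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Adding a predictor only raises this quantity if it improves the fit by more than chance would, which is why it is the preferred summary when a model carries many variables.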
After submitting my prediction, I ranked only in the 13th percentile for accuracy. This is due to
the high variable-to-observation ratio, which leads to overfitting. To account for this I will
attempt to fit a "regularized" linear model using the caret package in R, but in order to do so I
need to convert the two-level factors into dummy variables.
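The two pieces of that plan, dummy-encoding factors and fitting a penalized regression, look like this outside of R as well. The sketch below uses pandas `get_dummies` plus scikit-learn's ridge regression as a stand-in for caret's workflow (my substitution; the data and column names are toy values):

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Toy frame with one two-level factor and one numeric predictor.
df = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y", "Y"],
    "GrLivArea":  [1710, 1262, 1786, 1500],
    "SalePrice":  [208500, 181500, 223500, 190000],
})

# One-hot encode the factor; drop_first avoids a redundant dummy column.
X = pd.get_dummies(df[["CentralAir", "GrLivArea"]], drop_first=True)
y = df["SalePrice"]

# Ridge regression: least squares plus an L2 penalty that shrinks
# coefficients toward zero, trading a little bias for less overfitting.
model = Ridge(alpha=1.0).fit(X, y)
print(X.columns.tolist())
```

The shrinkage penalty is the point: with 80 candidate predictors and only a few hundred usable observations per fold, regularization tames exactly the variable-to-observation problem that sank the unpenalized model.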