Project 1
Task #1: Pick a city and scrape as many observations as possible from Zillow
I chose to examine Madison, the capital of Wisconsin. I first scraped exactly 400 observations, generating a comma-separated values (.csv) file which I named homes.csv. To understand the data a little better, I looked at some basic summary statistics, shown in the table below.
|        | Beds | Bathrooms | Square Footage | Price     |
|--------|------|-----------|----------------|-----------|
| Mean   | 2.99 | 2.29      | 1814.95        | 467777.33 |
| 1Q     | 2    | 2         | 1221           | 259900    |
| Median | 3    | 2         | 1544           | 350000    |
| 3Q     | 4    | 3         | 2036.5         | 487550    |
| Max    | 6    | 5         | 5330           | 12999000  |
| Min    | 1    | 1         | 45             | 37000     |
I imported homes.csv into a Python project that uses Python 3.8, because in the next steps I use the TensorFlow library, which runs best on this Python version. However, before applying a machine learning model, the data needs to be cleaned.
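For reference, the statistics in the table above can be reproduced with pandas. This is a minimal sketch, assuming homes.csv has numeric columns named beds, baths, sqft, and price; the actual column names from the scrape may differ.

```python
import pandas as pd

# Load the scraped listings.
homes = pd.read_csv("homes.csv")

# Mean, quartiles, median, min, and max for the columns in the table above.
summary = homes[["beds", "baths", "sqft", "price"]].describe(
    percentiles=[0.25, 0.5, 0.75]
)
print(summary)
```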
Task #2: Clean the housing data you obtained and create a number of usable features and targets.
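The cleaning can be summarized as a minimal sketch. It assumes the four features referenced in Task #3 are beds, bathrooms, square footage, and zip code, with price as the target, and that the scraped columns arrive as text that needs coercing to numeric types; the column names here are illustrative, not the exact scraper output.

```python
import pandas as pd

homes = pd.read_csv("homes.csv")

# Coerce scraped text to numbers; unparseable entries become NaN and are dropped.
for col in ["beds", "baths", "sqft", "zip", "price"]:
    homes[col] = pd.to_numeric(homes[col], errors="coerce")
homes = homes.dropna(subset=["beds", "baths", "sqft", "zip", "price"])

# Features and target used in Task #3.
features = homes[["beds", "baths", "sqft", "zip"]]
target = homes["price"]
```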
Task #3: Train a model on your target and features.
I then stacked my features (x1, x2, x3, x4) and fit the model to my features and target. I used ten epochs because I wanted to see whether the model would run at all before committing to a larger number of epochs and a longer runtime. The model produced a loss of NaN, and when I created a column y_pred for the model's predictions, every y_pred value was NaN.
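A minimal sketch of this step, assuming a single-layer Keras regression model trained with mean squared error; the architecture and column names are assumptions, not the exact code.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

homes = pd.read_csv("homes.csv")

# The four feature vectors (column names assumed, as in the cleaning sketch).
x1 = homes["beds"].to_numpy(dtype="float32")
x2 = homes["baths"].to_numpy(dtype="float32")
x3 = homes["sqft"].to_numpy(dtype="float32")
x4 = homes["zip"].to_numpy(dtype="float32")
y = homes["price"].to_numpy(dtype="float32")

# Stack the features into an (n_samples, 4) matrix.
X = np.column_stack([x1, x2, x3, x4])

# One linear output unit; MSE loss for price regression.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# Short run first to confirm the pipeline works before a longer one.
model.fit(X, y, epochs=10)

# Store predictions alongside the original data.
homes["y_pred"] = model.predict(X).flatten()
```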
After seeing that the model ran on the data once the zip code variable was removed, I decided to run the model again with 100 epochs. However, the model did not improve much with the additional epochs, as shown below.
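A sketch of the 100-epoch re-run and the loss curve used to judge whether the extra epochs helped, under the same assumptions and with the zip code column left out.

```python
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

homes = pd.read_csv("homes.csv")

# Feature matrix without zip code (column names assumed).
X = homes[["beds", "baths", "sqft"]].to_numpy(dtype="float32")
y = homes["price"].to_numpy(dtype="float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
model.compile(optimizer="adam", loss="mse")
history = model.fit(X, y, epochs=100)

# Loss per epoch; a flat curve indicates little gain from additional epochs.
plt.plot(history.history["loss"])
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.show()
```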
Below is a plot of the absolute difference between predicted and actual house price against actual house price. The points resemble an upward-facing parabola: points in the middle are closest to an absolute difference of zero, supporting the hypothesis that the model best predicted the cost of homes in the middle of the price range. The model was poorest at predicting the cost of the most expensive homes; as house price increases, the difference between actual and predicted price increases as well. It also predicted the cost of the least expensive homes less accurately than it predicted homes in the middle of the price range. Summary statistics for the differences and predictions are given in the table below.
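The plot can be produced with matplotlib; a short sketch, assuming the homes dataframe from the earlier sketch still holds the price and y_pred columns.

```python
import matplotlib.pyplot as plt

# homes is the dataframe from the earlier sketch, with price and y_pred columns.
abs_diff = (homes["y_pred"] - homes["price"]).abs()

# Absolute prediction error against actual price.
plt.scatter(homes["price"], abs_diff, s=10)
plt.xlabel("Actual price ($)")
plt.ylabel("|Predicted - actual price| ($)")
plt.show()
```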
|        | Difference (predicted - actual) | Predicted Price | Actual Price |
|--------|---------------------------------|-----------------|--------------|
| Min    | -12514578.84                    | 306950.52       | 37000        |
| Max    | 515548.8                        | 562248.80       | 12999000     |
| Mean   | -28258.32                       | 439519.01       | 467777.33    |
| 1Q     | -42606                          | 421895.6        | 259900.0     |
| Median | 78020                           | 440404.51       | 350000       |
| 3Q     | 191739.82                       | 465606.93       | 487550       |
Task #4: Rank all homes from best to worst deal
Because this model is skewed toward predicting houses in the middle of the price range, by its parameters the cheapest houses are the best deals, because the model consistently over-predicts their price, and the most expensive houses are the worst deals, because the model under-predicts theirs. Below is a plot of the difference between predicted and actual price against actual price. The greater the difference, the better the deal; that is, the houses are ranked from left to right as worst to best.
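A short sketch of the ranking, assuming the difference is predicted price minus actual price, so that larger values indicate better deals; the address column name is an assumption about the scraped data.

```python
# Continues from the earlier sketch: homes holds price and y_pred columns.
homes["difference"] = homes["y_pred"] - homes["price"]

# Larger difference = model values the house well above its asking price = better deal.
ranked = homes.sort_values("difference", ascending=False)

# "address" is an assumed column name from the scrape.
print(ranked[["address", "price", "y_pred", "difference"]].head(10))
```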
The best deal is one of the cheaper houses in the dataset, 2407 Dunns Marsh Ter. This house costs 46700, but the model values it at 562249, a difference of over half a million dollars above its actual cost. The house has 3 bedrooms, which is average for the dataset, 4 bathrooms, which is above average, and 4988 square feet, which is close to the maximum. It is definitely a better buy (irrespective of price) than 5209 Harbor Court.
To improve this model, I would like to eventually add a spatial variable such as zip code. I think it would improve the accuracy for more expensive homes, which are likely expensive due to location. I would also like to add layers to the model to see whether that improves its predictions. For now, the model is somewhat useful in that it can predict the price of homes in the middle of the price range.