House Price Prediction with Regression: Comparing Regression Models

Nov 1, 2024

4 min read

In this project, I used regression models to predict house prices with the Kaggle house prices dataset. The dataset includes various features like the quality of materials, house size, and construction date, all of which are important predictors of a home’s market value. My goal was to build a model that could accurately predict house prices based on these features and perform well in the Kaggle competition.

The dataset was sourced from Kaggle's House Prices Competition. It includes details on houses and their characteristics, with SalePrice as the target variable.
The dataset provides 79 features, covering various aspects of each property, such as quality, size, layout, garage capacity, and age.
The data also came with both a training set that includes the SalePrice and a test set that does not. The reason it didn't include a SalePrice in the test set is that I have to submit my results to the Kaggle Website to get a score back. (As pictured below)

Data Set: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

Regression is a way of figuring out how different things relate to each other. The main idea is to see how one main variable changes when other variables change. By looking at these relationships, regression helps us make predictions and understand which factors have the biggest influence on our target, like the price of a house.

In this project, I used various regression techniques, including linear regression, Ridge regression, and Random Forest regression, to predict house prices. Each method helped refine the model to better capture the relationship between features like square footage, quality, and age of the house, and the sale price.

I used a heatmap to identify key features with a strong correlation to the target variable, SalePrice. The heatmap (Seen below) shows the most important predictors: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, and YearBuilt. These features stood out to me as having the most significant impact on house prices.

I handled missing values (Code Below) and converted categorical data to numerical using Label Encoding.
I selected features with the highest correlation to SalePrice to streamline the model and reduce noise.

Modeling:
I trained a basic Linear Regression model with the key features selected and evaluated it only on Kaggle.
Score: 0.61336 on Kaggle
The model provided a foundation but left room for improvement. Patterns in the residuals suggested it was less accurate at higher prices, indicating that a simple linear approach may not fully capture complex relationships.

I applied a log transformation to SalePrice to reduce skewness and improve predictions for high-price houses. (Seen Below)
The introduction of regularization through Ridge Regression helps to address multicollinearity and improve the stability of the model.
I created a Residual Plot to visualize errors :)

The improved score suggested that the model performed more consistently across price ranges. The residuals had a random scatter around zero, confirming that the model handled both high and low values well.

For this experiment, I used a Random Forest Regressor, which captures non-linear relationships and interactions between features that linear models may miss.

The Random Forest model achieved the lowest error, accurately capturing non-linear relationships. The plot showed a slight drop in accuracy at high prices, but overall, it was more reliable than previous models.

This project shows how regression models can make house price predictions more accurate, which could be really helpful for buyers, sellers, and real estate professionals. With better pricing information, people can make smarter decisions when buying or selling a home, and agents can provide more reliable estimates.
However, there are some potential downsides. If used irresponsibly, these models could push prices higher in popular areas, making housing less affordable. There’s also a risk that relying too much on algorithms could reinforce biases, favoring certain neighborhoods or types of homes.
In short, while these models can bring fairness and transparency to the market, it’s important to use them in a way that benefits everyone, without unintentionally contributing to housing inequality.

I have learned and shown how different regression techniques can impact the accuracy of predictions. I learned that using regularization and log transformation can significantly reduce prediction errors, especially when dealing with skewed data. This approach helped stabilize variance and improve model performance. Additionally, using non-linear models, like Random Forests, showed me how effective these methods can be for capturing complex interactions that linear models might miss. This project really showcases the importance of testing multiple approaches and applying data transformations to achieve the best possible results.
This project was a valuable experience in applying data science techniques to real-world problems, and I look forward to further exploring model tuning. :)