Housing Prices Analysis

Time series analysis in Python

https://github.com/MullerAC/springfield-housing-analysis

Purpose

Investigate and predict average housing prices over the next two years for the zip codes in Springfield, MO. Use data from 1996 to 2018 and several time series models to determine which zip code would be the best investment for buying houses.

Exploratory Data Analysis

Analyzing 15,000 different zip codes (the entirety of the Zillow dataset) isn’t very feasible, so after looking at the data we decided to focus on the six zip codes in one city: Springfield, Missouri. Since we are only interested in analyzing trends over time, we remove the unnecessary columns from the dataset. To run models on this data, we need to convert the string dates into datetime objects and reshape the dataframe from wide format to long format.

import pandas as pd

# Select the Springfield, MO rows
data_springfield = data.loc[(data['City'] == 'Springfield') & (data['State'] == 'MO')]
# Drop the non-date metadata columns (names per the Zillow export) and index by zip code
data_springfield = data_springfield.set_index('RegionName').drop(
    columns=['RegionID', 'City', 'State', 'Metro', 'CountyName', 'SizeRank'])
# Parse the remaining column names as dates and transpose so each zip code is a column
dates = pd.to_datetime(data_springfield.columns.values, format='%Y-%m')
springfield = data_springfield.T.set_index(dates)
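
The wide-to-long reshaping mentioned above can be sketched with `pandas.melt` on a toy frame. The column names here are illustrative, not the exact Zillow schema:

```python
import pandas as pd

# Toy wide-format frame: one row per zip code, one column per month (illustrative schema)
wide = pd.DataFrame({
    'RegionName': [65807, 65802],
    '1996-04': [52000, 48000],
    '1996-05': [52300, 48100],
})

# Melt the date columns into rows, then parse the dates
long = wide.melt(id_vars='RegionName', var_name='ds', value_name='y')
long['ds'] = pd.to_datetime(long['ds'], format='%Y-%m')
```

The `ds`/`y` column names match what Prophet expects later on, which is one reason the long format is convenient.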

The results of a Dickey-Fuller test reveal that our data is not stationary: every zip code's p-value is above the 0.05 threshold. However, the differencing inside our ARIMA models will take care of this.

  • 65807: 0.788008
  • 65802: 0.233519
  • 65804: 0.391874
  • 65810: 0.081698
  • 65806: 0.211948
  • 65809: 0.133164

The scale of our seasonality is within ±$100, even in the worst case. Since we are talking about housing prices in the hundreds of thousands, this accounts for roughly 0.1% of our values, so we can safely disregard seasonality.

ARIMA Models

The autocorrelation function (ACF) shows how each observation in a time series is related to its past values. The partial autocorrelation function (PACF) gives the correlation of a time series with its own lagged values, controlling for the values at all shorter lags. These charts help us determine the starting orders for an ARIMA model.

The baseline models use ARIMA and the (1, 2, 0) order as determined by our ACF and PACF analysis.

The final models use ARIMA with a different order for each zip code, as determined by the pmdarima.auto_arima function. All RMSE values are lower than their baselines. Despite the low RMSE values, the charts show the predictions fitting the data very closely, which may indicate overfitting. Based on a cursory look at the charts, the 65804 zip code has the sharpest incline in its forecast, suggesting a good investment opportunity.

from pmdarima.arima import auto_arima

order_dict = {}
for col in springfield.columns:
    # Let auto_arima search for the best non-seasonal (p, d, q) order for this zip code
    auto = auto_arima(springfield[col], max_order=None, start_p=1, start_q=1,
                      max_p=10, max_q=10, d=2, max_d=5,
                      information_criterion='aic', seasonal=False)
    order_dict[col] = auto.get_params()['order']

order_dict

Orders and cross-validated RMSE values by zip code:

  • 65807: (2, 2, 1), 1731
  • 65802: (0, 2, 1), 3342
  • 65804: (1, 2, 2), 2229
  • 65810: (2, 2, 2), 4222
  • 65806: (0, 2, 1), 3073
  • 65809: (1, 2, 0), 11553

Facebook Prophet

The same data was run on Facebook Prophet models. The baseline models use default parameters.

import itertools
import pandas as pd
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics

param_grid = {
    'changepoint_prior_scale': [0.005, 0.05, 0.5],      # default 0.05
    'seasonality_prior_scale': [1.0, 10.0],             # default 10
    'holidays_prior_scale': [1.0, 10.0],                # default 10
    'seasonality_mode': ['additive', 'multiplicative']  # default 'additive'
}
all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]

best_params_all = []
for col in springfield.columns[1:]:  # first column holds the 'ds' dates
    ts = springfield[['ds', col]].rename(columns={col: 'y'})
    best_params = dict.fromkeys(param_grid.keys())
    best_params['zipcode'] = col

    rmses = []  # Store the RMSEs for each params here
    # Use cross-validation to evaluate all parameter combinations
    for params in all_params:
        m = Prophet(**params).fit(ts)  # Fit model with the given params
        df_cv = cross_validation(m, initial='5475 days',  # ~15 years
                                 period='180 days', horizon='365 days')
        df_p = performance_metrics(df_cv, rolling_window=1)
        rmse = df_p['rmse'].values[0]
        if 'rmse' not in best_params or rmse < best_params['rmse']:
            best_params.update(params)
            best_params['rmse'] = rmse

    best_params_all.append(best_params)
    print(col, 'model best parameters:')
    print(best_params)

tuning_results = pd.DataFrame(best_params_all)
tuning_results.set_index('zipcode', inplace=True)

After hyperparameter tuning, we came up with better models for each zip code. These Prophet models mostly have higher RMSE values than their ARIMA equivalents, but they appear less prone to overfitting and have narrower confidence intervals on the forecasts.

Analysis

The model for each zip code was chosen based on RMSE. The ARIMA models performed better than the Prophet models for every zip code except 65802.

Choosing the better model for each zip code, we get the following forecast slopes:

  • 65809: 69.41
  • 65804: 34.19
  • 65807: 22.76
  • 65810: 11.8
  • 65802: 8.21
  • 65806: 4.32

A steep slope indicates a quick rise in value in the future, suggesting a good investment. By that measure, zip code 65809 looks best. However, the RMSE for that zip code’s model is much higher than the others’, indicating a less accurate forecast or more volatility. Instead, the zip code with the second-steepest slope, 65804, should be chosen as the top investment opportunity.
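
The source does not define the slope metric precisely; one reasonable reading, sketched below, is the fitted linear slope of the 24-month forecast in dollars per month (the forecast values here are hypothetical):

```python
import numpy as np

# Hypothetical 24-month forecast (dollars); a perfectly linear series for illustration
forecast = 150_000 + 34.19 * np.arange(24)

# Slope of the least-squares line through the forecast, in dollars per month
slope = np.polyfit(np.arange(len(forecast)), forecast, deg=1)[0]
```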

Conclusions

Based on the models, we have concluded that zip code 65804 would be the best zip code for the home renovation company to invest in. If it focuses on the 65804 area, the company should see average housing prices continue to increase over the next two years.

Future Work

It is important to note that these values are already old; even our two-year prediction is nine months out of date. In addition, the chosen ARIMA models showed apparent overfitting in their prediction charts, as well as wider confidence intervals. More work would need to be done before making actual investments based on these models.

More recent data would be needed in order to make this model useful for actual investments. If obtained, it could also be used as a sort of holdout dataset, verifying our current conclusions. We have also chosen a 2-year forecast range arbitrarily. In order to improve our model, we would need to perform an analysis on what range these models are most accurate over, and what time frame is useful in the field of housing investments.
