Basic Linear Regression Modeling in Python

Andrew Muller
Nov 30, 2020

For my course in Data Science, I built a linear regression model of house sales in King County. Sales data for 2014 and 2015 were provided, and Python was used for the modeling. This is a rundown of my first linear regression modeling project. For the full code, see the GitHub repo: https://github.com/MullerAC/king-county-house-sales

Business Case

It is important to establish a business case when starting a project like this: it creates a goal and helps move the work toward a meaningful conclusion. My last blog post's project didn't have any real goal, so it just stopped when I no longer felt curious. It's better to start with a firm goal in mind and build toward it. Here is the business case I used:

We will predict how much a house should be sold for in order to determine whether a house on the market is being underpriced or overpriced. Our clients are homeowners looking to sell their house, but do not know how much to sell their house for.

Data Cleaning

Both the data itself and a description of the columns were supplied. I read the data into a pandas dataframe to start cleaning it. After converting all the data to numeric types, I handle any NaN values and create a few new features: ‘yr_since_renovation’, ‘yr_since_built’, and ‘renovated’. I then drop unneeded features: ‘view’, ‘sqft_above’, ‘sqft_living15’, ‘sqft_lot15’, and ‘date’.

import pandas as pd
import numpy as np

df = pd.read_csv('data/kc_house_data.csv')

# the sale date is formatted like 'MM/DD/YYYY'; keep only the year
df['yr_sold'] = df.date.map(lambda x: int(x.split('/')[-1]))

# 'sqft_basement' contains '?' placeholders; replace them before converting to numeric
df.replace({'sqft_basement': {'?': '0.0'}}, inplace=True)
df.sqft_basement = pd.to_numeric(df.sqft_basement)
df.fillna(0.0, inplace=True)

# new features: years since the last renovation (falling back to the year built),
# age of the house at sale, and a binary renovated flag
df['yr_since_renovation'] = np.where(df['yr_renovated']==0.0, df['yr_sold']-df['yr_built'], df['yr_sold']-df['yr_renovated'])
df['yr_since_built'] = df['yr_sold'] - df['yr_built']
df['renovated'] = df.yr_renovated.map(lambda x: 1 if x>0 else 0)

df.drop(['id', 'view', 'sqft_above', 'sqft_living15', 'sqft_lot15', 'date'], axis=1, inplace=True)
df.to_csv('data/cleaned_data.csv', index=False)

Exploratory Data Analysis

With the data cleaned, I start to analyze it. Scatter plots of each variable against the target, price, show which variables have obvious linear relationships. Insignificant relationships will be taken care of when I look at the p-values of the coefficients, so they don't need to be handled now. The plots also make it clearer which variables are categorical and which are continuous.

import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline

df = pd.read_csv('data/cleaned_data.csv')

# one scatter plot per feature, each against the target
plt.figure(figsize=(15,30))
for i, col in enumerate(df.drop('price', axis=1).columns):
    ax = plt.subplot(6, 3, i+1)
    df.plot.scatter(x=col, y='price', ax=ax, legend=False)
plt.tight_layout()
plt.savefig('figures/scatter-plots.png')
plt.show()
Scatter Plots

Histograms show which of the variables are normally distributed. None of them are, so I will need to log transform them later.

df.hist(figsize = (20,18))
plt.tight_layout()
plt.savefig('figures/histogram-plots.png')
plt.show()
Histograms

I look at collinear features using a heatmap of the correlation matrix and decide to remove the ‘yr_built’, ‘yr_renovated’, ‘yr_sold’, and ‘yr_since_renovation’ columns.

import seaborn as sns

# correlation matrix of the cleaned features
corr = df.corr()

plt.figure(figsize=(7, 7))
sns.heatmap(corr, center=0, annot=True)
plt.tight_layout()
plt.savefig('figures/heatmap-before.png')
plt.show()
Heatmap of Collinear Features

Feature Engineering and Baseline Model

To prepare for modeling, I need to create dummy variables from the categorical data. Those features are converted from numeric to string so that the pandas get_dummies function treats them as categorical. I also remove any punctuation in the column names, since special characters would break the statsmodels formula later on.

df = pd.read_csv('data/cleaned_data.csv')

# one-hot encode the categorical features; casting them to strings first
# makes get_dummies treat them as categorical
categoricals = ['floors', 'condition', 'grade', 'zipcode']
df = df.astype({col: 'str' for col in categoricals})
df = pd.get_dummies(df, drop_first=True)

# strip characters that would break the formula syntax used later
subs = [(' ', '_'), ('.', ''), (',', ''), ("'", ''), ('™', ''), ('®', ''), ('+', 'plus'), ('½', 'half'), ('-', '_')]
def col_formatting(col):
    for old, new in subs:
        col = col.replace(old, new)
    return col
df.columns = [col_formatting(col) for col in df.columns]

At this point, I create the train-test split, using the sklearn function. I keep the default of 75% train and 25% test, and set a random state for reproducibility.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, random_state=7)
train.to_csv('data/train.csv', index=False)
test.to_csv('data/test.csv', index=False)

With this, I can create the baseline model, using statsmodels.

from statsmodels.formula.api import ols

# build an R-style formula: price ~ feature1 + feature2 + ...
predictors = '+'.join(train.columns[1:])
formula = 'price' + '~' + predictors
model = ols(formula=formula, data=train).fit()

The baseline model has the following metrics:

  • R2 of 0.821
  • Test RMSE of 139700
  • 81 significant features (p-value < 0.05) of 103 features total
Baseline Model Q-Q Plot

Iterative Modeling Process

I created the following function to print out the metrics for any model, making it easy to confirm that each step actually improved the model.

import statsmodels.api as sm
import scipy.stats as stats
from sklearn.metrics import mean_squared_error

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

def get_model_data(model, is_log_transformed=False):
    train_r2, train_r2_adj = model.rsquared, model.rsquared_adj

    # RMSE on the training set, converted back to dollars if price was log transformed
    y_hat_train = model.predict(train.drop('price', axis=1))
    y_train = train['price']
    if is_log_transformed:
        train_rmse = np.sqrt(mean_squared_error(np.exp(y_train), np.exp(y_hat_train)))
    else:
        train_rmse = np.sqrt(mean_squared_error(y_train, y_hat_train))

    # RMSE on the held-out test set
    y_hat_test = model.predict(test.drop('price', axis=1))
    y_test = test['price']
    if is_log_transformed:
        test_rmse = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_hat_test)))
    else:
        test_rmse = np.sqrt(mean_squared_error(y_test, y_hat_test))

    # count the coefficients significant at the 0.05 level
    pvalues = model.pvalues.to_dict()
    significant_items = {key: value for key, value in pvalues.items() if value < 0.05}

    print('R2 =', train_r2)
    print('R2 adjusted =', train_r2_adj)
    print('RMSE (train) =', train_rmse)
    print('RMSE (test) =', test_rmse)
    print('number of significant features =', len(significant_items))

    sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
    plt.title('Q-Q Plot')
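
As a usage note, calling the function on the fitted baseline model prints the metrics listed above and draws the Q-Q plot:

get_model_data(model)
# after the log transform step below, pass is_log_transformed=True
# so the RMSE is reported in dollars rather than log dollars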

To improve on our model, I took the following steps:

  • dropped collinear features, as determined during the EDA
train.drop(['yr_built', 'yr_renovated', 'yr_sold', 'yr_since_renovation'] , axis=1, inplace=True)
test.drop(['yr_built', 'yr_renovated', 'yr_sold', 'yr_since_renovation'] , axis=1, inplace=True)
  • removed outliers from our datasets: prices more than 3 standard deviations above the mean
# mean and standard deviation of the price column
mean = 5.402966e+05
std = 3.673681e+05
upper_cutoff = mean + (3*std)
train = train[train['price'] < upper_cutoff]
test = test[test['price'] < upper_cutoff]
  • log transformed the applicable continuous features (those with no negative or zero values)
continuous = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot']
for col in continuous:
    train[col] = train[col].map(np.log)
    test[col] = test[col].map(np.log)
  • eliminated insignificant features (p-value > 0.05) one at a time, refitting after each drop, since the remaining p-values change whenever a feature is removed
model_dict = list(dict(model.pvalues).items())
model_dict.sort(key=lambda x: x[1], reverse=True)
highest_pvalue = model_dict[0]

while highest_pvalue[1] > 0.05:
    print(f'Dropping "{highest_pvalue[0]}" with p-value {highest_pvalue[1]}')
    train.drop(highest_pvalue[0], inplace=True, axis=1)
    test.drop(highest_pvalue[0], inplace=True, axis=1)

    # refit the model without the dropped feature
    predictors = '+'.join(train.columns[1:])
    formula = 'price' + '~' + predictors
    model = ols(formula=formula, data=train).fit()

    # find the next least significant feature
    model_dict = list(dict(model.pvalues).items())
    model_dict.sort(key=lambda x: x[1], reverse=True)
    highest_pvalue = model_dict[0]

The final model has the following metrics:

  • R2 of 0.857
  • Test RMSE of 99701
  • 90 significant features (p-value < 0.05) of 90 features total
Final Model Q-Q Plot

In addition to these metrics, I also plotted the residuals of the final model.

# residuals against fitted values, with a zero reference line
plt.scatter(model.predict(train.drop('price', axis=1)), model.resid)
plt.plot(model.predict(train.drop('price', axis=1)), [0 for i in range(len(train))])
plt.title('Final Model Residuals')
plt.tight_layout()
plt.savefig('figures/final-residuals-plot.png')
plt.show()
Final Model Residuals Plot

Finally, I save the model coefficients, so they can be analyzed and used to create a prediction function.

model.params.to_csv('data/final_model.csv', header=False)
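
As a sketch of what that prediction function might look like (a hypothetical predict_price helper, assuming the continuous inputs and the target are log transformed as above):

coefficients = pd.read_csv('data/final_model.csv', header=None, index_col=0).squeeze()

def predict_price(features):
    # features: dict mapping dummy-encoded, log-transformed column names to values;
    # anything not supplied is treated as 0
    log_price = coefficients['Intercept']  # statsmodels names the constant 'Intercept'
    for name, value in features.items():
        log_price += coefficients.get(name, 0.0) * value
    return np.exp(log_price)  # convert log price back to dollars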

Final Model Analysis

By looking at the coefficients of the model, some interesting observations can be made (how the percentages below follow from the coefficients is sketched after this list).

  • each bedroom decreases the sale price of a house by 5%
  • each bathroom increases the sale price of a house by 6%
  • a 1% increase in living area square footage increases the sale price of a house by .48%
  • a 1% increase in lot area square footage increases the sale price of a house by .07%
  • if the house is on the waterfront, the sale price of a house increases by 60%
  • each additional square foot of basement area decreases the sale price of a house by .00005%
  • moving north, a 1 degree increase in latitude increases the sale price of a house by 65%
  • moving east, a 1 degree increase in longitude decreases the sale price of a house by 53%
  • a 1-year increase in the age of a house increases its sale price by .04%
  • a house that has been renovated has its sale price increased by 6%
  • using a one-floor house as a baseline, a 1.5-floor house has its price increased by 1.5%, while a 3-floor house has its price decreased by 6.4%. Other numbers of floors are approximately equal in price to a 1-floor house.
  • using a condition of 1 as a baseline, a condition of 2 increases the price by 19%, a condition of 3 increases the price by 31%, a condition of 4 increases the price by 35%, and a condition of 5 increases the price by 40%
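
These percentages come from the log transformation of price: for an untransformed feature, a coefficient β corresponds to roughly a (e^β − 1) × 100% change in price per unit increase, and when the feature is also log transformed, β is approximately the percent change in price per 1% change in the feature. A quick check, using a hypothetical coefficient value for illustration:

beta_waterfront = 0.47  # hypothetical coefficient on the binary waterfront flag
percent_change = (np.exp(beta_waterfront) - 1) * 100
print(f'{percent_change:.0f}% increase in sale price')  # prints 60%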

Conclusions

The model could be improved in the future by adding interaction features, polynomial features, and map data.
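
For example, sklearn's PolynomialFeatures can generate both polynomial terms and pairwise interactions; a minimal sketch of one possible next step, applied to a few of the continuous features (not part of the current model):

from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds squared terms and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(train[['sqft_living', 'sqft_lot', 'bathrooms']])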

The final model will be useful in predicting sale prices of houses in King County. We can use these predictions to help our clients set the prices for their houses, and find houses that are currently underpriced.
