Prediction of Real Estate Sale Price using Deep Learning

Data from US Counties

  • Data on Adjusted Sale Price were available for approximately 1260 US counties from Zillow Research for the period from Jan 1, 2010 to Jan 1, 2018. Other Real Estate data available by county included sale counts, percent price reduction, the Zillow Home Value Index (ZHVI), days on Zillow, monthly listings and the percent of sales of previous foreclosures. Real Estate (RE) data were available at a monthly frequency.
  • Other data by US county were assembled, including GDP (2015-2018 annual, Bureau of Economic Analysis), population patterns, unemployment and poverty, household income, County Business Patterns, and state, census region and division identifiers (2010-2018 annual, US Census Bureau). The prime interest rate was also obtained from Federal Reserve Economic Data (2010-2018 monthly, FRED).
  • The objective of the deep learning (DL) model was to predict each month's RE adjusted sale price one year ahead, using data from the same month of the current year (i.e. year over year). Thus the response variable is 12 months ahead of its associated predictors (a minimal sketch of this alignment appears after this list).
  • A DL model with 2 fully connected layers was used. The categorical variables in the model were represented using embeddings that were estimated along with the model fit. These values were saved and explored for possible patterns.
  • The final model obtained a validation root mean square percent error of 8% in predicting the adjusted sale price for the following year by US County.
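
The year-over-year alignment described above can be illustrated with a short pandas sketch. This is an illustration only, not the notebook's actual preprocessing (the prepared file re_final.csv already contains the shifted response): the column names (fips, year, month, spriceadj, spriceadj_resp) are taken from the dataset shown below, and the helper add_yoy_response is hypothetical.

In [ ]:
# Hypothetical sketch of the year-over-year response construction: for each
# county (fips) and calendar month, the response is the adjusted sale price
# observed 12 months later.
import pandas as pd

def add_yoy_response(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'spriceadj_resp' column equal to next year's 'spriceadj'
    for the same fips/month (column names assumed from the dataset below)."""
    nxt = df[['fips', 'year', 'month', 'spriceadj']].copy()
    # Shifting the year back by one aligns next year's price with this year's predictors.
    nxt['year'] = nxt['year'] - 1
    nxt = nxt.rename(columns={'spriceadj': 'spriceadj_resp'})
    return df.merge(nxt, on=['fips', 'year', 'month'], how='left')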
In [1]:
%reload_ext autoreload
%autoreload 2
In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from fastai.tabular import *

Data preparation

In [3]:
path = Path('Data/')
re_df = pd.read_csv(path/'re_final.csv')
In [4]:
re_df.head()
Out[4]:
fips year month gdp_percap gdp_pctchg pop_est birth_rate death_rate natinc_rate intmig_rate ... state region division preduc_resp zhvi_resp doz_resp mlist_resp salecounts_resp spriceadj_resp forpct_resp
0 1001 2011 1.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 5.483486 140012.0 NaN NaN NaN NaN NaN
1 1001 2011 2.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 3.064415 139985.0 NaN NaN NaN NaN NaN
2 1001 2011 3.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 3.083766 139981.0 NaN NaN NaN NaN NaN
3 1001 2011 4.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 3.002564 140039.0 NaN NaN NaN NaN NaN
4 1001 2011 5.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 2.635889 140007.0 NaN NaN NaN NaN NaN

5 rows × 83 columns

In [5]:
n = len(re_df); n
Out[5]:
261573
In [6]:
re_df.drop(['date','County_Name'], axis = 1, inplace = True)
re_df.fips = re_df.fips.astype(str)

Response: Adjusted Sale Price

In [7]:
re_asp_df = re_df.dropna(subset = ['spriceadj_resp'])
re_asp_df = re_asp_df.drop(['preduc_resp','zhvi_resp', 'doz_resp', 'mlist_resp', 
                            'salecounts_resp', 'forpct_resp'], axis = 1)
In [8]:
n_asp = len(re_asp_df); n_asp
Out[8]:
84286
In [9]:
re_asp_df = re_asp_df.reset_index(drop=True)

Use data from 2017 as the validation set

In [10]:
re_asp_df['year'].dtypes 
Out[10]:
dtype('int64')
In [11]:
re_asp_df['year'].max()
Out[11]:
2017
In [12]:
re_asp_df_train = re_asp_df[re_asp_df.year != 2017]
In [13]:
len(re_asp_df_train)
Out[13]:
69082
In [14]:
re_asp_df_valid = re_asp_df[re_asp_df.year == 2017]
In [15]:
len(re_asp_df_valid)
Out[15]:
15204
In [16]:
valid_idx = re_asp_df_valid.index
In [17]:
valid_idx
Out[17]:
Int64Index([69082, 69083, 69084, 69085, 69086, 69087, 69088, 69089, 69090,
            69091,
            ...
            84276, 84277, 84278, 84279, 84280, 84281, 84282, 84283, 84284,
            84285],
           dtype='int64', length=15204)
In [18]:
dep_var = ['spriceadj_resp']
In [19]:
re_asp_df.columns
Out[19]:
Index(['fips', 'year', 'month', 'gdp_percap', 'gdp_pctchg', 'pop_est',
       'birth_rate', 'death_rate', 'natinc_rate', 'intmig_rate', 'dommig_rate',
       'netmig_rate', 'labor_force', 'employed', 'unemployed',
       'unemployment_rate', 'pov_all', 'pov_under18', 'med_house_inc',
       'ann_payK_admin', 'ann_payK_agri', 'ann_payK_arts',
       'ann_payK_construct', 'ann_payK_edu', 'ann_payK_finance',
       'ann_payK_health', 'ann_payK_info', 'ann_payK_manage',
       'ann_payK_manufac', 'ann_payK_mining', 'ann_payK_nc', 'ann_payK_other',
       'ann_payK_prof_tech', 'ann_payK_re', 'ann_payK_retail_trade',
       'ann_payK_services', 'ann_payK_transport', 'ann_payK_utilities',
       'ann_payK_whole_trade', 'no_estab_admin', 'no_estab_agri',
       'no_estab_arts', 'no_estab_construct', 'no_estab_edu',
       'no_estab_finance', 'no_estab_health', 'no_estab_info',
       'no_estab_manage', 'no_estab_manufac', 'no_estab_mining', 'no_estab_nc',
       'no_estab_other', 'no_estab_prof_tech', 'no_estab_re',
       'no_estab_retail_trade', 'no_estab_services', 'no_estab_transport',
       'no_estab_utilities', 'no_estab_whole_trade', 'ann_payK_total',
       'no_estab_total', 'bldgs', 'bldgs_value', 'preduc', 'zhvi', 'doz',
       'mlist', 'salecounts', 'spriceadj', 'forpct', 'MPRIME', 'state',
       'region', 'division', 'spriceadj_resp'],
      dtype='object')
In [20]:
len(re_asp_df.columns)
Out[20]:
75
In [21]:
cat_vars = list(re_asp_df.columns[np.r_[0:3, 71:74]])
In [22]:
cont_vars = list(re_asp_df.columns[3:71])
In [23]:
df = re_asp_df[cat_vars + cont_vars + dep_var].copy()
In [24]:
procs=[FillMissing, Categorify, Normalize]
In [25]:
np.random.seed(4537)
In [26]:
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .databunch())

Model: Adjusted Sale Price

In [27]:
max_log_y = np.log(np.max(re_asp_df_train['spriceadj_resp'])*1.2)  # log of 1.2x the largest training-set sale price
y_range = torch.tensor([0, max_log_y], device=defaults.device)      # bound the model's log-scale predictions to this range
In [28]:
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)
In [29]:
len(data.train_ds.cont_names)
Out[29]:
68
In [30]:
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [31]:
learn.recorder.plot()
In [32]:
learn.fit_one_cycle(5, 1e-2, wd=0.2)
epoch train_loss valid_loss exp_rmspe time
0 0.028179 0.008650 0.089046 00:17
1 0.026375 0.050836 0.193486 00:17
2 0.020912 0.008818 0.094343 00:17
3 0.012700 0.008519 0.087740 00:17
4 0.007227 0.007409 0.079798 00:17
In [33]:
learn.save('rep1')

Embeddings

In [34]:
def emb_pca(num_embed, cat_name, df = df):
    # Extract the learned embedding matrix for one categorical variable,
    # dropping row 0 (the slot fastai reserves for missing/#na# values).
    embed_array = learn.model.embeds[num_embed].weight[1:]
    temp = embed_array.cpu().detach().numpy()
    temp = pd.DataFrame(temp).add_prefix('emb_')
    # Attach category labels; this assumes the embedding rows follow the same
    # order in which the unique values appear in df[cat_name].
    temp[cat_name] = df[cat_name].unique()
    # Merge onto the full data so every observation carries its category's
    # embedding, then reduce the embedding columns to two principal components.
    temp = pd.merge(df, temp, how = 'left', on = cat_name)
    embed_df = temp.filter(regex='^emb_', axis = 1)
    pca = PCA(n_components = 2)
    principalComponents = pca.fit_transform(embed_df)
    pc_embed_df = pd.DataFrame(data=principalComponents, columns = ['emb_pc1', 'emb_pc2'])
    # Return the two components alongside the category label for each row.
    pc_emb_df = pd.concat([pc_embed_df, df[[cat_name]]], axis = 1)
    return pc_emb_df
In [35]:
emb_data = emb_pca(num_embed = 5, cat_name = 'division')
In [36]:
emb_data.to_csv(path/'division_embed.csv', index = False)
In [37]:
emb_data = emb_pca(num_embed = 4, cat_name = 'region')
In [38]:
emb_data.to_csv(path/'region_embed.csv', index = False)
In [39]:
emb_data = emb_pca(num_embed = 3, cat_name = 'state')
In [40]:
emb_data.to_csv(path/'state_embed.csv', index = False)
In [41]:
emb_data = emb_pca(num_embed = 2, cat_name = 'month')
In [43]:
emb_data.to_csv(path/'month_embed.csv', index = False)
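
The saved embedding coordinates can then be examined for geographic or seasonal structure. The cell below is a minimal sketch of one way to do this, not part of the original analysis: it assumes the division_embed.csv file written above and keeps one labelled point per census division.

In [ ]:
# Hypothetical follow-up: scatter the first two principal components of the
# division embeddings, one labelled point per census division.
emb = pd.read_csv(path/'division_embed.csv').drop_duplicates('division')

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(emb['emb_pc1'], emb['emb_pc2'])
for _, row in emb.iterrows():
    ax.annotate(row['division'], (row['emb_pc1'], row['emb_pc2']))
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_title('Census division embeddings (first two principal components)')
plt.show()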