Prediction of Real Estate Sale Price using Deep Learning

Data from US Counties

  • Data on Adjusted Sale Price were available for approximately 1260 US counties from Zillow Research for the period from Jan 1, 2010 to Jan 1, 2018. Other Real Estate data available by county included sale counts, percent price reduction, the Zillow Home Value Index (ZHVI), days on Zillow, monthly listings and the percent of sales of previous foreclosures. Real Estate (RE) data were available at a monthly frequency.
  • Other data by US county were assembled, including GDP (2015-2018 annual, Bureau of Economic Analysis), population patterns, unemployment and poverty, household income, County Business Patterns, and state, census region and division identifiers (2010-2018 annual, US Census Bureau). The prime interest rate was also obtained from Federal Reserve Economic Data (2010-2018 monthly, FRED).
  • The objective of the deep learning (DL) model was to predict each month's RE adjusted sale price one year ahead, using data from the same month of the current year (i.e. year over year). Thus the response variable is 12 months ahead of its associated predictors (a minimal sketch of this alignment appears after this list).
  • A DL model with 2 fully connected layers was used. The categorical variables in the model were represented using embeddings that were estimated along with the model fit. These values were saved and explored for possible patterns.
  • The final model obtained a validation root mean square percent error of 8% in predicting the adjusted sale price for the following year by US County.
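
The year-over-year alignment described above can be illustrated with a short pandas sketch. This is an illustration only, not the notebook's actual preprocessing (the prepared file re_final.csv already contains the shifted response): the column names (fips, year, month, spriceadj, spriceadj_resp) are taken from the dataset shown below, and the helper add_yoy_response is hypothetical.

In [ ]:
# Hypothetical sketch of the year-over-year response construction: for each
# county (fips) and calendar month, the response is the adjusted sale price
# observed 12 months later.
import pandas as pd

def add_yoy_response(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'spriceadj_resp' column equal to next year's 'spriceadj'
    for the same fips/month (column names assumed from the dataset below)."""
    nxt = df[['fips', 'year', 'month', 'spriceadj']].copy()
    # Shifting the year back by one aligns next year's price with this year's predictors.
    nxt['year'] = nxt['year'] - 1
    nxt = nxt.rename(columns={'spriceadj': 'spriceadj_resp'})
    return df.merge(nxt, on=['fips', 'year', 'month'], how='left')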
In [1]:
%reload_ext autoreload
%autoreload 2
In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from fastai.tabular import *

Data preparation

In [3]:
path = Path('Data/')
re_df = pd.read_csv(path/'re_final.csv')
In [4]:
re_df.head()
Out[4]:
fips year month gdp_percap gdp_pctchg pop_est birth_rate death_rate natinc_rate intmig_rate ... state region division preduc_resp zhvi_resp doz_resp mlist_resp salecounts_resp spriceadj_resp forpct_resp
0 1001 2011 1.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 5.483486 140012.0 NaN NaN NaN NaN NaN
1 1001 2011 2.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 3.064415 139985.0 NaN NaN NaN NaN NaN
2 1001 2011 3.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 3.083766 139981.0 NaN NaN NaN NaN NaN
3 1001 2011 4.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 3.002564 140039.0 NaN NaN NaN NaN NaN
4 1001 2011 5.0 NaN NaN 55208.0 11.567632 9.221367 2.346265 0.072752 ... AL South EastSouthCentral 2.635889 140007.0 NaN NaN NaN NaN NaN

5 rows × 83 columns

In [5]:
n = len(re_df); n
Out[5]:
261573
In [6]:
re_df.drop(['date','County_Name'], axis = 1, inplace = True)
re_df.fips = re_df.fips.astype(str)

Response: Adjusted Sale Price

In [7]:
re_asp_df = re_df.dropna(subset = ['spriceadj_resp'])
re_asp_df = re_asp_df.drop(['preduc_resp','zhvi_resp', 'doz_resp', 'mlist_resp', 
                            'salecounts_resp', 'forpct_resp'], axis = 1)
In [8]:
n_asp = len(re_asp_df); n_asp
Out[8]:
84286
In [9]:
re_asp_df = re_asp_df.reset_index(drop=True)

Use data from 2017 as the validation set

In [10]:
re_asp_df['year'].dtypes 
Out[10]:
dtype('int64')
In [11]:
re_asp_df['year'].max()
Out[11]:
2017
In [12]:
re_asp_df_train = re_asp_df[re_asp_df.year != 2017]
In [13]:
len(re_asp_df_train)
Out[13]:
69082
In [14]:
re_asp_df_valid = re_asp_df[re_asp_df.year == 2017]
In [15]:
len(re_asp_df_valid)
Out[15]:
15204
In [16]:
valid_idx = re_asp_df_valid.index
In [17]:
valid_idx
Out[17]:
Int64Index([69082, 69083, 69084, 69085, 69086, 69087, 69088, 69089, 69090,
            69091,
            ...
            84276, 84277, 84278, 84279, 84280, 84281, 84282, 84283, 84284,
            84285],
           dtype='int64', length=15204)
In [18]:
dep_var = ['spriceadj_resp']
In [19]:
re_asp_df.columns
Out[19]:
Index(['fips', 'year', 'month', 'gdp_percap', 'gdp_pctchg', 'pop_est',
       'birth_rate', 'death_rate', 'natinc_rate', 'intmig_rate', 'dommig_rate',
       'netmig_rate', 'labor_force', 'employed', 'unemployed',
       'unemployment_rate', 'pov_all', 'pov_under18', 'med_house_inc',
       'ann_payK_admin', 'ann_payK_agri', 'ann_payK_arts',
       'ann_payK_construct', 'ann_payK_edu', 'ann_payK_finance',
       'ann_payK_health', 'ann_payK_info', 'ann_payK_manage',
       'ann_payK_manufac', 'ann_payK_mining', 'ann_payK_nc', 'ann_payK_other',
       'ann_payK_prof_tech', 'ann_payK_re', 'ann_payK_retail_trade',
       'ann_payK_services', 'ann_payK_transport', 'ann_payK_utilities',
       'ann_payK_whole_trade', 'no_estab_admin', 'no_estab_agri',
       'no_estab_arts', 'no_estab_construct', 'no_estab_edu',
       'no_estab_finance', 'no_estab_health', 'no_estab_info',
       'no_estab_manage', 'no_estab_manufac', 'no_estab_mining', 'no_estab_nc',
       'no_estab_other', 'no_estab_prof_tech', 'no_estab_re',
       'no_estab_retail_trade', 'no_estab_services', 'no_estab_transport',
       'no_estab_utilities', 'no_estab_whole_trade', 'ann_payK_total',
       'no_estab_total', 'bldgs', 'bldgs_value', 'preduc', 'zhvi', 'doz',
       'mlist', 'salecounts', 'spriceadj', 'forpct', 'MPRIME', 'state',
       'region', 'division', 'spriceadj_resp'],
      dtype='object')
In [20]:
len(re_asp_df.columns)
Out[20]:
75
In [21]:
cat_vars = list(re_asp_df.columns[np.r_[0:3, 71:74]])
In [22]:
cont_vars = list(re_asp_df.columns[3:71])
In [23]:
df = re_asp_df[cat_vars + cont_vars + dep_var].copy()
In [24]:
procs=[FillMissing, Categorify, Normalize]
In [25]:
np.random.seed(4537)
In [26]:
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
                   .split_by_idx(valid_idx)
                   .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
                   .databunch())

Model: Adjusted Sale Price

In [27]:
max_log_y = np.log(np.max(re_asp_df_train['spriceadj_resp'])*1.2)  # log of 1.2x the largest training-set sale price
y_range = torch.tensor([0, max_log_y], device=defaults.device)      # bound the model's log-scale predictions to this range
In [28]:
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)
In [29]:
len(data.train_ds.cont_names)
Out[29]:
68
In [30]:
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In [31]:
learn.recorder.plot()
In [32]:
learn.fit_one_cycle(5, 1e-2, wd=0.2)
epoch train_loss valid_loss exp_rmspe time
0 0.028179 0.008650 0.089046 00:17
1 0.026375 0.050836 0.193486 00:17
2 0.020912 0.008818 0.094343 00:17
3 0.012700 0.008519 0.087740 00:17
4 0.007227 0.007409 0.079798 00:17
In [33]:
learn.save('rep1')

Embeddings

In [34]:
def emb_pca(num_embed, cat_name, df = df):
    # Extract the learned embedding matrix for one categorical variable,
    # dropping row 0 (the slot fastai reserves for missing/#na# values).
    embed_array = learn.model.embeds[num_embed].weight[1:]
    temp = embed_array.cpu().detach().numpy()
    temp = pd.DataFrame(temp).add_prefix('emb_')
    # Attach category labels; this assumes the embedding rows follow the same
    # order in which the unique values appear in df[cat_name].
    temp[cat_name] = df[cat_name].unique()
    # Merge onto the full data so every observation carries its category's
    # embedding, then reduce the embedding columns to two principal components.
    temp = pd.merge(df, temp, how = 'left', on = cat_name)
    embed_df = temp.filter(regex='^emb_', axis = 1)
    pca = PCA(n_components = 2)
    principalComponents = pca.fit_transform(embed_df)
    pc_embed_df = pd.DataFrame(data=principalComponents, columns = ['emb_pc1', 'emb_pc2'])
    # Return the two components alongside the category label for each row.
    pc_emb_df = pd.concat([pc_embed_df, df[[cat_name]]], axis = 1)
    return pc_emb_df
In [35]:
emb_data = emb_pca(num_embed = 5, cat_name = 'division')
In [36]:
emb_data.to_csv(path/'division_embed.csv', index = False)
In [37]:
emb_data = emb_pca(num_embed = 4, cat_name = 'region')
In [38]:
emb_data.to_csv(path/'region_embed.csv', index = False)
In [39]:
emb_data = emb_pca(num_embed = 3, cat_name = 'state')
In [40]:
emb_data.to_csv(path/'state_embed.csv', index = False)
In [41]:
emb_data = emb_pca(num_embed = 2, cat_name = 'month')
In [43]:
emb_data.to_csv(path/'month_embed.csv', index = False)
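
The saved embedding coordinates can then be examined for geographic or seasonal structure. The cell below is a minimal sketch of one way to do this, not part of the original analysis: it assumes the division_embed.csv file written above and keeps one labelled point per census division.

In [ ]:
# Hypothetical follow-up: scatter the first two principal components of the
# division embeddings, one labelled point per census division.
emb = pd.read_csv(path/'division_embed.csv').drop_duplicates('division')

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(emb['emb_pc1'], emb['emb_pc2'])
for _, row in emb.iterrows():
    ax.annotate(row['division'], (row['emb_pc1'], row['emb_pc2']))
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_title('Census division embeddings (first two principal components)')
plt.show()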