• Adjusted Sale Price by US county was collected from Zillow Research for the period from Jan 1, 2010 to Jan 1, 2018. Other real estate data available from Zillow for the period included sale counts, percent price reduction, days on Zillow, monthly listings and the percent of sales that had previously been foreclosures. Real estate data were available at a monthly frequency, and Adjusted Sale Price was available for approximately 1,260 US counties for the period 2016 - 2018.

  • Other county-level data were assembled from US government sources, including GDP (Bureau of Economic Analysis, 2015 - 2018), population patterns, unemployment and poverty, household income, and census regions and divisions (US Census Bureau, 2010 - 2018). These data were available on an annual basis. Also available from the Census Bureau were data on County Business Patterns (2010 - 2017), including the number of establishments and total annual payroll for different types of industries as classified by the North American Industry Classification System (NAICS). These data were transformed so that the by-industry values represented percentages of the county's total number of establishments and total payroll. The total establishments and total payroll were then divided by the county population, resulting in per capita values.

  • The objective of this analysis was to see 1) whether data from a previous year end (December) could predict the adjusted sale price at the end of the following year, and 2) which, if any, business patterns were associated with higher real estate sale prices. In contrast to our deep learning analysis of these data, here December data were extracted for 2016 with the 2017 response (training data set) and for 2017 with the 2018 response (test data set), in order both to look at the most recent information and to use the years for which the most covariates were available. Five models were fit: one linear model (Partial Least Squares, or PLS) and four non-linear models (Support Vector Machines [SVM], Multivariate Adaptive Regression Splines [MARS], Random Forests [RF] and Gradient Boosted Machines [GBM]). Cross-validation was used to pick the tuning parameters for the models, and SVM and GBM led the performance metrics on the training set. On the test set, SVM had the best performance, with an R-squared value of 0.93 and a root mean squared error (RMSE) of 0.12. Since the adjusted sale price was log-transformed before modeling, this corresponds to an approximate 12.7% error in predicting the sale price for the following December (exp(0.12) - 1 ≈ 0.127, assuming a natural log). RF and GBM had similar performance metrics on the test set.

  • Important variables in predicting a county’s median home sale price the following year were median household income, the number of establishments classified as Professional/Technical, Real Estate and Other by NAICS, and the death rate. To get an idea of the direction of these effects on sale price, univariate correlations were computed between each of these variables and adjusted sale price for both the training and test sets, which yielded similar results. Variables whose increase was associated with an increase in adjusted sale price were household income and the number of establishments classified as Professional/Technical and Real Estate. Variables whose increase was associated with a decrease in sale price were the death rate and the number of establishments classified as Other.
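
The County Business Patterns transformation described above (by-industry values as a percent of the county total, and totals divided by population) can be sketched as follows. This is a minimal illustration only: the function name, dictionary layout, and all numbers are hypothetical and are not taken from the analysis itself.

```python
# Hypothetical sketch of the County Business Patterns transformation.
# Field names and example numbers are illustrative, not from the source data.

def transform_county(establishments_by_naics, payroll_by_naics, population):
    """Return by-industry percent shares and per capita totals for one county."""
    total_est = sum(establishments_by_naics.values())
    total_pay = sum(payroll_by_naics.values())
    return {
        # Each industry's share of the county total, in percent.
        "est_pct": {k: 100.0 * v / total_est
                    for k, v in establishments_by_naics.items()},
        "pay_pct": {k: 100.0 * v / total_pay
                    for k, v in payroll_by_naics.items()},
        # County totals scaled by population.
        "est_per_capita": total_est / population,
        "pay_per_capita": total_pay / population,
    }

# Made-up county with three industry groups.
example = transform_county(
    {"Professional/Technical": 50, "Real Estate": 30, "Other": 20},
    {"Professional/Technical": 6000, "Real Estate": 3000, "Other": 1000},
    population=10000,
)
```

Expressing the industry values as shares keeps large and small counties comparable, while the per capita totals retain information about the overall level of business activity.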
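
The step from an RMSE of 0.12 on the log scale to an error of roughly 12.7% on the original price scale can be checked with a short calculation. This sketch assumes the response was transformed with the natural log, which the text does not state explicitly, and uses the rounded RMSE reported above. The function name is hypothetical.

```python
import math

def log_rmse_to_pct_error(rmse_log):
    """Convert an RMSE on the natural-log scale to an approximate
    multiplicative error on the original scale, in percent.

    Assumes a natural-log transform of the response (an assumption here).
    """
    return (math.exp(rmse_log) - 1.0) * 100.0

# Rounded test-set RMSE for the SVM model, as reported in the summary.
pct = log_rmse_to_pct_error(0.12)
print(f"approximate error: {pct:.1f}%")
```

An error of 0.12 in log units means a typical prediction is off by a factor of about exp(0.12), which is why the conversion is multiplicative rather than additive.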

Training Performance