The following gives more details regarding the Bitcoin Price model using Stock-to-Flow and Active Addresses described here.

Data

Stock-to-Flow and Active Addresses were explored using the following daily data obtained from Coin Metrics.

Feature Description
BTC / Block Cnt The sum count of blocks created that interval that were included in the main (base) chain
BTC / USD Denominated Closing Price The fixed closing price of the asset as of 00:00 UTC the following day (i.e., midnight UTC of the current day) denominated in USD. This price is generated by Coin Metrics’ fixing/reference rate service. Real-time PriceUSD is the fixed closing price of the asset as of the timestamp set by the block’s miner
BTC / Active Addr Cnt The sum count of unique addresses that were active in the network (either as a recipient or originator of a ledger change) that interval. All parties in a ledger change action (recipients and originators) are counted. Individual addresses are not double-counted if previously active

To compute the Stock-to-Flow, the blocks were converted to daily Bitcoin units according to the halving schedule. Then a rolling window was used to sum the Bitcoin units that flowed into the system during the previous year as of a given day. This yearly rate was the denominator and the total sum of the units to date in the system was the numerator in computing the Stock-to-Flow ratio.

Model

Both Bitcoin Scarcity (Stock-to-Flow) and Network Size (Active Addresses) were considered simultaneously by estimating their effect on price for the period from mid-2010 to present (May 29, 2021). Log10 values of both of these features, along with centered calendar time and a cyclical term representing month, were included in a Generalized Additive Model (GAM) with log10 Bitcoin price as the response.

For those less familiar with GAM models, they can be thought of as the next step along the continuum moving from purely linear models (LM) to those that are generalized linear (GLM). In GLMs, the predictor still has a linear form, but instead of assuming that the data are normally distributed (as in LMs) we can accommodate data of different distributions (e.g. Binomial, Poisson) using various link functions. GAMs take this a step further. While still using a linear predictor, flexible smoother functions are used to represent the effects of the model terms. One way to think of this type of modeling is that is allows more of a conversation between the modeler and the data than when the functional form of the relationship between the predictors and response is set beforehand. In a linear model, for example, the effect of the feature on the response is fixed across it’s range, i.e. there’s just one estimated parameter, no matter what the value the features has. Small values, big values, same per-unit effect on the response. Not so with the GAM, where the effect can change depending on the feature value. Further, the functional form of this relationship isn’t specified beforehand, but is estimated from the data using smooth terms. The smooth terms are build from basis functions, and probably the most important choice to make is the dimension of the basis set (k), resulting in more (smaller k) or less (larger k) smoothness, respectively. For the initial model, a value of k=20 was chosen for calendar time and log10 values of stock-to-flow and number of addresses, which were fit with thin plate regression splines (the default in the gam function). Month was fit with cyclic cubic regression splines and k=12.

Model fit, lower basis dimensions

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## price_log10 ~ s(ad_log10, k = basis_initial) + s(sf_log10, k = basis_initial) + 
##     s(time, k = basis_initial) + s(month, bs = "cc", k = 12)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.56819    0.00111    2314   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df      F p-value    
## s(ad_log10) 18.136  18.91  61.90  <2e-16 ***
## s(sf_log10) 18.825  18.99 120.18  <2e-16 ***
## s(time)     18.661  18.97 341.13  <2e-16 ***
## s(month)     9.225  10.00  59.91  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.998   Deviance explained = 99.8%
## -REML = -4713.1  Scale est. = 0.0048907  n = 3969

Checking the model fit

## 
## Method: REML   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-8.666761e-06,6.552098e-06]
## (score -4713.132 & scale 0.004890686).
## Hessian positive definite, eigenvalue range [4.412503,1982.628].
## Model rank =  68 / 68 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'   edf k-index p-value    
## s(ad_log10) 19.00 18.14    0.99    0.23    
## s(sf_log10) 19.00 18.82    0.33  <2e-16 ***
## s(time)     19.00 18.66    0.14  <2e-16 ***
## s(month)    10.00  9.23    0.17  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are several important take-aways from this interesting model. All of the terms appear to be highly significant in relation to Bitcoin price, and appear to explain virtually all of its variation (over 99%). However, the first step is to evaluate whether the k basis functions have been adequate to describe the functional form of the coefficients. With the initial chosen values of k, there remains a cyclic pattern in the residuals, and the effective degrees of freedom (edf), particularly for time and log10 stock-to-flow, appear to be bumping up against the k values that were chosen. This likely means that the smooth functions for these parameters are too constrained to be able to explain their full functional form as displayed in the data. The next step in this case is to fit the model again, this time with a larger k (here k=100) for these features.

Model fit, higher basis dimensions

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## price_log10 ~ s(ad_log10, k = 20) + s(sf_log10, k = 100) + s(time, 
##     k = 100) + s(month, bs = "cc", k = 12)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.5681899  0.0005713    4495   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df      F  p-value    
## s(ad_log10) 14.541  17.11 13.803  < 2e-16 ***
## s(sf_log10) 80.293  87.24 10.870  < 2e-16 ***
## s(time)     81.665  85.99 78.618  < 2e-16 ***
## s(month)     6.092  10.00  2.258 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.999   Deviance explained = 99.9%
## -REML = -7026.8  Scale est. = 0.0012954  n = 3969

Checking the model fit

## 
## Method: REML   Optimizer: outer newton
## full convergence after 9 iterations.
## Gradient range [-0.0001245011,2.586162e-05]
## (score -7026.83 & scale 0.001295352).
## Hessian positive definite, eigenvalue range [0.9423967,1984.155].
## Model rank =  228 / 228 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'   edf k-index p-value    
## s(ad_log10) 19.00 14.54    0.99    0.23    
## s(sf_log10) 99.00 80.29    0.53  <2e-16 ***
## s(time)     99.00 81.66    0.23  <2e-16 ***
## s(month)    10.00  6.09    0.25  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model with higher basis dimensions still doesn’t demonstrate ideal diagnostics but improves upon the pattern that was previously seen in the residuals, and gives some breathing room between the effective degrees of freedom and the chosen k values. A model with k=200 for time and log10 stock-to-flow was also fit (not shown), however the basis dimension check was still significant, with similar space between the edf and k values. All other elements of the model fit were similar, so a final model of k=100 for these parameters was chosen. Having converged on a model fit, we are ready to look at the smooth coefficients here.