The following gives more details regarding the Bitcoin Price model using Stock-to-Flow and Active Addresses described here.
Stock-to-Flow and Active Addresses were explored using the following daily data obtained from Coin Metrics.
Feature | Description |
---|---|
BTC / Block Cnt | The sum count of blocks created that interval that were included in the main (base) chain |
BTC / USD Denominated Closing Price | The fixed closing price of the asset as of 00:00 UTC the following day (i.e., midnight UTC of the current day) denominated in USD. This price is generated by Coin Metrics’ fixing/reference rate service. Real-time PriceUSD is the fixed closing price of the asset as of the timestamp set by the block’s miner |
BTC / Active Addr Cnt | The sum count of unique addresses that were active in the network (either as a recipient or originator of a ledger change) that interval. All parties in a ledger change action (recipients and originators) are counted. Individual addresses are not double-counted if previously active |
To compute the Stock-to-Flow, the blocks were converted to daily Bitcoin units according to the halving schedule. Then a rolling window was used to sum the Bitcoin units that flowed into the system during the previous year as of a given day. This yearly rate was the denominator and the total sum of the units to date in the system was the numerator in computing the Stock-to-Flow ratio.
Both Bitcoin Scarcity (Stock-to-Flow) and Network Size (Active Addresses) were considered simultaneously by estimating their effect on price for the period from mid-2010 to present (May 29, 2021). Log10 values of both of these features, along with centered calendar time and a cyclical term representing month, were included in a Generalized Additive Model (GAM) with log10 Bitcoin price as the response.
For those less familiar with GAM models, they can be thought of as the next step along the continuum moving from purely linear models (LM) to those that are generalized linear (GLM). In GLMs, the predictor still has a linear form, but instead of assuming that the data are normally distributed (as in LMs) we can accommodate data of different distributions (e.g. Binomial, Poisson) using various link functions. GAMs take this a step further. While still using a linear predictor, flexible smoother functions are used to represent the effects of the model terms. One way to think of this type of modeling is that is allows more of a conversation between the modeler and the data than when the functional form of the relationship between the predictors and response is set beforehand. In a linear model, for example, the effect of the feature on the response is fixed across it’s range, i.e. there’s just one estimated parameter, no matter what the value the features has. Small values, big values, same per-unit effect on the response. Not so with the GAM, where the effect can change depending on the feature value. Further, the functional form of this relationship isn’t specified beforehand, but is estimated from the data using smooth terms. The smooth terms are build from basis functions, and probably the most important choice to make is the dimension of the basis set (k), resulting in more (smaller k) or less (larger k) smoothness, respectively. For the initial model, a value of k=20 was chosen for calendar time and log10 values of stock-to-flow and number of addresses, which were fit with thin plate regression splines (the default in the gam function). Month was fit with cyclic cubic regression splines and k=12.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## price_log10 ~ s(ad_log10, k = basis_initial) + s(sf_log10, k = basis_initial) +
## s(time, k = basis_initial) + s(month, bs = "cc", k = 12)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.56819 0.00111 2314 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(ad_log10) 18.136 18.91 61.90 <2e-16 ***
## s(sf_log10) 18.825 18.99 120.18 <2e-16 ***
## s(time) 18.661 18.97 341.13 <2e-16 ***
## s(month) 9.225 10.00 59.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.998 Deviance explained = 99.8%
## -REML = -4713.1 Scale est. = 0.0048907 n = 3969
##
## Method: REML Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-8.666761e-06,6.552098e-06]
## (score -4713.132 & scale 0.004890686).
## Hessian positive definite, eigenvalue range [4.412503,1982.628].
## Model rank = 68 / 68
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(ad_log10) 19.00 18.14 0.99 0.23
## s(sf_log10) 19.00 18.82 0.33 <2e-16 ***
## s(time) 19.00 18.66 0.14 <2e-16 ***
## s(month) 10.00 9.23 0.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are several important take-aways from this interesting model. All of the terms appear to be highly significant in relation to Bitcoin price, and appear to explain virtually all of its variation (over 99%). However, the first step is to evaluate whether the k basis functions have been adequate to describe the functional form of the coefficients. With the initial chosen values of k, there remains a cyclic pattern in the residuals, and the effective degrees of freedom (edf), particularly for time and log10 stock-to-flow, appear to be bumping up against the k values that were chosen. This likely means that the smooth functions for these parameters are too constrained to be able to explain their full functional form as displayed in the data. The next step in this case is to fit the model again, this time with a larger k (here k=100) for these features.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## price_log10 ~ s(ad_log10, k = 20) + s(sf_log10, k = 100) + s(time,
## k = 100) + s(month, bs = "cc", k = 12)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5681899 0.0005713 4495 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(ad_log10) 14.541 17.11 13.803 < 2e-16 ***
## s(sf_log10) 80.293 87.24 10.870 < 2e-16 ***
## s(time) 81.665 85.99 78.618 < 2e-16 ***
## s(month) 6.092 10.00 2.258 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.999 Deviance explained = 99.9%
## -REML = -7026.8 Scale est. = 0.0012954 n = 3969