Fix unit-root bias in backtests

- A May 24, 2026 X thread said price-on-price backtests can produce spurious regressions, then showed a Python workflow to test and remove unit-root bias.
- The post’s key figure was about 52% sizing accuracy from volatility scaling, after using Augmented Dickey-Fuller tests, log returns, ARIMA and GARCH.
- The linked implementation and discussion remain on X, where the original post includes the code path and next-step references.

A widely shared X thread on May 24 set out a common failure in retail and semi-professional backtesting: regressing one price series on another can generate an apparently strong fit even when the relationship is statistically meaningless. The post described the problem as a unit-root issue and used the familiar symptom, an R-squared near 0.99, to show how non-stationary price levels can make a weak model look strong.

The thread then walked through a Python-based fix built around stationarity testing, return transformations and separate modeling of direction and volatility. The author said a volatility-scaled sizing approach lifted sizing accuracy to about 52%, compared with fixed sizing, according to the post on X.

### Why can a backtest look excellent and still be statistically wrong?

Unit-root processes are a standard warning in time-series analysis because non-stationary series can produce spurious regression results, even when the variables do not share a meaningful predictive relationship. The problem is most visible when analysts regress prices on prices instead of working with transformed series whose mean and variance are more stable over time. Princeton economists Markus Müller and Mark Watson, in a paper on spurious regression under strong dependence, described the same broad failure mode: persistent dependence can create significance where none exists. (nixtlaverse.nixtla.io)

The X thread’s practical claim matched that literature. A price series that drifts over time can mechanically line up with another drifting series, producing high in-sample fit, low p-values and a backtest that appears robust until it is traded live. The thread framed that as misleading inference rather than true signal, which is consistent with standard unit-root treatment in econometrics. (princeton.edu)

### How did the thread say to test for the problem?

The Augmented Dickey-Fuller test was the first checkpoint in the workflow described in the post.
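
As a rough illustration of both the symptom and the check, the sketch below simulates pairs of independent random walks, compares R-squared on levels with R-squared on first differences, and applies a simplified Dickey-Fuller regression. It is not the thread's code. The function names are hypothetical, the data are simulated rather than market prices, and the test omits the lag augmentation of a full ADF test and compares the statistic with the asymptotic 5% critical value (about -2.86 with a constant) instead of computing a p-value. In practice a library routine such as `adfuller` in statsmodels would normally be used.

```python
import math
import random


def random_walk(n, rng):
    """Cumulative sum of Gaussian shocks: a textbook unit-root series."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def diff(series):
    """First differences, the analogue of moving from prices to returns."""
    return [b - a for a, b in zip(series, series[1:])]


def ols(x, y):
    """Fit y = a + b*x and return (b, t-statistic of b, R-squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    rss = syy - b * sxy  # residual sum of squares
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b, b / se_b, 1.0 - rss / syy


def df_stat(series):
    """Dickey-Fuller t-statistic: regress diff(y)_t on y_{t-1} with a constant."""
    return ols(series[:-1], diff(series))[1]


DF_CRIT_5PCT = -2.86  # asymptotic 5% critical value (constant, no trend)


def simulate(n_pairs=200, n_obs=500, seed=0):
    """Average R-squared on levels and on changes, plus the share of walks
    for which the unit-root null is NOT rejected at the 5% level."""
    rng = random.Random(seed)
    r2_levels = r2_changes = 0.0
    flagged = 0
    for _ in range(n_pairs):
        x, y = random_walk(n_obs, rng), random_walk(n_obs, rng)
        r2_levels += ols(x, y)[2]
        r2_changes += ols(diff(x), diff(y))[2]
        if df_stat(x) > DF_CRIT_5PCT:  # statistic not negative enough to reject
            flagged += 1
    return r2_levels / n_pairs, r2_changes / n_pairs, flagged / n_pairs


if __name__ == "__main__":
    lvl, chg, share = simulate()
    print(f"mean R^2, levels:  {lvl:.3f}")
    print(f"mean R^2, changes: {chg:.3f}")
    print(f"share of walks flagged as unit-root: {share:.2f}")
```
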
The thread said an ADF p-value above 0.05 should be treated as a warning that the series still behaves like a unit-root process and should not be modeled in levels. Nixtla’s documentation describes the ADF test in the same terms: it is used to determine whether a unit root is present in a time series. (eml.berkeley.edu)

That step matters because the test is not a trading rule by itself. The ADF result is being used as a gatekeeper for model specification: if the input is non-stationary, the modeler changes the data before fitting predictive structure. That is the core fix the thread was trying to popularize.

### Why switch from prices to log returns?

Log returns are commonly used because they are first differences of log prices, and differencing removes much of the stochastic trend that creates the unit-root problem. (nixtlaverse.nixtla.io) The thread recommended transforming prices into log returns before any directional forecasting, which aligns with standard ARIMA and volatility-model practice in financial time series. Examples in public ARIMA-GARCH tutorials and volatility-model documentation use returns, not raw prices, for exactly that reason.

The shift also changes the question being asked. A model on prices often ends up fitting drift and persistence; a model on returns is closer to asking whether there is any short-horizon directional edge after the trend has been removed. That distinction is where many backtests fail.

### Why combine ARIMA with GARCH instead of using one model?

ARIMA and GARCH solve different problems. (johal.in) The thread used ARIMA for direction and GARCH for volatility, separating the mean process from the variance process rather than forcing one model to do both. Nixtla’s GARCH documentation and other Python examples describe GARCH as a way to capture time-varying conditional variance, the clustering in volatility that is common in financial returns. (numberanalytics.com)

That division is also what made the post’s sizing claim notable.
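
The split between a mean model and a variance model can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the thread's implementation: an AR(1) fitted by least squares stands in for a full ARIMA model, and the GARCH(1,1) recursion is run with assumed parameters (`omega`, `alpha`, `beta`) rather than parameters estimated by maximum likelihood, which a dedicated library would normally handle. All function names are hypothetical.

```python
import math


def fit_ar1(returns):
    """OLS fit of r_t = c + phi * r_{t-1} + e_t. Returns (c, phi, residuals)."""
    x, y = returns[:-1], returns[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    resid = [yi - (c + phi * xi) for xi, yi in zip(x, y)]
    return c, phi, resid


def garch11_variance(resid, omega, alpha, beta):
    """Run sigma2[t] = omega + alpha * resid[t-1]**2 + beta * sigma2[t-1].

    The recursion starts at the sample variance of the residuals. The result
    has len(resid) + 1 entries; the last one is the one-step-ahead forecast.
    """
    sigma2 = [sum(e * e for e in resid) / len(resid)]
    for e in resid:
        sigma2.append(omega + alpha * e * e + beta * sigma2[-1])
    return sigma2


def forecast_next(returns, omega=1e-6, alpha=0.08, beta=0.90):
    """One-step forecasts: mean from the AR(1), volatility from the GARCH filter.

    The default omega, alpha and beta are placeholders chosen for illustration.
    """
    c, phi, resid = fit_ar1(returns)
    sigma2 = garch11_variance(resid, omega, alpha, beta)
    mean_forecast = c + phi * returns[-1]   # direction comes from the mean model
    vol_forecast = math.sqrt(sigma2[-1])    # risk comes from the variance model
    return mean_forecast, vol_forecast
```
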
The author said volatility scaling improved sizing accuracy to roughly 52% versus fixed sizing, suggesting the edge was not only in forecasting sign but in changing exposure as predicted variance changed. The thread presented that as a position-sizing improvement, not a claim that ARIMA alone solved return prediction. (nixtlaverse.nixtla.io)

### What should a reader take from the implementation?

The practical sequence in the thread was straightforward: test stationarity, convert prices to log returns, fit a directional model on the transformed series, then fit a volatility model for risk and sizing. That workflow is standard enough to be recognizable to quantitative researchers, but the thread packaged it as a warning to backtesters who still rely on level regressions and headline R-squared. (nixtlaverse.nixtla.io)

The next step remains the linked code and discussion in the original X post from May 24, where the author included the Python implementation and the sizing comparison that drove the thread’s circulation. (nixtlaverse.nixtla.io)
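
For readers who want a concrete picture of the last two steps, here is a minimal sketch of the log-return transform and a volatility-scaled position. The exact sizing rule used in the thread is not described here, so the sketch assumes one common form, volatility targeting, in which size is a target volatility divided by the forecast volatility, capped at a maximum leverage. The names `target_vol` and `max_leverage` and their default values are hypothetical, and the sketch makes no attempt to reproduce the 52% figure.

```python
import math


def log_returns(prices):
    """Convert a price series to log returns: log(P_t / P_{t-1})."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def vol_scaled_position(mean_forecast, vol_forecast, target_vol=0.01, max_leverage=2.0):
    """Signed position for the next period.

    Direction comes from the sign of the mean forecast; size shrinks as the
    volatility forecast rises and is capped at max_leverage.
    """
    if vol_forecast <= 0:
        raise ValueError("vol_forecast must be positive")
    direction = (mean_forecast > 0) - (mean_forecast < 0)  # +1, 0 or -1
    size = min(max_leverage, target_vol / vol_forecast)
    return direction * size


def fixed_position(mean_forecast, size=1.0):
    """Baseline for comparison: same direction, constant size."""
    direction = (mean_forecast > 0) - (mean_forecast < 0)
    return direction * size
```

With these defaults, a positive mean forecast paired with a 2% volatility forecast gives a position of 0.5, while the same forecast under fixed sizing gives 1.0. The mean and volatility forecasts would come from whatever directional and volatility models were fitted on the return series.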
