
Backtesting investment strategies: methodology, limitations, and how to avoid overfitting
Backtesting is the process of applying an investment strategy to historical price data to estimate what its returns would have been. It is the primary tool for evaluating quantitative strategies before deploying them in live markets—and one of the most commonly misused. A credible backtest requires rigorous methodology and an honest reckoning with its limitations. An optimistic backtest, inflated by data mining or look-ahead bias, can appear compelling in presentation and fail in practice.
What backtesting is
A backtest simulates the trading decisions a strategy would have made, day by day or month by month, if it had been applied to historical data. For a simple momentum strategy, this might mean: at the end of each month, rank all assets by their twelve-month trailing return; buy the top third; sell the bottom third; repeat. The backtest calculates the hypothetical portfolio at each rebalancing point, tracks its value over time, and computes performance metrics—return, volatility, Sharpe ratio, maximum drawdown—across the test period.
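The monthly loop described above can be sketched in a few lines. This is an illustrative implementation only, not pfolio's production code: the function name, the equal-weighted long/short construction, and the use of month-end prices are all assumptions.

```python
import numpy as np
import pandas as pd

def backtest_momentum(prices: pd.DataFrame) -> pd.Series:
    """Hypothetical monthly momentum backtest: rank assets by twelve-month
    trailing return, go long the top third, short the bottom third.
    `prices` holds month-end prices, one column per asset (illustrative)."""
    monthly_ret = prices.pct_change()
    signal = prices.pct_change(12)          # twelve-month trailing return
    returns = []
    for date in signal.index[12:-1]:
        ranked = signal.loc[date].dropna().sort_values()
        n = len(ranked) // 3
        losers, winners = ranked.index[:n], ranked.index[-n:]
        # A signal formed at month-end `date` earns the NEXT month's return,
        # so the position never "sees" the return it is paid.
        nxt = monthly_ret.index[monthly_ret.index.get_loc(date) + 1]
        returns.append(monthly_ret.loc[nxt, winners].mean()
                       - monthly_ret.loc[nxt, losers].mean())
    return pd.Series(returns, index=signal.index[13:])
```

From the resulting return series, the usual metrics (annualised return, volatility, Sharpe ratio, maximum drawdown) follow directly.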
The key distinction in backtesting is between in-sample and out-of-sample periods. In-sample data is the data used to design and tune the strategy—the period the researcher sees while building the model. Out-of-sample data is a held-out period that the researcher does not access until the strategy is fully specified. A strategy that performs well on in-sample data has simply fitted the past; only a strategy that performs well on out-of-sample data provides meaningful evidence of a genuine edge.
How it works
A rigorous backtesting methodology follows a standard sequence. First, the strategy is specified in full before any data is examined—every rule, parameter, and trading condition is written down. Second, a portion of the available historical data is reserved as an out-of-sample hold-out. Third, the strategy is applied to the in-sample period to verify that it works as designed and to estimate performance. Fourth, the strategy—unchanged—is applied to the out-of-sample period. If in-sample and out-of-sample performance are broadly consistent, the backtest is more credible.
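The hold-out step can be sketched minimally. The 30% reserve fraction and the consistency tolerance below are illustrative choices, not prescriptions, and the deliberately crude check stands in for a fuller statistical comparison:

```python
import pandas as pd

def split_holdout(returns: pd.Series, holdout_frac: float = 0.3):
    """Reserve the most recent `holdout_frac` of history as an untouched
    out-of-sample hold-out; design and tune the strategy on the rest only."""
    cut = int(len(returns) * (1 - holdout_frac))
    return returns.iloc[:cut], returns.iloc[cut:]

def consistent(in_sample: pd.Series, out_sample: pd.Series,
               tol: float = 0.5) -> bool:
    """Crude consistency check (illustrative): the out-of-sample mean
    return should retain at least `tol` of the in-sample mean."""
    return bool(out_sample.mean() >= tol * in_sample.mean())
```

The essential discipline is in the ordering, not the code: the hold-out must be split off before any tuning, and the strategy must reach it unchanged.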
Walk-forward analysis extends this approach. Rather than a single in-sample/out-of-sample split, the test period is divided into a rolling sequence of windows: the strategy is optimised on a training window, tested on the following out-of-sample window, then the windows advance and the process repeats. Walk-forward results are more conservative than simple in-sample results because the strategy is continuously tested on data it has not seen, and the average of these out-of-sample periods is the reported performance.
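The rolling window scheme can be expressed as a small generator. The window lengths here are hypothetical; in practice they are chosen to balance estimation accuracy against the number of independent test periods:

```python
def walk_forward_windows(n_obs: int, train: int, test: int):
    """Yield (train_indices, test_indices) pairs for walk-forward analysis:
    optimise on `train` observations, evaluate on the next `test`
    observations, then advance both windows by `test` and repeat."""
    start = 0
    while start + train + test <= n_obs:
        yield (range(start, start + train),
               range(start + train, start + train + test))
        start += test
```

Concatenating the test-window results gives a continuous out-of-sample track record, which is what gets reported.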
Common biases inflate backtest results. Look-ahead bias occurs when the strategy inadvertently uses data that would not have been available at the time of the trading decision—for example, using end-of-month closing prices to generate a signal that was supposedly implemented at the open of the same day. Survivorship bias occurs when the backtest universe contains only companies or funds that survived to the present day, excluding those that failed or were delisted—this systematically overstates returns from stock-selection strategies. Transaction cost underestimation occurs when the backtest uses unrealistically low costs for bid-ask spreads and market impact.
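Two of these biases are mechanical and can be guarded against directly in code. The sketch below, with assumed names and an assumed 10-basis-point cost per unit of turnover, lags positions by one period so that a signal formed at the close of period t is only traded from t+1 onward, and charges a cost proportional to turnover:

```python
import pandas as pd

def net_strategy_returns(positions: pd.Series,
                         asset_returns: pd.Series,
                         cost: float = 0.001) -> pd.Series:
    """Sketch of two bias guards: lag positions one period to avoid
    look-ahead bias, and subtract an assumed 10 bp charge per unit of
    turnover to avoid transaction cost underestimation."""
    held = positions.shift(1).fillna(0.0)         # tradeable position only
    turnover = held.diff().abs().fillna(held.abs())
    return held * asset_returns - turnover * cost
```

Survivorship bias, by contrast, cannot be fixed in the backtest loop; it requires point-in-time universe data that includes delisted names.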
What the evidence shows
Harvey, Liu, and Zhu (2016) catalogued 316 factors proposed in the academic literature and showed that, once the sheer scale of data mining across academia and industry is accounted for, many of them fail reasonable significance hurdles. Their analysis suggested that a newly proposed factor should clear a t-statistic of at least 3.0, rather than the conventional 2.0, to be considered statistically credible. The study is a sobering benchmark against which the claims of published backtests should be judged.
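The t-statistic hurdle is straightforward to compute for a return series: t = (mean / standard deviation) × √n. A minimal sketch, with the function name assumed:

```python
import numpy as np

def mean_return_t_stat(returns: np.ndarray) -> float:
    """t-statistic of a strategy's mean return: t = mean / std * sqrt(n).
    Harvey, Liu, and Zhu argue for a hurdle near 3.0 rather than the
    conventional 2.0, given the extent of data mining in the literature."""
    r = np.asarray(returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(len(r)))
```

Note that the hurdle applies to out-of-sample evidence; an in-sample t-statistic above 3.0 is easy to manufacture by searching over enough variants.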
The decay of backtest performance is well documented. McLean and Pontiff (2016) studied 97 published anomalies and found that their returns declined by approximately 58% after publication, suggesting that part of the in-sample return reflected data mining and part was a genuine premium that investors arbitraged away once it became public. Anomalies that are costlier to arbitrage, for example those concentrated in small or illiquid stocks, tend to decay less.
Limitations and trade-offs
Even a methodologically impeccable backtest cannot account for structural regime changes. A strategy that worked over 1985–2015 was trained on a period of generally declining interest rates, globalising supply chains, and specific macroeconomic conditions. If the macro regime changes—as it did in 2022—past backtest performance may be a misleading guide to future behaviour. No amount of out-of-sample testing resolves this problem, because all historical data predates the regime change.
Backtests also do not capture the real-world experience of managing a strategy. In a backtest, the strategy is applied mechanically and without interruption. In live trading, investors face drawdowns that test conviction, risk-management interventions that deviate from the rules, and operational constraints that a backtest ignores. Living through a 30% drawdown, even one the backtest says will be recovered, is not the same as reading about it in a returns table.
Backtesting in pfolio
pfolio's investment methodology is grounded in documented factor premia—momentum, value, and carry—that have been validated in published academic literature across multiple asset classes, geographies, and time periods. The platform uses out-of-sample validation as a key criterion for any signal included in its selection process. Backtesting methodology is described in detail in how we build portfolios. Portfolio performance since live inception is tracked at pfolio Insights.
Related articles
- Overfitting in quantitative investing: why strategies that worked in backtests fail in practice
- Momentum investing: the evidence behind buying recent winners
- Systematic vs discretionary investing: rules, flexibility, and the evidence on which wins
- Sharpe ratio explained: measuring risk-adjusted portfolio returns

