Walk-Forward Testing

Walk-forward testing is the only honest way to validate a trading strategy on historical data. Gordon uses it exclusively — no random train/test splits.

Why Random Splits Fail

Financial time series have temporal structure — what happens today depends on what happened yesterday. Randomly splitting data into train/test sets destroys this structure:

Training data from 2023 can "leak" information about 2022 test data
Regime changes (bull → bear) are split across sets
The model learns patterns that don't exist in real-time trading

This is why strategies that look incredible in backtests often fail in production.
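The first point can be illustrated with bar indices alone, without any price data. This is a minimal sketch; the helper names (`random_split`, `chronological_split`, `trains_on_future`) are illustrative and are not part of Gordon.

```python
import random

def random_split(n_bars, test_fraction=0.2, seed=0):
    """Shuffle bar indices, then cut: train and test end up interleaved in time."""
    idx = list(range(n_bars))
    random.Random(seed).shuffle(idx)
    n_test = int(n_bars * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def chronological_split(n_bars, test_fraction=0.2):
    """Cut once in time: every test bar comes after every train bar."""
    n_test = int(n_bars * test_fraction)
    cut = n_bars - n_test
    return list(range(cut)), list(range(cut, n_bars))

def trains_on_future(train_idx, test_idx):
    """True if any training bar is later than the earliest test bar."""
    return max(train_idx) > min(test_idx)

rand_train, rand_test = random_split(100)
chron_train, chron_test = chronological_split(100)
print("random split trains on future bars:", trains_on_future(rand_train, rand_test))
print("chronological split trains on future bars:", trains_on_future(chron_train, chron_test))
```

With a shuffled split, some training bars almost always come later than the earliest test bar, so the model is fitted on information from the test period's future.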

How Walk-Forward Works

Walk-forward testing simulates the real experience of a trader who optimizes parameters, then trades forward:

|--- Train Window ---|--- Test Window ---|
                     |--- Train Window ---|--- Test Window ---|
                                          |--- Train Window ---|--- Test Window ---|

For each fold:

Train on the historical window (fit parameters, calibrate)
Test on the immediately following window (measure out-of-sample performance)
Slide the window forward and repeat

Gordon uses 6-month folds, creating 11 rolling windows across the full dataset.
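The sliding procedure can be sketched as an index generator. The dataset length (72 monthly bars) and the equal 6-month train and test windows are assumptions chosen for illustration; the text only states the fold length and the fold count, and `walk_forward_folds` is a hypothetical helper, not Gordon's own code.

```python
from typing import Iterator, NamedTuple

class Fold(NamedTuple):
    train_start: int
    train_end: int   # exclusive
    test_start: int
    test_end: int    # exclusive

def walk_forward_folds(n_bars: int, train_size: int, test_size: int) -> Iterator[Fold]:
    """Yield rolling index ranges; each test window directly follows its train window."""
    start = 0
    while start + train_size + test_size <= n_bars:
        train_end = start + train_size
        yield Fold(start, train_end, train_end, train_end + test_size)
        start += test_size  # slide forward by one test window

# Assumed setup: 6 years of monthly bars, 6-month train and test windows.
folds = list(walk_forward_folds(n_bars=72, train_size=6, test_size=6))
for fold in folds[:3]:
    print(fold)
print(len(folds), "folds")
```

Under these assumed sizes the generator yields 11 folds, and no test bar is ever earlier than a bar in its own training window.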

What It Measures

Metric	What it tells you
Per-fold Sharpe	Is the edge consistent across time?
Consistency ratio	What fraction of folds are profitable?
Worst fold drawdown	How bad does it get in the worst period?
Cross-asset consistency	Does the strategy work on BTC, ETH, and SOL?
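The first three metrics could be computed from per-fold return series roughly as follows. This is a sketch under assumptions: returns are simple per-bar returns, the risk-free rate is zero, bars are daily (365 per year), and a fold counts as profitable when its Sharpe is above zero. The function names are illustrative. Cross-asset consistency would come from running the same summary once per asset.

```python
import math
from statistics import mean, stdev

def sharpe(returns, periods_per_year=365):
    """Annualized Sharpe ratio of per-bar returns (risk-free rate assumed to be 0)."""
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough drop of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def summarize(fold_returns):
    """Summarize out-of-sample returns, given one list of returns per fold."""
    sharpes = [sharpe(r) for r in fold_returns]
    return {
        "per_fold_sharpe": sharpes,
        "consistency_ratio": sum(s > 0 for s in sharpes) / len(sharpes),
        "worst_fold_drawdown": max(max_drawdown(r) for r in fold_returns),
    }

# Made-up returns for three tiny folds, for illustration only.
example = [[0.01, 0.02, -0.005], [-0.02, -0.01, 0.005], [0.10, -0.50, 0.10]]
print(summarize(example))
```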

Gordon's Criteria

For a strategy to pass walk-forward validation:

Criterion	Threshold
Walk-forward consistency	> 50% of folds have Sharpe > 0
Full-sample Sharpe	> 0.5 net of fees
Worst-fold drawdown	No fold with max drawdown > 50%
Multi-asset	Must work on at least 2 of BTC, ETH, SOL
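The table above can be read as a pass/fail check. The sketch below encodes the thresholds as stated; how Gordon defines "work" for the multi-asset rule is not specified, so it is assumed here to mean that an asset passes the other three criteria. `AssetResult`, `asset_passes`, and `strategy_passes` are hypothetical names.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class AssetResult:
    fold_sharpes: list[float]
    full_sample_sharpe: float        # net of fees
    fold_max_drawdowns: list[float]  # positive fractions, e.g. 0.35 means 35%

def asset_passes(r: AssetResult) -> bool:
    """Apply the three per-asset thresholds from the criteria table."""
    consistency = sum(s > 0 for s in r.fold_sharpes) / len(r.fold_sharpes)
    return (
        consistency > 0.5                    # more than half of folds have Sharpe > 0
        and r.full_sample_sharpe > 0.5       # full-sample Sharpe, net of fees
        and max(r.fold_max_drawdowns) <= 0.5 # no fold draws down more than 50%
    )

def strategy_passes(results: dict[str, AssetResult], min_assets: int = 2) -> bool:
    """Assumption: an asset 'works' if it passes all per-asset thresholds."""
    return sum(asset_passes(r) for r in results.values()) >= min_assets

# Made-up numbers for illustration only.
good = AssetResult([0.8, -0.2, 1.1, 0.4], 0.9, [0.10, 0.30, 0.20, 0.25])
deep_dd = AssetResult([0.8, 0.2, 1.1, 0.4], 0.9, [0.10, 0.60, 0.20, 0.25])
print(strategy_passes({"BTC": good, "ETH": good, "SOL": deep_dd}))
```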

Ablation Testing

Walk-forward tells you if the strategy works. Ablation tells you which parts of the strategy contribute to the edge:

Start with the full strategy (all components enabled)
Remove one component at a time
Re-run walk-forward without that component
Measure the Sharpe delta

If removing a component doesn't reduce Sharpe (or improves it), that component isn't contributing — remove it permanently.

Every overlay in Gordon must prove its lift via ablation before inclusion. No exceptions.
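The ablation loop can be sketched as below. `evaluate` stands in for a full walk-forward run that returns a Sharpe ratio for a given set of enabled components; it and the component names are placeholders, not Gordon's actual interface. The toy evaluator simply adds up made-up contributions so the example runs on its own.

```python
def ablation_deltas(components, evaluate):
    """Return Sharpe(full strategy) minus Sharpe(strategy without c) for each component c."""
    full = frozenset(components)
    baseline = evaluate(full)
    return {c: baseline - evaluate(full - {c}) for c in components}

# Toy stand-in for a walk-forward run: Sharpe is the sum of made-up contributions.
toy_contribution = {"trend": 0.6, "vol_filter": 0.2, "funding_overlay": -0.1}

def toy_evaluate(enabled):
    return sum(toy_contribution[c] for c in sorted(enabled))

deltas = ablation_deltas(list(toy_contribution), toy_evaluate)

# A non-positive delta means removing the component did not reduce Sharpe.
to_remove = sorted(c for c, d in deltas.items() if d <= 0)
print(deltas)
print("candidates for removal:", to_remove)
```

A positive delta means the component adds lift. In practice each `evaluate` call is a complete walk-forward run, so the loop costs one run per component plus the baseline.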

Common Pitfalls

Multiple comparison bias — testing 100 parameter combinations and picking the best one inflates results. Gordon addresses this by fixing parameters before walk-forward, not optimizing within it.
Survivorship bias — only testing assets that exist today. Gordon tests on assets with 5-8+ years of history to include bear markets.
Look-ahead bias — using information that wouldn't be available at trade time. Gordon enforces a 1-bar delay: signal on bar i, position on bar i+1.
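The 1-bar delay from the last item can be sketched as a shift of the signal series. It assumes `signals[i]` is computed from data available at the close of bar i, and that `bar_returns[i]` is the return earned over bar i. The function names are illustrative.

```python
def delayed_positions(signals):
    """Position held during bar i+1 is the signal from bar i; the first bar is flat."""
    return [0] + list(signals[:-1])

def strategy_returns(signals, bar_returns):
    """Per-bar strategy returns with the 1-bar delay applied."""
    positions = delayed_positions(signals)
    return [p * r for p, r in zip(positions, bar_returns)]

signals = [1, 1, 0, -1]                   # made-up signals: long, long, flat, short
bar_returns = [0.01, 0.02, -0.03, 0.04]   # made-up per-bar returns
print(delayed_positions(signals))
print(strategy_returns(signals, bar_returns))
```

Multiplying `signals[i]` directly by `bar_returns[i]` would credit the strategy with a return it could only have earned by knowing the signal before the bar began.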
