Walk-Forward Testing
Walk-forward testing validates a trading strategy on historical data while preserving time order: every evaluation uses only data that precedes it. Gordon uses it exclusively, with no random train/test splits.
Why Random Splits Fail
Financial time series have temporal structure — what happens today depends on what happened yesterday. Randomly splitting data into train/test sets destroys this structure:
- Training data from 2023 can "leak" information about 2022 test data
- Regime changes (bull → bear) are scattered across both sets, so the test set never contains a regime the model has not already seen
- The model learns patterns that don't exist in real-time trading
This is one reason strategies that look incredible in backtests often fail in production.
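The leak described above can be made concrete with a small sketch. The function names here (`random_split`, `chronological_split`, `leaks_future`) are hypothetical helpers for illustration, not part of Gordon. Bars are represented by their integer position in time.

```python
import random


def random_split(n_bars, test_frac, seed=0):
    """Shuffle bar indices and split them, ignoring time order."""
    idx = list(range(n_bars))
    random.Random(seed).shuffle(idx)
    n_test = int(n_bars * test_frac)
    # Returns (train, test), each sorted for readability.
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def chronological_split(n_bars, test_frac):
    """Keep time order: the test set is the final slice of the data."""
    cut = n_bars - int(n_bars * test_frac)
    return list(range(cut)), list(range(cut, n_bars))


def leaks_future(train, test):
    """True if any training bar comes after the earliest test bar."""
    return max(train) > min(test)


random_leaks = leaks_future(*random_split(1000, 0.2))
chrono_leaks = leaks_future(*chronological_split(1000, 0.2))
```

With a random split, some training bars almost always fall after test bars, so the model is fitted on information from the "future" of the data it is later scored on. The chronological split never does this.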
How Walk-Forward Works
Walk-forward testing simulates the real experience of a trader who optimizes parameters, then trades forward:
|--- Train Window ---|--- Test Window ---|
      |--- Train Window ---|--- Test Window ---|
            |--- Train Window ---|--- Test Window ---|

For each fold:
- Train on the historical window (fit parameters, calibrate)
- Test on the immediately following window (measure out-of-sample performance)
- Slide the window forward and repeat
Gordon uses 6-month folds, creating 11 rolling windows across the full dataset.
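The train, test, slide loop can be sketched as an index generator. This is a minimal illustration rather than Gordon's implementation: `walk_forward_folds` and its parameters are hypothetical, window lengths are given in bars, and the default step of one test window is an assumption. The source does not state the training-window length or dataset size, so the example does not try to reproduce the 11 windows.

```python
def walk_forward_folds(n_bars, train_len, test_len, step=None):
    """Return a list of (train, test) index ranges in time order.

    Each test range starts on the bar immediately after its train range.
    `step` is how far the whole window slides between folds; by default it
    equals `test_len`, so test windows do not overlap (an assumption).
    """
    if step is None:
        step = test_len
    if min(train_len, test_len, step) <= 0:
        raise ValueError("train_len, test_len and step must be positive")

    folds = []
    start = 0
    while start + train_len + test_len <= n_bars:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        folds.append((train, test))
        start += step  # slide the window forward
    return folds


# Example: 10 bars, 4 for training, 2 for testing.
folds = walk_forward_folds(n_bars=10, train_len=4, test_len=2)
for train, test in folds:
    print(f"train {train.start}..{train.stop - 1}  test {test.start}..{test.stop - 1}")
```

Because every test range begins where its train range ends, no fold is ever evaluated on bars that precede its training data.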
What It Measures
| Metric | What it tells you |
|---|---|
| Per-fold Sharpe | Is the edge consistent across time? |
| Consistency ratio | What fraction of folds are profitable? |
| Worst fold drawdown | How bad does it get in the worst period? |
| Cross-asset consistency | Does the strategy work on BTC, ETH, and SOL? |
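The per-fold metrics in the table can be computed from each fold's out-of-sample returns. The sketch below rests on several assumptions not stated in the source: returns are simple per-bar returns, the risk-free rate is zero, annualization uses 365 bars per year (daily crypto bars), and a fold counts as profitable when its Sharpe is positive. All function names are hypothetical.

```python
import math
import statistics


def sharpe(returns, periods_per_year=365):
    """Annualized Sharpe ratio of per-bar returns (risk-free rate assumed 0)."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve (0.5 = 50%)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def summarize_folds(fold_returns, periods_per_year=365):
    """Compute per-fold Sharpe, consistency ratio, and worst fold drawdown."""
    sharpes = [sharpe(r, periods_per_year) for r in fold_returns]
    return {
        "per_fold_sharpe": sharpes,
        "consistency_ratio": sum(s > 0 for s in sharpes) / len(sharpes),
        "worst_fold_drawdown": max(max_drawdown(r) for r in fold_returns),
    }


# Two toy folds: one with positive mean return, one with negative.
summary = summarize_folds([[0.01, 0.02, -0.005], [-0.01, -0.02, 0.005]])
```

Cross-asset consistency would come from running the same summary separately per asset and comparing the results.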
Gordon's Criteria
For a strategy to pass walk-forward validation:
| Criterion | Threshold |
|---|---|
| Walk-forward consistency | > 50% of folds have Sharpe > 0 |
| Full-sample Sharpe | > 0.5 net of fees |
| Worst-fold max drawdown | ≤ 50% in every fold |
| Multi-asset | Must work on at least 2 of BTC, ETH, SOL |
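The criteria can be expressed as a simple gate. This is a sketch under assumptions: the function names are hypothetical, the drawdown row is read as "no fold may have a max drawdown above 50%", Sharpe values are assumed to be already net of fees, and drawdowns are fractions (0.5 = 50%).

```python
def passes_walk_forward(fold_sharpes, fold_drawdowns, full_sample_sharpe):
    """Apply the single-asset criteria to one asset's walk-forward results."""
    consistency = sum(s > 0 for s in fold_sharpes) / len(fold_sharpes)
    return (
        consistency > 0.5                # more than half the folds have Sharpe > 0
        and full_sample_sharpe > 0.5     # full-sample Sharpe, net of fees
        and max(fold_drawdowns) <= 0.5   # no fold draws down more than 50%
    )


def passes_multi_asset(per_asset_pass, min_assets=2):
    """True if the strategy passes on at least `min_assets` assets."""
    return sum(per_asset_pass.values()) >= min_assets


# Toy inputs for illustration only; these are not real results.
btc_ok = passes_walk_forward([1.0, 0.3, -0.2], [0.10, 0.20, 0.30], 0.8)
overall = passes_multi_asset({"BTC": btc_ok, "ETH": True, "SOL": False})
```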
Ablation Testing
Walk-forward tells you if the strategy works. Ablation tells you which parts of the strategy contribute to the edge:
- Start with the full strategy (all components enabled)
- Remove one component at a time
- Re-run walk-forward without that component
- Measure the Sharpe delta
If removing a component doesn't reduce Sharpe (or improves it), that component isn't contributing — remove it permanently.
Every overlay in Gordon must prove its lift via ablation before inclusion. No exceptions.
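The ablation loop can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for a full walk-forward run that returns a Sharpe ratio for a given set of enabled components. The component names and lift values in the toy example are invented for illustration and are not Gordon's.

```python
def ablation_deltas(components, evaluate):
    """Return (baseline Sharpe, {component: Sharpe lost when it is removed})."""
    full = frozenset(components)
    baseline = evaluate(full)  # full strategy, all components enabled
    deltas = {name: baseline - evaluate(full - {name}) for name in components}
    return baseline, deltas


def non_contributing(deltas, min_lift=0.0):
    """Components whose removal does not reduce Sharpe by more than `min_lift`."""
    return sorted(name for name, delta in deltas.items() if delta <= min_lift)


# Toy evaluator: pretends each component adds a fixed amount of Sharpe.
TOY_LIFT = {"trend": 0.40, "vol_filter": 0.10, "funding_overlay": -0.05}


def toy_evaluate(enabled):
    return sum(TOY_LIFT[name] for name in sorted(enabled))


baseline, deltas = ablation_deltas(list(TOY_LIFT), toy_evaluate)
to_remove = non_contributing(deltas)
```

In the toy data, removing `funding_overlay` raises Sharpe, so it is flagged for removal. In practice the deltas are noisy, so a `min_lift` above zero may be a more conservative bar for keeping a component.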
Common Pitfalls
- Multiple-comparisons bias: testing 100 parameter combinations and picking the best one inflates results. Gordon addresses this by fixing parameters before walk-forward rather than optimizing within it.
- Survivorship bias: testing only assets that still exist today. Gordon tests on assets with 5-8+ years of history so the sample includes bear markets, though long-lived assets are themselves survivors, so this widens regime coverage rather than removing the bias.
- Look-ahead bias: using information that wouldn't be available at trade time. Gordon enforces a 1-bar delay: a signal computed on bar i sets the position for bar i+1.
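The 1-bar delay amounts to shifting the signal series by one position before multiplying by returns. The sketch below assumes `asset_returns[i]` is the return earned over bar i and that signals take values such as -1, 0, or 1. Both function names are hypothetical.

```python
def positions_from_signals(signals):
    """Shift signals by one bar: the signal on bar i sets the position on bar i+1.

    The first bar has no prior signal, so its position is flat (0).
    """
    if not signals:
        return []
    return [0] + list(signals[:-1])


def strategy_returns(signals, asset_returns):
    """Per-bar strategy returns using delayed positions (fees ignored)."""
    positions = positions_from_signals(signals)
    return [p * r for p, r in zip(positions, asset_returns)]


positions = positions_from_signals([1, 0, -1])
returns = strategy_returns([1, 1, 0], [0.1, 0.2, 0.3])
```

Multiplying the unshifted signal by the same bar's return would credit the strategy with a move it could only have known about after the bar closed.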