Walk-Forward Testing

Walk-forward testing is the only honest way to validate a trading strategy on historical data. Gordon uses it exclusively — no random train/test splits.

Why Random Splits Fail

Financial time series have temporal structure — what happens today depends on what happened yesterday. Randomly splitting data into train/test sets destroys this structure:

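  • Training data from 2023 can "leak" information into a 2022 test set: the model has already seen what happened after the period it is tested on
  • Regime changes (bull → bear) are split across both sets, so the model is tested on regimes it has already trained on
  • The model learns patterns that are only visible in hindsight and don't exist in real-time trading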

This is why strategies that look incredible in backtests often fail in production.
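
The leak can be made concrete by counting how many training bars come after the earliest test bar. The sketch below is illustrative only: the function names, the 1,000-bar series, and the 20% test fraction are assumptions, not part of Gordon.

```python
import random

def random_split(n_bars, test_fraction=0.2, seed=0):
    """Shuffle bar indices and split them, ignoring time order."""
    indices = list(range(n_bars))
    random.Random(seed).shuffle(indices)
    n_test = int(n_bars * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])  # (train, test)

def chronological_split(n_bars, test_fraction=0.2):
    """Train on the earliest bars, test on the most recent ones."""
    cut = n_bars - int(n_bars * test_fraction)
    return list(range(cut)), list(range(cut, n_bars))

def count_future_leaks(train, test):
    """Number of training bars that occur after the earliest test bar."""
    first_test = min(test)
    return sum(1 for i in train if i > first_test)

train_r, test_r = random_split(1000)
train_c, test_c = chronological_split(1000)
leaks_random = count_future_leaks(train_r, test_r)  # large: most training bars are "from the future"
leaks_chrono = count_future_leaks(train_c, test_c)  # zero by construction
```

With a random split, most of the training set postdates some test bar. A chronological split never lets that happen.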

How Walk-Forward Works

Walk-forward testing simulates the real experience of a trader who optimizes parameters, then trades forward:

|--- Train Window ---|--- Test Window ---|
                     |--- Train Window ---|--- Test Window ---|
                                          |--- Train Window ---|--- Test Window ---|

For each fold:

  1. Train on the historical window (fit parameters, calibrate)
  2. Test on the immediately following window (measure out-of-sample performance)
  3. Slide the window forward and repeat

Gordon uses 6-month folds, creating 11 rolling windows across the full dataset.
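
A minimal sketch of the fold generation described above, assuming bars are addressed by integer index. The function name `walk_forward_folds` and its parameters are hypothetical. The source states only that folds are 6 months long, so the window sizes here are placeholders.

```python
def walk_forward_folds(n_bars, train_size, test_size, step=None):
    """Return a list of (train_range, test_range) index pairs for rolling windows.

    Each test window starts on the bar immediately after its train window ends.
    By default the windows slide forward by one test window per fold.
    """
    if step is None:
        step = test_size
    folds = []
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        folds.append((train, test))
        start += step
    return folds

# Toy example: 100 bars, 40-bar train window, 20-bar test window.
folds = walk_forward_folds(n_bars=100, train_size=40, test_size=20)
for train, test in folds:
    print(f"train {train.start}-{train.stop - 1}  test {test.start}-{test.stop - 1}")
```

Any trailing bars that cannot fill a complete test window are dropped rather than tested on a short fold. That is one possible design choice, not necessarily the one Gordon makes.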

What It Measures

| Metric | What it tells you |
| --- | --- |
| Per-fold Sharpe | Is the edge consistent across time? |
| Consistency ratio | What fraction of folds are profitable? |
| Worst fold drawdown | How bad does it get in the worst period? |
| Cross-asset consistency | Does the strategy work on BTC, ETH, and SOL? |
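
The single-asset metrics can be sketched as follows. This assumes each fold is a list of per-bar simple returns, a zero risk-free rate, and daily bars in a market that trades 365 days a year. It also treats a fold as "profitable" when its Sharpe is above 0, matching the pass criterion. None of the function names come from Gordon.

```python
import math
import statistics

def sharpe(returns, periods_per_year=365):
    """Annualized Sharpe ratio of per-bar returns (risk-free rate assumed to be 0)."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def summarize_folds(fold_returns, periods_per_year=365):
    """Compute per-fold Sharpe, consistency ratio, and worst fold drawdown."""
    sharpes = [sharpe(r, periods_per_year) for r in fold_returns]
    return {
        "per_fold_sharpe": sharpes,
        "consistency_ratio": sum(s > 0 for s in sharpes) / len(sharpes),
        "worst_fold_drawdown": max(max_drawdown(r) for r in fold_returns),
    }

# Made-up returns for three tiny folds, purely to exercise the functions.
summary = summarize_folds([
    [0.01, 0.02, -0.005],
    [-0.01, -0.02, 0.005],
    [0.1, -0.5, 0.2],
])
```

Cross-asset consistency would come from running the same summary once per asset and comparing the results.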

Gordon's Criteria

For a strategy to pass walk-forward validation:

| Criterion | Threshold |
| --- | --- |
| Walk-forward consistency | > 50% of folds have Sharpe > 0 |
| Full-sample Sharpe | > 0.5 net of fees |
| Worst single fold | No fold has a max drawdown > 50% |
| Multi-asset | Must work on at least 2 of BTC, ETH, SOL |
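
One way to express these thresholds as a check is sketched below. The source does not say how the criteria combine across assets. This sketch assumes an asset "works" when it meets the first three criteria on its own, and that the strategy passes when at least two assets work. The function name and input layout are hypothetical.

```python
def passes_walk_forward(per_asset, min_consistency=0.5, min_sharpe=0.5,
                        max_fold_drawdown=0.5, min_assets=2):
    """Return (passed, passing_assets).

    per_asset maps an asset symbol to a dict with:
      "fold_sharpes"       - Sharpe ratio of each fold
      "full_sample_sharpe" - Sharpe over the whole sample, net of fees
      "fold_drawdowns"     - max drawdown of each fold, as a fraction
    """
    passing = []
    for asset, m in per_asset.items():
        consistency = sum(s > 0 for s in m["fold_sharpes"]) / len(m["fold_sharpes"])
        ok = (
            consistency > min_consistency
            and m["full_sample_sharpe"] > min_sharpe
            and max(m["fold_drawdowns"]) <= max_fold_drawdown
        )
        if ok:
            passing.append(asset)
    return len(passing) >= min_assets, passing

# Made-up numbers for illustration only.
results = {
    "BTC": {"fold_sharpes": [1.2, 0.4, -0.3], "full_sample_sharpe": 0.9,
            "fold_drawdowns": [0.10, 0.20, 0.30]},
    "ETH": {"fold_sharpes": [0.5, -0.2, -0.1], "full_sample_sharpe": 0.7,
            "fold_drawdowns": [0.10, 0.20, 0.30]},
    "SOL": {"fold_sharpes": [0.8, 0.6, 0.1], "full_sample_sharpe": 0.6,
            "fold_drawdowns": [0.20, 0.55, 0.10]},
}
passed, passing_assets = passes_walk_forward(results)
```

Here only BTC clears every threshold: ETH fails on consistency and SOL has one fold with a drawdown above 50%. The strategy is therefore rejected.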

Ablation Testing

Walk-forward tells you if the strategy works. Ablation tells you which parts of the strategy contribute to the edge:

  1. Start with the full strategy (all components enabled)
  2. Remove one component at a time
  3. Re-run walk-forward without that component
  4. Measure the Sharpe delta

If removing a component doesn't reduce Sharpe (or improves it), that component isn't contributing — remove it permanently.

Every overlay in Gordon must prove its lift via ablation before inclusion. No exceptions.
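
The ablation loop can be sketched as below. `run_walk_forward` stands in for whatever runs the full walk-forward test and returns a Sharpe ratio for a given set of enabled components. It is a hypothetical callable, and the component names and toy numbers are invented for illustration.

```python
def ablation_report(components, run_walk_forward):
    """Remove one component at a time and record the Sharpe lost without it.

    run_walk_forward(enabled) must return the walk-forward Sharpe for the
    given frozenset of enabled component names.
    """
    baseline = run_walk_forward(frozenset(components))
    report = {}
    for name in components:
        without = frozenset(c for c in components if c != name)
        delta = baseline - run_walk_forward(without)  # positive = component adds lift
        report[name] = {"sharpe_delta": delta, "keep": delta > 0}
    return baseline, report

# Toy stand-in for a real walk-forward run: each component adds a fixed amount.
TOY_LIFT = {"trend_filter": 0.4, "vol_target": 0.2, "funding_overlay": -0.1}

def toy_walk_forward(enabled):
    return 0.3 + sum(TOY_LIFT[c] for c in sorted(enabled))

baseline, report = ablation_report(list(TOY_LIFT), toy_walk_forward)
for name, row in report.items():
    print(f"{name}: delta={row['sharpe_delta']:+.2f} keep={row['keep']}")
```

In the toy run, removing `funding_overlay` raises Sharpe, so it is flagged for removal. Real components interact, so their deltas will not add up as neatly as they do here.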

Common Pitfalls

  • Multiple comparison bias — testing 100 parameter combinations and picking the best one inflates results. Gordon addresses this by fixing parameters before walk-forward, not optimizing within it.
  • Survivorship bias — only testing assets that exist today. Gordon tests on assets with 5-8+ years of history to include bear markets.
  • Look-ahead bias — using information that wouldn't be available at trade time. Gordon enforces a 1-bar delay: signal on bar i, position on bar i+1.
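
The 1-bar delay from the last bullet can be sketched as a simple shift. This assumes signals are target positions (for example 1, 0, -1) and that `bar_returns[i]` is the return earned over bar i. Both function names are hypothetical.

```python
def delayed_positions(signals):
    """Shift signals by one bar: the signal from bar i becomes the position on bar i+1."""
    if not signals:
        return []
    return [0] + list(signals[:-1])  # flat on the first bar, since no earlier signal exists

def strategy_returns(signals, bar_returns):
    """Per-bar strategy returns using positions delayed by one bar."""
    if len(signals) != len(bar_returns):
        raise ValueError("signals and bar_returns must have the same length")
    positions = delayed_positions(signals)
    return [p * r for p, r in zip(positions, bar_returns)]

signals = [1, 1, 0, -1]
bar_returns = [0.01, -0.02, 0.03, 0.01]
positions = delayed_positions(signals)             # [0, 1, 1, 0]
returns = strategy_returns(signals, bar_returns)   # bar 0 earns nothing; the final signal is never traded
```

Multiplying `signals[i]` by `bar_returns[i]` directly would credit the strategy with a move that had already happened when the signal was computed.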
