Bot warmup fallback

When to use: diagnosing a gordon-bot that is stuck warming up, or configuring degraded mode for a managed outage.

gordon-bot fetches its indicator windows from gordon-data via a single POST /warmup call at boot. Dual-source warmup (NATS retention + REST historical) was introduced in Wave 3 Phase 1 (2026-05-09). This page describes what the bot does when the warmup call returns anything other than a fully-complete payload, or when gordon-data is unreachable.

Why this matters

A bot that silently boots on partial data will trade on a stale or truncated indicator window. A 200-bar SMA with 150 bars is a different indicator; a funding-z computed on 3 of 8 observations is wrong, not "approximate". Live capital is at risk — "mostly there" is not an acceptable boot state.

Strict mode (default)

gordon-bot does not enter live trading unless all three conditions hold after the first /warmup response:

| Condition | Check |
| --- | --- |
| HTTP status | 200 OK |
| Every dataset complete | response.datasets[].is_complete == true for every entry |
| Every dataset fresh | response.datasets[].freshness_ts inside strategy staleness budget |
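The three conditions above can be sketched as a single gate function. This is a minimal illustration, not gordon-bot's actual code: the function name, the `kind` field on each dataset entry, the `budgets_ms` mapping, and the use of epoch milliseconds for `freshness_ts` are all assumptions. Only `datasets[].is_complete` and `datasets[].freshness_ts` come from this page.

```python
from typing import Any


def strict_warmup_failures(
    status_code: int,
    body: dict[str, Any],
    now_ms: int,
    budgets_ms: dict[str, int],
) -> list[str]:
    """Return the reasons strict mode would refuse to go Live (empty = pass)."""
    if status_code != 200:
        # Condition 1: anything other than 200 OK fails before the body is read.
        return [f"http status {status_code}"]

    failures: list[str] = []
    # Note: an empty datasets list passes vacuously in this sketch; a real gate
    # would also compare the served kinds against what the strategy requires.
    for ds in body.get("datasets", []):
        kind = ds.get("kind", "<unknown>")  # "kind" is an assumed field name
        # Condition 2: every dataset must be explicitly complete.
        if ds.get("is_complete") is not True:
            failures.append(f"{kind}: incomplete")
        # Condition 3: freshness_ts must sit inside the staleness budget.
        budget = budgets_ms.get(kind)  # kinds without a budget are not checked
        ts = ds.get("freshness_ts")
        if budget is not None and (ts is None or now_ms - ts > budget):
            failures.append(f"{kind}: stale")
    return failures
```

Returning the full list of reasons, rather than a bare boolean, keeps the crash-loop log line useful when more than one dataset is at fault.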

Any other outcome — HTTP 503, any is_complete=false, any stale freshness_ts — causes the bot to crash-loop with exponential backoff:

  • Initial retry interval: 1s
  • Growth: the interval doubles after each failed attempt, capped at 60s
  • Jitter: 20% of the current interval

The bot process does not skip the check and does not proceed with what it has. It exits non-zero so the container restarts. gordon-manager observes the restart count and surfaces a "bot stuck warming up" alert when the retry count crosses 5 within a 5-minute window.
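For reference, the retry schedule described above works out as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the jitter is modelled as a symmetric spread of plus or minus 20% around the base interval, which this page does not specify.

```python
import random


def backoff_interval(attempt: int, initial_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Base delay before retry number `attempt` (0-based): doubles, then caps."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    return min(initial_s * (2 ** min(attempt, 32)), cap_s)


def jittered_interval(attempt: int, rng: random.Random, jitter: float = 0.20) -> float:
    """Base delay with +/-20% jitter (the symmetric spread is an assumption)."""
    base = backoff_interval(attempt)
    return base * (1.0 + rng.uniform(-jitter, jitter))
```

Without jitter the first eight delays are 1, 2, 4, 8, 16, 32, 60, 60 seconds, so a bot that never recovers reaches the 60s cap on its seventh retry.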

Strict mode is the production default. Do not disable it on a bot running real capital.

Dual-source warmup (Wave 3 Phase 1, 2026-05-09)

Since Wave 3 Phase 1, the bot's warmup phase uses two sources in order:

  1. NATS JetStream retention: replays recent candles from the gordon-bus stream's retention window (168h). Fast path; no REST call needed for recently-active symbols.
  2. REST historical: POST /warmup on gordon-data. Used for symbols not yet in NATS retention, or when NATS retention is insufficient to satisfy the strategy's lookback window.

If NATS retention covers the full lookback, the REST call is skipped entirely. If NATS retention is partial, the REST call fills the gap. The bot always validates the merged result against the same strict-mode criteria before transitioning to Live.
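The split between the two sources can be illustrated with a small planning helper. This is a hedged sketch: `WarmupPlan` and `plan_warmup` are hypothetical names, and the real bot may reason in bars per timeframe rather than in hours. Only the 168h retention figure is taken from this page.

```python
from dataclasses import dataclass

NATS_RETENTION_HOURS = 168  # retention window of the gordon-bus stream


@dataclass(frozen=True)
class WarmupPlan:
    nats_hours: float  # span replayed from NATS retention
    rest_hours: float  # span POST /warmup has to fill (0 means REST is skipped)

    @property
    def needs_rest(self) -> bool:
        return self.rest_hours > 0


def plan_warmup(lookback_hours: float, nats_available_hours: float) -> WarmupPlan:
    """Split a lookback window between NATS replay and REST historical."""
    # A symbol can have less history in the stream than the full retention
    # window (for example, a newly listed symbol), so take the smallest bound.
    usable = max(0.0, min(nats_available_hours, NATS_RETENTION_HOURS, lookback_hours))
    return WarmupPlan(nats_hours=usable, rest_hours=lookback_hours - usable)
```

For example, a 100-hour H1 lookback on a symbol with full retention needs no REST call, while a 200-bar D1 lookback (4800 hours) always needs REST to fill the 4632 hours that retention cannot cover. Either way, the merged window still has to pass the strict-mode checks.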

Freshness checks

Each datasets[].freshness_ts in the warmup response is the latest data-clock timestamp in the returned window — not the wall-clock moment gordon-data last wrote a row. Wall-clock liveness is a separate signal served by /sources/health. The two answer different questions and must not be conflated.

Staleness budgets per dataset kind:

| Dataset kind | Staleness budget | Rationale |
| --- | --- | --- |
| spot_klines (D1) | 26h | One bar + 2h slack for clock skew |
| spot_klines (H1) | 70 min | One bar + 10 min slack |
| perp_klines | Same as spot equivalent | |
| funding_rates | 9h | One 8h window + 1h slack |
| open_interest | 70 min | Hourly cadence |
| long_short_ratio | 70 min | Hourly cadence |
| fear_greed | 26h | Daily cadence |
| macro | 26h | FRED publishes daily on business days |
| liquidations | N/A (event stream) | Quiet markets are legitimate |

A bot finding freshness_ts older than its budget must treat the dataset as unavailable — same outcome as is_complete=false.
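A staleness lookup built from the budgets above might look like the following. It is a sketch only: the names are hypothetical, the `timeframe` argument is an assumed way of telling D1 from H1 klines apart, and `freshness_ts` is assumed to be epoch milliseconds on the data clock.

```python
from typing import Optional

HOUR_MS = 3_600_000
MIN_MS = 60_000

# Budgets copied from the table above; None means "no staleness check".
STALENESS_BUDGET_MS: dict[tuple[str, Optional[str]], Optional[int]] = {
    ("spot_klines", "D1"): 26 * HOUR_MS,
    ("spot_klines", "H1"): 70 * MIN_MS,
    ("funding_rates", None): 9 * HOUR_MS,
    ("open_interest", None): 70 * MIN_MS,
    ("long_short_ratio", None): 70 * MIN_MS,
    ("fear_greed", None): 26 * HOUR_MS,
    ("macro", None): 26 * HOUR_MS,
    ("liquidations", None): None,  # event stream: a quiet market is legitimate
}


def budget_ms(kind: str, timeframe: Optional[str] = None) -> Optional[int]:
    """Look up the staleness budget; raises KeyError for an unknown dataset."""
    if kind == "perp_klines":
        kind = "spot_klines"  # perp klines share the spot budget per timeframe
    return STALENESS_BUDGET_MS[(kind, timeframe)]


def is_fresh(kind: str, freshness_ts_ms: int, now_ms: int,
             timeframe: Optional[str] = None) -> bool:
    """True when the dataset's data-clock timestamp is within its budget."""
    budget = budget_ms(kind, timeframe)
    if budget is None:
        return True
    return now_ms - freshness_ts_ms <= budget
```

Raising on an unknown dataset kind, instead of defaulting to "fresh", matches the page's stance that a missing answer must never be read as a pass.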

Degraded mode (opt-in)

For managed outages (data-provider incident, planned maintenance, testnet-only validation), the bot can be booted with:

```bash
GORDON_BOT_WARMUP_MODE=degraded
```

Behavior:

  • Bot proceeds after /warmup response even when is_complete=false or some datasets are missing.
  • Bot logs a structured warning with event=warmup_degraded and the full list of incomplete datasets.
  • Bot refuses to open new positions for the duration of the session. Existing positions are maintained and managed (stops, take-profits, reconciliation), but no new entries are emitted.
  • On the next restart, strict mode returns unless the env var is still present. Degraded mode does not persist across restarts without explicit operator action.

Degraded mode is a last resort. Log a written justification with the incident and clear the env var as soon as upstream data recovers.
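The mode switch and the entry block can be sketched as below. The function names are hypothetical, and two details are assumptions: that the env var value is matched case-sensitively, and that the block on new entries applies for the whole degraded session regardless of how complete the data turned out to be.

```python
import os
from typing import Mapping


def warmup_mode(env: Mapping[str, str] = os.environ) -> str:
    """Strict unless the operator explicitly set GORDON_BOT_WARMUP_MODE=degraded."""
    # Any other value, including a typo, falls back to strict (assumed behavior).
    return "degraded" if env.get("GORDON_BOT_WARMUP_MODE") == "degraded" else "strict"


def may_open_position(mode: str) -> bool:
    """New entries are allowed only in a strict-mode session."""
    return mode == "strict"


def may_manage_position(mode: str) -> bool:
    """Stops, take-profits, and reconciliation keep running in both modes."""
    return True
```

Because the mode is read from the environment on every boot, removing the variable is enough to return the next restart to strict mode.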

Restart semantics

  • Warmup runs once, on bot boot. Success transitions the bot from Warming to Live.
  • If gordon-data becomes unreachable mid-session, the bot does not re-warmup. It continues running on the already-fetched window. Live candle data arrives independently via NATS subscription and rolls the indicator state forward.
  • A process-level restart (container crash, OOM, deploy) re-runs warmup from scratch.
  • Staged deploys (green/blue) must warm the green instance before cutover. A cold green bot that fails warmup must not take traffic.

Console signal

On the first successful warmup in strict mode, the bot publishes:

topic: bot:<bot_id>:warmup
payload: {
  "status": "ready",
  "trace_id": "<uuid>",
  "served_at_ts": <ms>,
  "datasets_served": [<kind>, ...],
  "warnings": []
}

gordon-manager listens for this event and flips the bot dashboard state from Warming to Live. Absence of the event after the manager's grace window (default 60s) surfaces a "warmup never completed" alert — independent of the bot's crash-loop backoff.

On a degraded-mode boot, status is "degraded" and the warnings array carries every partial-dataset notice. The manager dashboard distinguishes the two states.
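Building the two payload variants could look like this. It is an illustrative sketch: `warmup_event` is a hypothetical helper, and it assumes `trace_id` and `served_at_ts` are passed in by the caller instead of being generated at publish time. The topic format and field names are the ones shown above.

```python
import json
from typing import Optional, Sequence


def warmup_event(
    bot_id: str,
    trace_id: str,
    served_at_ts: int,
    datasets_served: Sequence[str],
    warnings: Optional[Sequence[str]] = None,
    degraded: bool = False,
) -> tuple[str, str]:
    """Return (topic, JSON payload) for the warmup console signal."""
    payload = {
        # Status follows the boot mode, not the presence of warnings.
        "status": "degraded" if degraded else "ready",
        "trace_id": trace_id,
        "served_at_ts": served_at_ts,  # milliseconds, per the <ms> placeholder
        "datasets_served": list(datasets_served),
        "warnings": list(warnings or []),
    }
    return f"bot:{bot_id}:warmup", json.dumps(payload)
```

A strict-mode call such as `warmup_event("bot-1", "trace-1", 1700000000000, ["spot_klines"])` yields status "ready" with an empty warnings array; passing `degraded=True` together with the partial-dataset notices yields the degraded variant.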

Troubleshooting stuck warmup

  1. Check gordon-data health:

    ```bash
    curl -fsS http://localhost:8081/healthz
    curl -fsS http://localhost:8081/sources/health | jq .
    ```

  2. Check the bot's crash-loop logs:

    ```bash
    docker compose logs gordon-bot --tail=50 | grep -E "warmup|WARN|ERROR"
    ```

  3. Verify NATS stream retention:

    ```bash
    nats stream info gordon-bus   # check "Messages" and "Bytes"; both should be non-zero
    ```

  4. If gordon-data is down but positions are open and you need the bot to maintain them, use degraded mode as a last resort. Log the justification.

Related endpoints

  • POST /warmup: the contract this page governs.
  • GET /sources/health: wall-clock freshness of gordon-data's ingestion. Answers "is upstream alive?"
  • GET /healthz and GET /readyz: service-level liveness. /readyz fails when any source is stale past its cadence, unless the service is still inside its boot warmup window.
