Gordon v7 error codes — operator remediation guide

Per-code sections for every ErrorCode variant in gordon-kernel/src/errors/codes.rs. These are the anchor targets linked from remediation_url fields in structured log lines and from the e2e runbook.

Developer reference (when-it-fires, category, severity): gordon-kernel/src/errors/codes.rs.

EXECUTOR codes

EXECUTOR_UNAUTHORIZED

Severity: Warning | Category: Safety

X-Operator-Token header was absent, empty, or incorrect on a protected endpoint (POST /clear-quarantine, POST /flatten).

Causes: token not set in env (GORDON_EXECUTOR_OPERATOR_TOKEN); misconfigured client; token rotated without redeploying the caller.
Action: verify GORDON_EXECUTOR_OPERATOR_TOKEN in docker-compose.yml; if unset at startup the service returns 503 on all protected routes.
Escalate: if this happens in production and the token is correct, suspect a replay or a misconfigured proxy stripping headers.

EXECUTOR_CAP_REJECT_PER_ORDER

Severity: Error | Category: Safety

Intent notional exceeds max_notional_per_order. Order was not submitted.

Causes: strategy sizing math produced an oversized intent; cap configured too low for the current account size.
Action: first, check GORDON_EXECUTOR_MAX_NOTIONAL_USD_PER_ORDER in compose env. If sizing is correct, raise the cap. If the intent is genuinely oversized, fix the strategy's position-sizing formula.
Escalate: repeated rejects from the same bot indicate a sizing bug — quarantine the bot and review the strategy's volatility-target calculation.

EXECUTOR_CAP_REJECT_PER_BOT_DAILY

Severity: Error | Category: Safety

Intent would push this bot's UTC-day rolling notional past GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_PER_BOT. The cap resets at UTC midnight.

Reinstated 2026-05-17 (DP-01 reactivation). Optional-first design: when the env var is unset, the DailyNotionalGuard runs in warn-only mode — the gordon_executor_daily_notional_would_reject_total{scope="per_bot"} counter increments but no reject fires. When the env var is set, the invariant returns RejectionReason::DailyNotionalExceeded { scope: PerBot, .. } which stamps this code on the rejected intent row.

Causes: strategy fired too many entries this UTC day (genuine sizing drift vs original budget); per-bot ceiling configured too low for normal operating cadence; reconciler under-counted on restart.
Action: check gordon_executor_daily_notional_used_usd{bot_id=...} in Grafana for the bot's current consumption vs ceiling. Verify the day rollover happened correctly (UTC, not local time). If sizing is correct, raise the cap via redeploy with a higher GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_PER_BOT. If sizing is wrong, quarantine the bot and audit strategy params.
Escalate: if rejects fire within the first hour of UTC day rollover, the per-bot ceiling is structurally wrong — either too small for normal cadence, or the cap-reconcile-on-restart logic mis-counted carried-over fills.
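
The warn-only / enforce gate and the UTC-midnight reset described above can be sketched as follows. This is a minimal illustration, not the executor's actual DailyNotionalGuard: the field names, the Verdict enum, and the default_cap_usd used for the warn-only comparison are all assumptions (the source does not say which ceiling warn-only mode compares against), and f64 stands in for whatever decimal type the real code uses.

```rust
/// Outcome of one daily-notional check (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    /// Env var unset: the intent passes, but the would-reject counter is bumped.
    WouldReject,
    /// Env var set and ceiling exceeded: the intent is rejected.
    Reject,
}

struct DailyNotionalGuard {
    /// `None` models GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_PER_BOT being unset.
    enforced_cap_usd: Option<f64>,
    /// Assumed ceiling used only for the warn-only comparison.
    default_cap_usd: f64,
    /// UTC day index (unix seconds / 86_400) the running total belongs to.
    day: u64,
    used_usd: f64,
    would_reject_total: u64,
}

impl DailyNotionalGuard {
    fn new(enforced_cap_usd: Option<f64>, default_cap_usd: f64) -> Self {
        Self { enforced_cap_usd, default_cap_usd, day: 0, used_usd: 0.0, would_reject_total: 0 }
    }

    fn check(&mut self, now_unix_secs: u64, intent_notional_usd: f64) -> Verdict {
        // Unix time is UTC-based, so integer division yields the UTC day.
        let today = now_unix_secs / 86_400;
        if today != self.day {
            self.day = today;
            self.used_usd = 0.0; // reset at UTC midnight
        }
        let projected = self.used_usd + intent_notional_usd;
        let ceiling = self.enforced_cap_usd.unwrap_or(self.default_cap_usd);
        if projected <= ceiling {
            self.used_usd = projected;
            return Verdict::Allow;
        }
        match self.enforced_cap_usd {
            // Enforce mode: reject and leave the running total untouched.
            Some(_) => Verdict::Reject,
            // Warn-only mode: count the would-be reject, let the intent through.
            None => {
                self.would_reject_total += 1;
                self.used_usd = projected;
                Verdict::WouldReject
            }
        }
    }
}
```

Keying the reset on the unix-seconds day index avoids any dependence on the host's local time zone, which is the rollover mistake the Action line warns about.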

EXECUTOR_CAP_REJECT_GLOBAL_DAILY

Severity: Error | Category: Safety

Intent would push the global UTC-day rolling notional (sum across every bot) past GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_GLOBAL. The cap resets at UTC midnight.

Reinstated 2026-05-17 (DP-01 reactivation). Optional-first design: same warn-only / enforce gate as EXECUTOR_CAP_REJECT_PER_BOT_DAILY but keyed off GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_GLOBAL. When unset, gauges fire and the would-reject counter increments under scope="global" but no reject fires.

Causes: aggregate strategy capacity exceeded the daily ceiling (multiple bots all firing simultaneously); global ceiling configured too low for the total account size + strategy count.
Action: check gordon_executor_daily_notional_used_usd_global vs the configured ceiling. Sum across bots — if every bot is within its per-bot cap but they collectively exceed the global cap, either raise the global cap (redeploy) or pause one or more bots to free headroom.
Escalate: global cap rejects on a single-bot account indicate a misconfiguration (the global cap should equal the per-bot cap). Multi-bot rejects mean the strategy portfolio is over-allocated relative to the operator-declared daily risk budget; pause bots before raising the cap blindly.

EXECUTOR_RECONCILE_DRIFT

Severity: Error | Category: Safety

In-memory order state diverged from trading.orders during reconcile on restart.

Causes: executor crashed mid-fill update; DB write failure after Binance confirmed the order; concurrent writes (should be impossible — only executor writes this table).
Action: check trading.orders for rows with inconsistent status/filled_qty. Cross-reference with Binance account history (GET /fapi/v1/allOrders). Manually patch the diverged rows, then restart the executor.
Escalate: if reconcile drift occurs on every restart, there is a systematic write failure — check Postgres connectivity and disk space.

EXECUTOR_FLATTEN_FAILED

Severity: Critical | Category: Safety

A flatten operation (break-glass, operator, or risk-initiated) failed at the exchange layer.

Causes: Binance rejected the market order; connectivity loss to exchange; rate limit hit.
Action: check the error context field for the underlying cause. Verify open positions via GET /positions. Retry the flatten via POST /flatten or POST /executor/break-glass/flatten. Open positions are NOT automatically re-closed after this failure — manual intervention required.
Escalate: if flatten fails repeatedly, assume positions are open. Contact exchange support if API keys are the issue.

EXECUTOR_DB_WRITE_FAILED

Severity: Error | Category: Infra

A DB write during order submission or reconcile failed for a transient reason (connectivity, timeout) rather than a constraint violation.

Causes: Postgres connection loss; disk full on DB host; connection pool exhaustion.
Action: check Postgres connectivity from the executor container. Review trading.orders for rows that may have been partially written. Cross-reference with Binance to find orders submitted but not recorded.
Escalate: distinguish from SHARED_DB_CONSTRAINT_VIOLATION (uniqueness/FK errors). If DB writes fail persistently, stop the executor and reconcile manually.

EXECUTOR_FILL_TRACKER_FAILED

Severity: Error | Category: Infra

Fill tracker failed to acquire or maintain a Binance user-data WebSocket listen_key. Fill events are not being received for the affected network.

Causes: Binance API key revoked or expired; network connectivity to Binance WS endpoint; rate limit hit on POST /api/v3/userDataStream.
Action: check Binance API key validity. Review Binance status page for WS outages. Restart the executor to trigger a fresh listen_key acquisition. Fills that arrived during the outage will be reconciled on restart.

EXECUTOR_STARTUP_FAILED

Severity: Critical | Category: Safety

Fatal startup failure: configuration invalid, required env vars absent or unparseable, or an internal assertion failed during process initialisation.

Causes: missing GORDON_EXECUTOR_* env vars; invalid Binance credentials format; DB connection string malformed; port binding conflict.
Action: check container logs for the specific error. Validate env vars against .env.example. Verify DB is reachable from the executor container. Fix configuration and restart.

EXECUTOR_INTENT_REJECTED

Severity: Warning | Category: Safety

An intent was rejected by the invariant pipeline. Umbrella code — the reason_code structured field carries the finer identifier (max_notional_exceeded, funding_guard_exceeded, margin_sanity_exceeded, sl_missing_or_invalid, exchange_data_unavailable, missing_strategy, stale_fence, no_lease).

Causes: bot emitted an intent that exceeds a cap, missed a mandatory SL, raced the fence (stale bot), or the exchange-context lookup (mark price / funding / margin / leverage) failed just before submission.
Action: individual rejects are expected pipeline surface — alert on sustained rate per reason_code. exchange_data_unavailable spikes point at gordon-data degradation; other reasons point at the emitting bot's config or exchange state.

EXECUTOR_SUBMIT_FAILED

Severity: Warning | Category: Safety

Submitter composite outcome: one or both legs rejected at the exchange. Cancel-on-fail choreography recovered (SL was cancelled after entry fail, or entry cancelled after SL fail), or the orphan is left for reconcile to rescue. Distinct from EXECUTOR_EXCHANGE_REJECT which documents the raw -XXXX reject surface.

Causes: hard reject from Binance (invalid symbol, minimum notional violated, reduceOnly sizing wrong); network_not_configured on the intent's network; parallel submit raced with a reconfigure.
Action: check the reason field (entry_submission_failed, sl_submission_failed, sl_submission_failed_cancel_failed, network_not_configured). sl_submission_failed_cancel_failed leaves an orphan entry — verify reconcile on the next restart picks it up.

EXECUTOR_DB_TRANSIENT

Severity: Warning | Category: Infra

Transient DB error on a read / listen path: initial catch-up drain failed, PgListener reconnect, post-reconnect drain, or fence lookup failed. The consumer self-heals on the next NOTIFY; not pageable alone. Distinct from EXECUTOR_DB_WRITE_FAILED (write path, order did not get recorded).

Causes: Postgres restart, network blip between executor and DB, pool exhaustion.
Action: alert only on sustained rate (> 5 per minute). Check Postgres health + executor → DB connectivity.
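
One way to read "sustained rate (> 5 per minute)" is a sliding one-minute window over the log-line timestamps. The sketch below assumes that reading; the RateWindow type is hypothetical, and in practice the same rule would more likely live in the alerting stack than in Rust.

```rust
use std::collections::VecDeque;

/// Sliding-window event counter (hypothetical helper for this sketch).
struct RateWindow {
    window_secs: u64,
    threshold: usize,
    events: VecDeque<u64>,
}

impl RateWindow {
    fn new(window_secs: u64, threshold: usize) -> Self {
        Self { window_secs, threshold, events: VecDeque::new() }
    }

    /// Record one EXECUTOR_DB_TRANSIENT line seen at `now` (unix seconds).
    /// Returns true when more than `threshold` events fall inside the window.
    fn record(&mut self, now: u64) -> bool {
        self.events.push_back(now);
        // Drop events that have aged out of the window.
        while let Some(&oldest) = self.events.front() {
            if now.saturating_sub(oldest) >= self.window_secs {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len() > self.threshold
    }
}
```

With RateWindow::new(60, 5), five events inside a minute stay quiet and the sixth trips the alert; an isolated event a minute later does not.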

EXECUTOR_INTERNAL_ERROR

Severity: Error | Category: Infra

Internal invariant violation: serde serialisation failure, OpenAPI render failure, or a non-startup assertion failed. Should never fire in practice — when it does, the operator must investigate immediately.

Causes: a value that should always serialise (redacted config, compiled-in spec) failed; a type change broke a contract; process memory corruption.
Action: page immediately; restart the executor; open a bug.

EXECUTOR_BOOT_DEGRADED

Severity: Warning | Category: Safety

Executor booted into degraded mode: reconcile quarantined one or more networks so the intent consumer + fill tracker were NOT spawned, or a subsystem (bot-command consumer) failed to spawn. /readyz stays 503 until the operator clears the underlying condition.

Causes: startup reconcile exceeded the critical-anomaly ceiling (GORDON_EXECUTOR_MAX_CRITICAL_ANOMALIES); bot-command NOTIFY subscription handshake failed against Postgres.
Action: inspect trading.reconcile_runs for the anomaly breakdown. Operator investigates and, if safe, calls POST /clear-quarantine with a fresh confirm_token. For bot-command spawn failure, verify the DB pool.

EXECUTOR_SHUTDOWN_ERROR

Severity: Warning | Category: Infra

Abnormal shutdown / serve path: axum serve returned an error, drain budget exceeded, or a signal handler install failed at startup.

Causes: port already in use when binding (rare — caught earlier); OS-level signal delivery failure; background task hang exceeding the 30 s drain budget.
Action: inspect the preceding log lines for the underlying cause. Process is exiting; subsequent boot should come up clean unless the root cause (port conflict, stuck task) recurs.

EXECUTOR_BOT_COMMAND_FAILED

Severity: Warning | Category: Control

trading.bot_commands consumer failed to process a flatten command targeted at the executor, or the cursor commit after processing failed. Delivery is at-least-once — a failed commit redelivers the same row on the next NOTIFY; a failed process is absorbed by the idempotent flatten runner.

Causes: transient DB error while committing the cursor; flatten runner returned an error (which itself is logged as EXECUTOR_FLATTEN_FAILED).
Action: verify the flatten eventually completed (look for BotEvent::FlattenStepComplete on the bot_events stream for the targeted network). If the command redelivers indefinitely, investigate the cursor commit path.
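
The interaction between at-least-once delivery and an idempotent flatten can be sketched as below. The Book and Consumer types are hypothetical, and idempotency is modelled in the simplest way (flattening an already-flat book sends nothing); the real runner may achieve it differently.

```rust
use std::collections::HashMap;

/// Toy position book keyed by symbol (hypothetical for this sketch).
struct Book {
    positions: HashMap<String, f64>,
}

impl Book {
    /// Close every non-zero position; returns how many close orders were sent.
    /// Running it again on a flat book sends nothing, which makes it idempotent.
    fn flatten(&mut self) -> usize {
        let mut sent = 0;
        for qty in self.positions.values_mut() {
            if *qty != 0.0 {
                *qty = 0.0;
                sent += 1;
            }
        }
        sent
    }
}

/// Command consumer with a cursor over bot_commands row ids.
struct Consumer {
    cursor: i64,
}

impl Consumer {
    /// Process one command row. The cursor advances only when the commit
    /// succeeds, so a failed commit leaves the row eligible for redelivery.
    fn handle(&mut self, row_id: i64, book: &mut Book, commit_ok: bool) -> usize {
        if row_id <= self.cursor {
            return 0; // already committed, skip
        }
        let sent = book.flatten();
        if commit_ok {
            self.cursor = row_id;
        }
        sent
    }
}
```

In this model a failed commit on row 1 still flattens the book; the redelivered row 1 then sends zero orders and only advances the cursor.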

EXECUTOR_BREAK_GLASS_DENIED

Severity: Warning | Category: Safety

Break-glass endpoint rejected the request or the dispatched task failed after auth: auth fail, stale confirm timestamp (±60 s window), audit publish failure, or the background flatten task errored. Every variant is audited via BotEvent::BreakGlassInvoked.

Causes: bearer token mismatch or wrong Authorization header shape; operator's clock skew exceeded 60 s; IPC publish hiccup; ladder errored mid-flatten.
Action: correlate with the BreakGlassOutcome on the audit event (AuthFail, StaleConfirm, MalformedRequest, Accepted). A series of AuthFail + StaleConfirm is an intrusion signal — rotate the break-glass token immediately.
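
The ±60 s confirm window is a symmetric check around the server clock. A minimal sketch, assuming unix-second timestamps and an inclusive boundary (neither is confirmed by the source; the function name is hypothetical):

```rust
/// Width of the accepted clock-skew window on either side of server time.
const CONFIRM_WINDOW_SECS: u64 = 60;

/// Returns true when the operator-supplied confirm timestamp is within
/// the window of the server's clock, in either direction.
fn confirm_is_fresh(confirm_unix_secs: u64, server_unix_secs: u64) -> bool {
    confirm_unix_secs.abs_diff(server_unix_secs) <= CONFIRM_WINDOW_SECS
}
```

Because the check is symmetric, an operator clock running fast fails the same way as one running slow, which is why clock skew appears under Causes.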

EXECUTOR_RECONCILE_FIX_FAILED

Severity: Warning | Category: Safety

A single reconcile fix attempt failed (non-contract violation). The mismatch is recorded in trading.reconcile_runs; subsequent reconcile passes on restart, or the fill tracker's replay on WS reconnect, will bridge the gap. Contract violations escalate to EXECUTOR_RECONCILE_DRIFT instead.

Causes: placing an orphan SL failed at the exchange (insufficient margin, symbol delisted); synthesised-roundtrip INSERT violated a constraint; state mismatch left for operator review.
Action: inspect the anomaly field for the class + the error field for the underlying cause. If orphan-SL placement keeps failing, the operator must manually attach a SL on the exchange before clearing quarantine.

EXECUTOR_IPC_PUBLISH_FAILED

Severity: Warning | Category: Infra

Best-effort IPC publish failed: BotEvent::ReconcileComplete, BotEvent::ReconcileQuarantine, break-glass audit, or flatten step / completion event. The authoritative state lives in the DB row — audit bus absence does NOT roll back the action.

Causes: Postgres transient error; NOTIFY payload exceeded the Postgres limit (~8 kB); schema-mismatch between the emitted event and a consumer's decoder.
Action: cross-check against trading.reconcile_runs / trading.orders to confirm the action persisted. Alert on sustained rate (> 10 per minute) — that signals a durable IPC problem.

EXECUTOR_INVALID_REQUEST

Severity: Error | Category: Safety

HTTP request failed input validation on an operator endpoint: missing body, unknown network_scope, malformed confirm_token UUID, or a required field was absent. Distinct from EXECUTOR_UNAUTHORIZED (auth failure) — this fires after auth passes but the request body or query is malformed.

Causes: operator / automation sent a malformed payload; schema drift between console client and executor; typo in a CLI call to /flatten or /clear-quarantine.
Action: consult the field pointer in the ErrorResponse body + the OpenAPI spec at /docs. Fix the caller; retry.

EXECUTOR_FLATTEN_STEP_FAILED

Severity: Warning | Category: Safety

A single step inside the flatten driver produced a recoverable failure: per-symbol reduce-only limit submit/cancel errored, book fetch failed, position poll hiccupped, or the targeted network was not configured on this executor. The ladder retries on the next step; only a whole-flatten failure (driver-level abort) surfaces as EXECUTOR_FLATTEN_FAILED.

Causes: transient exchange error; rate-limit hit; symbol not tradable mid-flatten; operator invoked flatten with a network scope this executor has no keys for.
Action: inspect reason_code to localise (network_not_configured, book_fetch_failed, aggressive_limit_failed, market_fallback_failed, symbol_new_rejected, position_poll_failed, step_zero_cancel_failed). Isolated events self-heal; sustained per-symbol streams indicate an exchange issue.

EXECUTOR_FLATTEN_NO_TARGETS

Severity: Warning | Category: Safety / Observability

The flatten ladder dispatched but observed zero non-zero positions across every targeted network — the per-call list_positions() snapshot was empty, so the per-symbol loop ran zero iterations and no flatten trade was produced. Always operationally meaningful: either upstream (gordon-risk / trading.positions) is lying about exposure, the exchange snapshot is out of sync with what the operator expected, or a real flatten was wasted on an already-flat book.

Causes: stale trading.positions ghost row (cluster-A trigger-skip bug); drill harness dispatched without a fresh setup position; mock-binance race between fill broadcast and HTTP positionRisk read; operator invoked flatten on a network scope where no positions were open.
Action: cross-check the operator's expected position state against the exchange snapshot at the dispatch timestamp (Loki: EXECUTOR_FLATTEN_NO_TARGETS + network_scope + trace_id). Compare trading.positions WHERE qty != 0 vs /fapi/v2/positionRisk for the same (network, symbol) set. If they disagree, the trigger or producer is stale; if they agree but the operator expected exposure, the dispatch was misrouted (wrong network scope).

RISK codes

RISK_BREAKER_TRIPPED

Severity: Critical | Category: Safety

A circuit breaker tripped. The breaker context field names the specific breaker (e.g. DrawdownBreaker, VPINBreaker).

Causes: portfolio drawdown exceeded threshold (DrawdownBreaker); flash-crash VPIN spike (VPINBreaker); correlation density too high (CorrelationBreaker); macro event (MacroBreaker); connectivity loss (ConnectivityBreaker).
Action: first, check trading.risk_events for the breaker variant + timestamp. Second, verify the triggering metric has normalized. Then: POST /risk/resume with a reason field describing the investigation.
Escalate: if a breaker trips repeatedly within hours, the threshold may need tuning — file a plan entry before adjusting. Never raise thresholds under live stress.

RISK_FLATTEN_REQUESTED

Severity: Critical | Category: Safety

Risk service issued an emergency-flatten instruction to gordon-executor.

Causes: manual POST /emergency-flatten call; automatic escalation after a circuit breaker remained tripped past the escalation window.
Action: check trading.risk_audit_log for the flatten scope and reason. Verify all positions are closed on exchange (GET /fapi/v2/positionRisk). Do not resume until the triggering condition is understood and resolved.
Escalate: unexpected automatic flattens (no manual trigger) indicate an escalation state machine bug — check the EscalationManager state in risk service logs.

RISK_PAUSED

Severity: Warning | Category: Safety

Risk service paused one or more bots.

Causes: circuit breaker tripped and the breaker outcome is PauseBots (not flatten); manual POST /bots/:id/pause call.
Action: check trading.risk_events + bot_configs.status. Resume via POST /risk/resume once the triggering metric normalises.
Escalate: bots that stay paused > 24h are likely stuck in escalation — check trading.risk_audit_log and the EscalationManager state.

RISK_HALTED

Severity: Error | Category: Safety

The executor rejected a fresh order intent because the risk-halt latch (trading.risk_state.halted = TRUE) is engaged. The latch flips TRUE on POST /risk/emergency-flatten and on any circuit-breaker trip whose outcome is Flatten; it clears only on POST /risk/resume. Rejected intents carry order_intents.outcome = 'rejected' with outcome_reason = 'risk_halted'.

Full kill-switch contract (Parts 1, 2, 3 all live)

Operator (or breaker) triggers POST /emergency-flatten.
gordon-risk flips trading.risk_state.halted = TRUE atomically with the bot_commands + risk_events audit writes. The halted-column transition fires pg_notify('risk_halt_changed', ...).
gordon-executor's RiskHaltState watcher (LISTEN on risk_halt_changed plus a 5-second polling fallback) picks the flip up and updates its in-memory snapshot.
Every subsequent intent reaching the intent-consumer hits the halt gate before fence / invariant / submitter checks. The row is marked rejected with outcome_reason = 'risk_halted'; a BotEvent::IntentRejected is published; no exchange submission occurs.
Operator decides the system is safe to resume → POST /risk/resume. gordon-risk flips halted = FALSE; the trigger fires a risk_halt_changed NOTIFY; the executor watcher adopts the new value.
Fresh intents flow through the normal path again.

The executor fails closed: if its DB read against trading.risk_state errors (missing grant, table missing, transient pool failure), the snapshot flips back to halted = TRUE and retries every 5 seconds. An executor that cannot verify the halt state is safer halted than submitting blind.
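
The fail-closed rule and the position of the halt gate can be sketched as below. The type and function names are hypothetical, and starting in the halted state before the first successful read is an assumption of this sketch, not something the source states.

```rust
/// In-memory view of trading.risk_state.halted (hypothetical type).
struct RiskHaltSnapshot {
    halted: bool,
}

impl RiskHaltSnapshot {
    /// Assumption: treat the system as halted until the first good read.
    fn new() -> Self {
        Self { halted: true }
    }

    /// Apply the result of one NOTIFY-triggered or polled read.
    /// Any read error fails closed: the snapshot becomes halted.
    fn apply(&mut self, read: Result<bool, String>) {
        self.halted = read.unwrap_or(true);
    }
}

#[derive(Debug, PartialEq)]
enum GateOutcome {
    /// Intent continues to the fence / invariant / submitter checks.
    Proceed,
    /// Intent is marked rejected with outcome_reason = 'risk_halted'.
    RejectedRiskHalted,
}

/// The halt gate runs before any other check on a fresh intent.
fn halt_gate(snapshot: &RiskHaltSnapshot) -> GateOutcome {
    if snapshot.halted {
        GateOutcome::RejectedRiskHalted
    } else {
        GateOutcome::Proceed
    }
}
```

A transient pool failure therefore looks identical to an engaged latch from the intent's point of view, and clears on the next successful read that returns halted = FALSE.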

Causes: operator hit the kill switch (POST /emergency-flatten) or a circuit breaker tripped with a flatten outcome; the latch has not been cleared since.
Action: verify open positions are flat (GET /positions on executor). Check trading.risk_audit_log for the halt trace_id and the triggering cause. Once the condition is understood and safe to resume, POST /risk/resume with a reason to clear the latch. Until resume succeeds, every fresh intent will be rejected.
Escalate: if the latch re-engages immediately after resume, a breaker is stuck in a trip loop — check trading.risk_events for the triggering metric and pause the offending bot(s) before resuming. Never bypass the latch.
Diagnose: every executor rejection logs code="RISK_HALTED" with halt_trace_id equal to the trading.risk_audit_log row that engaged the latch — grep Loki for the trace id to correlate.

RISK_REASON_REQUIRED

Severity: Error | Category: Safety

reason field absent or blank on a POST /emergency-flatten, POST /risk/resume, or POST /bots/:id/pause request.

Causes: API caller omitted the reason field; sent an empty string.
Action: add a non-empty reason string to the request body. See the risk service API for the expected schema.
Escalate: if this fires from an automated caller, fix the caller to always include a reason describing the automated context.

RISK_REASON_TOO_LONG

Severity: Error | Category: Safety

reason field exceeds the 500-character limit on an operator risk endpoint request.

Causes: automated caller concatenating unbounded log context into the reason field.
Action: truncate the reason to ≤ 500 characters.
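
A caller-side guard covering both RISK_REASON_REQUIRED and RISK_REASON_TOO_LONG might look like the sketch below. The function names are hypothetical, and counting the limit in Unicode scalar values rather than bytes is an assumption; check the risk service schema for the exact rule.

```rust
/// Limit stated for the reason field on operator risk endpoints.
const MAX_REASON_CHARS: usize = 500;

#[derive(Debug, PartialEq)]
enum ReasonError {
    /// Absent or blank reason (maps to RISK_REASON_REQUIRED).
    Required,
    /// Reason over the limit (maps to RISK_REASON_TOO_LONG).
    TooLong,
}

/// Validate a reason before sending the request.
fn validate_reason(reason: &str) -> Result<(), ReasonError> {
    if reason.trim().is_empty() {
        return Err(ReasonError::Required);
    }
    if reason.chars().count() > MAX_REASON_CHARS {
        return Err(ReasonError::TooLong);
    }
    Ok(())
}

/// Truncate on a character boundary so multi-byte text is never split.
fn clamp_reason(reason: &str) -> String {
    reason.chars().take(MAX_REASON_CHARS).collect()
}
```

An automated caller can run clamp_reason on its log context and then validate_reason on the result, so that unbounded context never reaches the endpoint.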

RISK_INVALID_SCOPE

Severity: Error | Category: Safety

scope on POST /emergency-flatten is not one of the accepted variants ("all", "bot:<uuid>", "symbol:<SYMBOL>", "cluster:<id>").

Causes: typo in scope string; unsupported variant attempted.
Action: use one of the four accepted formats exactly. For bot-scoped flatten, use the UUID v7 from bot_configs.id.
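
The four accepted scope formats can be checked client-side before the request is sent. The sketch below uses a hypothetical Scope enum and parse_scope function; it checks only the prefix and that the value is non-empty, and does not validate the UUID or symbol itself.

```rust
/// The four scope variants accepted by the emergency-flatten endpoint.
#[derive(Debug, PartialEq)]
enum Scope {
    All,
    Bot(String),
    Symbol(String),
    Cluster(String),
}

/// Parse a scope string; returns None for anything outside the four formats.
fn parse_scope(raw: &str) -> Option<Scope> {
    if raw == "all" {
        return Some(Scope::All);
    }
    // Split on the first ':' only, so the value may itself contain ':'.
    let (kind, value) = raw.split_once(':')?;
    if value.is_empty() {
        return None;
    }
    match kind {
        "bot" => Some(Scope::Bot(value.to_string())),
        "symbol" => Some(Scope::Symbol(value.to_string())),
        "cluster" => Some(Scope::Cluster(value.to_string())),
        _ => None,
    }
}
```

The match is case-sensitive, so "ALL" or "Bot:..." are rejected in this sketch; whether the service is equally strict is not stated in the source.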

RISK_UNAUTHORIZED

Severity: Warning | Category: Safety

X-Operator-Token header absent, wrong, or token not configured on the risk service.

Causes: GORDON_RISK_OPERATOR_TOKEN not set; client sending wrong token; token rotated without updating the caller.
Action: verify GORDON_RISK_OPERATOR_TOKEN in docker-compose.yml. If token is not configured at startup, all protected endpoints return 503.

RISK_INVALID_BOT_ID

Severity: Error | Category: Safety

id path parameter on POST /bots/:id/pause is not a valid UUID v7.

Causes: caller sending a non-UUID id (integer, slug, truncated UUID).
Action: use the UUID v7 from bot_configs.id as the path parameter.
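
A quick caller-side shape check for a UUID v7 needs only the standard library. The sketch below assumes the canonical 36-character hyphenated form and inspects the version and variant characters; the function name is hypothetical, and the service may use a full UUID parser instead.

```rust
/// Returns true when `s` is a canonical hyphenated UUID whose version
/// nibble is 7 and whose variant is the RFC one (8, 9, a, or b).
fn looks_like_uuid_v7(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 36 {
        return false;
    }
    for (i, &c) in b.iter().enumerate() {
        let ok = match i {
            // Hyphens separate the 8-4-4-4-12 groups.
            8 | 13 | 18 | 23 => c == b'-',
            _ => c.is_ascii_hexdigit(),
        };
        if !ok {
            return false;
        }
    }
    // Byte 14 is the version nibble; byte 19 carries the variant bits.
    b[14] == b'7' && matches!(b[19], b'8' | b'9' | b'a' | b'b' | b'A' | b'B')
}
```

This rejects the integer, slug, and truncated-UUID cases listed under Causes, as well as well-formed UUIDs of another version.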

RISK_STARTUP_FAILED

Severity: Critical | Category: Infra

Fatal startup failure — Config::from_env rejected an env var, the Postgres pool could not be opened, or the serve loop returned an error. Process exits with ExitCode::FAILURE (or ExitCode::from(2) for config errors).

Causes: missing / malformed env var (e.g. GORDON_RISK_BIND_ADDR, GORDON_DATABASE_URL); Postgres unreachable; a migration / role-probe failed.
Action: read the structured error field in the log line — it carries the underlying anyhow::Error / sqlx::Error. Fix the config or the DB surface and restart. For DB errors, confirm gordon-migrate has run.
Escalate: a startup that never succeeds is a deployment blocker; the container orchestrator will crash-loop. Investigate immediately.

RISK_SHUTDOWN_ERROR

Severity: Warning | Category: Infra

Non-fatal shutdown-path surface: axum serve returned an error, the drain budget was exceeded, a signal-handler install failed, or the scheduler task join returned an error on teardown. Process is exiting anyway; log fidelity matters for the postmortem.

Causes: scheduler task panicked (code bug); in-flight request took longer than the drain budget; signal handler could not be installed (OS limits).
Action: capture the structured error field, cross-reference with any scheduler-panic stack trace printed earlier. Kept at WARN per op-07c; dashboards should alert only on sustained rate.
Escalate: scheduler-panic-at-shutdown is a code bug — file an issue with the stack trace and the commit SHA.

RISK_BOOT_DEGRADED

Severity: Warning | Category: Infra

Risk booted into a degraded mode. Two main surfaces:

Empty data_freshness (ConnectivityBreaker): no bot has ever posted a trading.bot_events row. The breaker returns Noop so cold-start does not fire an emergency, but this is unexpected if bots are supposed to be running.
FRED macro data absent (MacroBreaker): market_data.macro_data has no DXY / VIX rows. The breaker returns Noop; gordon-data's FRED fetch has not populated yet.

Action: confirm that gordon-bot instances are running (docker ps) and that trading.bot_events is non-empty. For macro data, check gordon-data's FRED ingestor (POST /warmup and the macro tables).
Escalate: > 10 min of RISK_BOOT_DEGRADED after all bots should be online means the data pipeline has a wiring gap.

RISK_DB_TRANSIENT

Severity: Warning | Category: Infra

Transient DB error on a read / listen path: PgListener connect failed, the LISTEN statement failed, recv returned an error, a bot_events row lookup hit a pool blip, or the scheduler snapshot query returned a sqlx::Error.

Causes: brief Postgres restart, pool exhaustion, TCP reset during a lingering connection. Self-heals — the listener reconnects after a 5s sleep, the scheduler retries on the next tick.
Action: check Postgres logs for a restart or a FATAL line around the timestamp. If the pattern persists, raise the pool size or investigate the network between risk and postgres.
Escalate: sustained RISK_DB_TRANSIENT bursts rolling every 5s indicate Postgres is not healthy — page the DB on-call.

RISK_SCHEDULER_TICK_FAILED

Severity: Error | Category: Safety

Scheduler decided to act (a breaker fired) but the risk_events + bot_commands transaction failed to commit. The commanded action did NOT reach executor or bots — they will not flatten / pause until the next cycle retries.

Causes: Postgres outage mid-transaction; constraint violation on the bot_commands row (should not happen with the current schema); a new breaker was added that produces an event_type the scheduler's match arm doesn't know.
Action: pageable. Read the error + breaker fields, confirm the commit did NOT land (SELECT * FROM trading.bot_commands WHERE trace_id = …). If positions are exposed and the breaker's intent was to flatten, trigger POST /emergency-flatten manually while you diagnose.
Escalate: immediate — a safety-critical write that didn't land is a protection gap.

RISK_SNAPSHOT_MISSING_VPIN

Severity: Error | Category: Data

Scheduler found an active position on a symbol for which market_data.metrics.vpin returns zero rows. The VPIN breaker (flash-crash kill switch) cannot evaluate blind, so the whole cycle is skipped — the portfolio is unprotected against VPIN-grade events until the gap closes.

Causes: gordon-data's derived_vpin.rs source is not running, is lagging more than one hour, or the symbol is genuinely new and VPIN has not been derived yet.
Action: pageable. Check gordon-data's /readyz and the VPIN source status (/data/sources). Verify the affected symbol appears in SELECT DISTINCT symbol FROM market_data.metrics WHERE vpin IS NOT NULL.
Escalate: every minute of missing VPIN on an active position is a minute of unprotected exposure. Do not widen this gap — pause the bot with the missing-VPIN symbol while the data source recovers.

RISK_ESCALATION_STEP_FAILED

Severity: Warning | Category: Safety

Best-effort audit row write in the escalation state machine failed. Affects three specific rows:

flatten_requested — logged when a flatten watcher starts.
flatten_completed — logged when FlattenStepComplete arrives on time.
lease_revoked — logged after the 30s timeout clears holder_bot_id.

The commanded action (flatten command on bot_commands; lease clear on bot_leases) is committed in a separate transaction and already succeeded — this log means the audit trail has a gap, not that the action rolled back.

Action: cross-reference trading.risk_audit_log with the trace_id field. If the row is truly missing, file a backfill ticket — the action side is already recorded in trading.bot_commands / trading.risk_events.
Escalate: sustained rate on this code degrades auditability. Check the error field for pattern (FK violation vs pool exhaustion).

RISK_ESCALATION_SUPPRESSED

Severity: Warning | Category: Safety

Escalation manager rejected a new flatten registration for one of two reasons:

In-flight: a watcher is already active for the same scope — second registration would collide. Retry-storm guard.
Cooldown: scope is within the 60-second post-completion cooldown; a fresh flatten right after a successful one is vacuous (portfolio already flat).

The caller's bot_commands row was already committed before the rejection — only the escalation tracking is a no-op.

Action: this is an expected operational surface during operator drill retries or breaker oscillation. If the rate is higher than expected, inspect trading.risk_audit_log for the firing cadence.
Escalate: a cooldown-guarded scope that fires repeatedly after 60s hints that positions are being re-opened faster than the flatten can close them — investigate the bot side.
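
The two suppression reasons can be modelled as a small per-scope state machine. This is a sketch under assumptions: the type names are hypothetical, timestamps are unix seconds, and the cooldown is treated as ending exactly 60 seconds after completion.

```rust
use std::collections::HashMap;

/// Post-completion cooldown stated in this section.
const COOLDOWN_SECS: u64 = 60;

#[derive(Debug, PartialEq)]
enum Registration {
    Accepted,
    SuppressedInFlight,
    SuppressedCooldown,
}

enum ScopeState {
    /// A watcher is active for this scope.
    InFlight,
    /// The last flatten for this scope completed at this unix time.
    CompletedAt(u64),
}

struct EscalationManager {
    scopes: HashMap<String, ScopeState>,
}

impl EscalationManager {
    /// Try to register a flatten watcher for `scope` at time `now`.
    fn register(&mut self, scope: &str, now: u64) -> Registration {
        let verdict = match self.scopes.get(scope) {
            Some(ScopeState::InFlight) => Registration::SuppressedInFlight,
            Some(ScopeState::CompletedAt(t)) if now.saturating_sub(*t) < COOLDOWN_SECS => {
                Registration::SuppressedCooldown
            }
            _ => Registration::Accepted,
        };
        if verdict == Registration::Accepted {
            self.scopes.insert(scope.to_string(), ScopeState::InFlight);
        }
        verdict
    }

    /// Mark the scope's flatten as completed, which starts the cooldown.
    fn complete(&mut self, scope: &str, now: u64) {
        self.scopes.insert(scope.to_string(), ScopeState::CompletedAt(now));
    }
}
```

Scopes are tracked independently in this model, so a suppressed "all" registration says nothing about a concurrent "symbol:..." one; whether the real manager treats overlapping scopes as colliding is not stated in the source.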

RISK_FLATTEN_TIMEOUT

Severity: Error | Category: Safety

Escalation watcher timed out: no FlattenStepComplete event arrived on the bot_events channel within the 30-second window. Risk cleared holder_bot_id on every active bot_leases row (UPDATE … SET holder_bot_id = NULL) so bots cannot re-acquire the lease and keep trading.

Note: risk does NOT bump the fence — that is executor's job when it processes the flatten command. This log fires when the executor is either dead or unable to reach the exchange; the safe action is to revoke leases on the risk side.

Action: pageable. Confirm executor is alive (GET /healthz on executor). Check executor logs for EXECUTOR_FLATTEN_FAILED / EXECUTOR_FLATTEN_STEP_FAILED at the same trace_id. Manually verify positions are flat on the exchange (GET /fapi/v2/positionRisk). If positions remain open, trigger a manual flatten via the Binance UI / API.
Escalate: immediate. Risk has given up and revoked leases; executor must be restored before any resume.

RISK_CONFIG_PARSE_FAILED

Severity: Warning | Category: Infra

A trading.risk_config row value could not be parsed as the expected type (decimal, integer, float, array, or JSON object). The breaker falls back to the compiled-in default so evaluation never stalls; the operator should fix the row so the intended threshold is honoured.

Causes: operator wrote a typo into value (e.g. "0.10" as string instead of 0.10 numeric); JSON schema changed without updating seeds; fresh install with a missing row.
Action: SELECT key, value FROM trading.risk_config WHERE key = '<key>'; fix the row to the expected JSON-native type. Also covers the defensive "unknown breaker name" surface in the scheduler where a new breaker was added without updating the event-type match arm.
Escalate: sustained unknown-breaker warnings on the scheduler mean the breaker taxonomy is drifting — update the match arm in scheduler.rs.
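
The fall-back-to-default behaviour can be illustrated with a numeric threshold. The sketch below works on the raw text of the value column and uses a hypothetical function name; the real code presumably decodes JSON, which this std-only example does not attempt.

```rust
/// Resolve a numeric threshold from the raw config value.
/// Returns the value to use and whether the compiled-in default was applied
/// (the case that would log RISK_CONFIG_PARSE_FAILED).
fn threshold_or_default(raw: Option<&str>, default: f64) -> (f64, bool) {
    let parsed = raw
        .and_then(|s| s.trim().parse::<f64>().ok())
        // Rust parses "NaN" and "inf" as f64; neither is a usable threshold.
        .filter(|v| v.is_finite());
    match parsed {
        Some(v) => (v, false),
        None => (default, true),
    }
}
```

A value stored as the quoted string "0.10" keeps its quote characters in the raw text, fails the numeric parse, and falls back to the default, which is the typo case described under Causes. A missing row (None) takes the same path.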

RISK_INTERNAL_ERROR

Severity: Error | Category: Infra

Internal invariant violation — should never fire in practice. Four surfaces:

OpenAPI render failed (openapi export or the /openapi.json handler).
stdout write failed during openapi export -.
Redacted-config serialiser returned a serde_json::Error on the /config handler.
ConnectivityBreaker received a non-empty data_freshness but .values().copied().max() returned None (defensive unreachable arm).

Action: capture the error field. For OpenAPI / serialise surfaces, this typically points to a type that is not Serialize / ToSchema; a code change caused the regression. For the breaker defensive arm, verify PortfolioState::snapshot invariants with a debug build.
Escalate: file a bug — every fire here is a code-level regression.

RISK_BOT_EVENT_INVALID

Severity: Warning | Category: Control

The trading.bot_events NOTIFY listener received a payload it cannot act on:

The payload is not a parseable i64 row id.
A flatten_step_complete event row has a trace_id column that is missing or not a valid UUID.

The row is skipped. For flatten_step_complete specifically, the escalation watcher for the corresponding scope will time out after 30s (RISK_FLATTEN_TIMEOUT) and revoke leases — the system stays safe.

Causes: another service wrote a row with a malformed trace_id (producer bug); NOTIFY payload format drift (schema-version mismatch).
Action: inspect the payload / row_id / trace_id fields and locate the producing service. The row is recorded in trading.bot_events; fix the producer.
Escalate: if the pattern repeats, it indicates a serialisation bug in the executor (producer of flatten_step_complete).
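
When hunting for the producer, it can help to reproduce the two checks offline against a suspect payload. This is a hedged Python sketch of equivalent checks, not the listener's actual code; the function names are hypothetical.

```python
import re
import uuid

I64_MIN, I64_MAX = -(2**63), 2**63 - 1
_INT_RE = re.compile(r"[+-]?[0-9]+")


def parse_row_id(payload: str):
    """Return the payload as an int if it is a plain signed 64-bit integer, else None."""
    if not _INT_RE.fullmatch(payload):
        return None
    value = int(payload)
    return value if I64_MIN <= value <= I64_MAX else None


def is_valid_trace_id(trace_id) -> bool:
    """True if trace_id is a string that parses as a UUID.

    Note: uuid.UUID also accepts braces, a urn:uuid: prefix, and
    un-hyphenated hex, so this may be more lenient than the listener.
    """
    if not isinstance(trace_id, str):
        return False
    try:
        uuid.UUID(trace_id)
    except ValueError:
        return False
    return True
```

A payload that fails parse_row_id, or a row whose trace_id fails is_valid_trace_id, is a candidate for the malformed input described above.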

RISK_SUBSCRIBER_START_FAILED

Severity: Critical | Category: Control

The escalation watcher's shared PostgresSubscriber failed to start its background catch-up + LISTEN loop on bot_events. Without this loop, risk cannot react to flatten_step_complete / flatten_step_failed signals — every flatten escalates by default at the 30s RISK_FLATTEN_TIMEOUT even when the bot completed flatten in seconds.

Causes: Postgres unreachable at startup, role grants regressed (gordon_risk lost SELECT/UPDATE on pipeline_state or bot_events), pipeline_state table missing, schema drift on the cursor row shape.
Action: check Postgres connectivity from gordon-risk; verify gordon_risk grants via cargo test -p gordon-migrate --test grant_matrix; inspect gordon-risk startup logs for the underlying sqlx error.
Escalate: if the failure persists after restart, the entire flatten escalation pipeline degrades to default-30s behaviour. Treat as P0.

RISK_SUBSCRIBER_COMMIT_FAILED

Severity: Warning | Category: Control

The escalation watcher consumed a bot_events row but failed to commit the cursor offset back to pipeline_state for the risk-escalation consumer. The side-effect already fired (escalation registered, lease revoked, etc.), so delivery is at-least-once: on the next gordon-risk restart, the same row will be replayed and the side-effect will run a second time.

Causes: Postgres transient (connection drop, pool exhaustion); gordon_risk lost UPDATE on pipeline_state.consumed_at; long write contention on the cursor row.
Action: confirm replay tolerance by verifying that the escalation handler is idempotent (the same trace_id should produce the same outcome). Check Postgres health and connection pool metrics. Inspect logs for the underlying sqlx error.
Escalate: if commits fail repeatedly, replays could amplify side-effects (multiple lease revocations, double-counted breaker trips). Page on-call if rate exceeds 1/min.
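
The idempotency property the Action line asks for can be illustrated with a toy handler. This Python sketch is an assumption-laden illustration, not the real escalation handler: the class name and outcome string are hypothetical, and an in-memory dict would not survive a restart, so a real check needs durable state keyed on trace_id.

```python
class EscalationHandler:
    """Toy handler that applies each trace_id's side effect at most once."""

    def __init__(self):
        self._outcomes = {}      # trace_id -> outcome recorded at first delivery
        self.side_effects = 0    # counts how often the side effect really ran

    def handle(self, trace_id: str, event: str) -> str:
        # A replayed row returns the recorded outcome without re-running
        # the side effect.
        if trace_id in self._outcomes:
            return self._outcomes[trace_id]
        self.side_effects += 1
        outcome = f"escalated:{event}"
        self._outcomes[trace_id] = outcome
        return outcome
```

Delivering the same trace_id twice leaves side_effects at 1, which is the behaviour a replay after a failed commit should exhibit.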

BOT codes

BOT_LEASE_LOST

Severity: Error | Category: Control

Bot's advisory lease expired before renewal; bot must pause and re-acquire.

Causes: DB connectivity interruption preventing lease refresh; lease-refresh task panicked; system clock skew between container and DB host.
Action: check DB connectivity. The bot should pause and attempt lease re-acquisition automatically. If the bot does not recover within 60 s, the manager reconciler will restart it.
Escalate: repeated lease loss from the same bot indicates a systematic connectivity or clock issue.

BOT_QUARANTINED

Severity: Critical | Category: Control

Manager placed the bot in quarantine due to repeated failures.

Causes: bot exceeded the reconciler's failure threshold (consecutive restart failures); signal-emit failures not self-healing; lease loss not self-healing.
Action: inspect logs from the last N bot restarts: docker compose logs gordon-bot --tail=200. Fix the root cause (DB connectivity, strategy bug, config error). Then: POST /bots/:id/clear-quarantine?confirm=YES-<iso-ts>.
Escalate: if the bot quarantines immediately after clear, fix the root cause before attempting another clear. A bot that quarantines within 5 minutes of clear indicates an unresolved code or config bug.

BOT_INVALID_INTENT

Severity: Error | Category: Control

POST /test/emit-intent body failed validation: unknown fields, unparseable JSON, body > 4 KiB, or qty/sl/tp semantic constraints violated.

Causes: test endpoint called with malformed JSON; numeric fields are negative or zero; body is oversized.
Action: fix the request body. This endpoint is only available when GORDON_BOT_STRATEGY=manual; it is not registered in production strategy mode.
Escalate: if this fires in a non-manual deployment, a misconfigured GORDON_BOT_STRATEGY var may have unintentionally exposed the test endpoint.

BOT_STARTUP_FAILED

Severity: Critical | Category: Infra

Bot failed pre-serve startup: missing --bot-id / GORDON_BOT_ID, config-loader rejection, injected failure gate (GORDON_BOT_FAIL_STARTUP=true, exit code 73), or openapi export CLI render / write failure.

Causes: bot spawned without a bot-id arg or env; bot_configs row missing; env var type mismatch; injected-failure flag set (fidelity 06); filesystem / stdout write failure during spec export.
Action: check container env (docker compose config gordon-bot) for GORDON_BOT_ID and strategy / candle-source vars. Verify the trading.bot_configs row exists for the id. For injected failures, unset GORDON_BOT_FAIL_STARTUP and restart.
Escalate: manager spawned the bot with missing env (see gordon-manager logs for the deploy driver).

BOT_SERVE_ERROR

Severity: Error | Category: Infra

Axum serve returned an error after the server accepted traffic — abnormal exit distinct from a clean drain.

Causes: kernel TCP listener error; runtime panic in a request handler; OOM kill mid-serve.
Action: check docker compose logs gordon-bot --tail=200 for a trailing panic or socket error. Manager will respawn on desired_state=running.
Escalate: repeated BOT_SERVE_ERROR without a preceding explanatory log indicates kernel or runtime-level instability on the host.

BOT_SHUTDOWN_ERROR

Severity: Warning | Category: Infra

Abnormal shutdown: signal handler install failed at startup, or the drain deadline (30 s total) was reached before the drain coordinator advanced past step 6.

Causes: rare kernel signal-registration failure; a drain step stalled and tripped the futures_pending timeout.
Action: confirm process exited; advisory lock will auto-release on connection drop. Manager reconciler takes over.
Escalate: correlate with BOT_DRAIN_STEP_FAILED / BOT_DRAIN_BUDGET_EXCEEDED on the same container to identify the stalled step.

BOT_DRAIN_STEP_FAILED

Severity: Warning | Category: Infra

A single drain step (finish_candle, flush_state, close_listeners, release_lease, emit_drained) failed or timed out against its sub-budget. The step field names which step slipped. Drain keeps moving forward.

Causes: strategy loop wedged; Postgres UPDATE bot_strategy_state slow; listener join hung; lease release sqlx timeout.
Action: usually self-healing — advisory lock auto-releases and the reconciler handles residue. Review the preceding log line for the specific step error.
Escalate: sustained per-step failures across deploys point at a systemic Postgres latency or network issue.

BOT_DRAIN_BUDGET_EXCEEDED

Severity: Warning | Category: Infra

Drain elapsed past the 25 s slow threshold or exceeded the 30 s total budget. Process still exits 0 per story 16.8 AC. The stalled_step field names the step that crossed the threshold.

Causes: cascaded step failures, Postgres overload, wedged liveness task.
Action: manager reconciler will respawn the bot if still desired; the advisory lock auto-releases on the connection drop.
Escalate: chronic drain slowness correlates with upstream Postgres degradation — check pg_stat_activity for long-held locks or connection saturation.

BOT_LEASE_ACQUIRE_TIMEOUT

Severity: Error | Category: Control

LeaseGuard::acquire / acquire_shadow did not obtain the advisory lock within the startup budget (30 s, 100 ms → 2 s backoff).

Causes: another bot container holds the lease for the same (symbol, strategy); stale advisory lock on a disconnected backend.
Action: inspect trading.bot_leases + pg_locks to identify the holder. If the holder is stale, kill the owning Postgres session with pg_terminate_backend to drop the advisory lock.
Escalate: repeated timeouts across restarts indicate a rogue bot instance running elsewhere — check Ansible inventory for duplicate container provisioning.
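
To get a feel for how many acquire attempts fit into the startup budget, the backoff can be modelled offline. This Python sketch assumes the delay doubles from 100 ms up to the 2 s cap and ignores the time each attempt itself takes; the growth factor is an assumption, since the guide only gives the endpoints.

```python
def backoff_schedule(budget_s=30.0, initial_s=0.1, cap_s=2.0, factor=2.0):
    """List the sleep intervals that fit inside the acquire budget."""
    delays = []
    elapsed = 0.0
    delay = initial_s
    # Keep sleeping only while the next sleep still fits in the budget.
    while elapsed + delay <= budget_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap_s)  # grow, but never past the cap
    return delays
```

Under these assumptions the schedule reaches the 2 s cap after five sleeps, so most of the 30 s budget is spent retrying at the capped interval.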

BOT_LEASE_RELEASE_FAILED

Severity: Warning | Category: Control

LeaseGuard::release failed on the graceful-exit path. Not fatal — the advisory lock auto-releases a moment later on connection drop.

Causes: Postgres unreachable during graceful exit; network partition at shutdown.
Action: verify the lock auto-released via pg_locks; the next process start reacquires.
Escalate: pattern across several bots hints at a database-level issue.

BOT_LEASE_LIVENESS_FAILED

Severity: Warning | Category: Control

Lease liveness probe returned a sqlx::Error during the renewal cadence. The loop trips halt, exits cleanly, and a fresh acquire path runs on the next process start.

Causes: transient Postgres connectivity blip; advisory lock connection closed unexpectedly; statement timeout.
Action: self-healing on next start. If repeating, check gordon-postgres health.
Escalate: if paired with BOT_LEASE_LOST (Error), treat it as a lock-loss event and investigate PG session tracking immediately.

BOT_SWAP_FAILED

Severity: Error | Category: Control

Green/blue swap handshake failed: AcquireActive timed out (blue didn't release in time), upgrade_shadow_to_active returned a non-timeout SQL error, or the blue PrepareSwap reply indicated failure. Manager aborts the deploy.

Causes: blue didn't receive prepare_swap; blue exited between receiving and replying; DB unavailable during swap; fence contention.
Action: green exits non-zero, manager aborts the deploy, blue remains active. Check manager deploy-driver logs for the matching deploy_aborted event.
Escalate: repeated swap failures block deploys — review bot_deploys state machine in gordon-manager.

BOT_SWAP_IGNORED

Severity: Warning | Category: Control

Swap command deferred / ignored: channel not wired (legacy test path), wrong mode (active bot received acquire_active or shadow received prepare_swap), swap channel closed because the liveness loop already exited, or the reply channel was dropped.

Causes: manager issued a swap command to a bot in the wrong mode; test harness path without swap-channel wiring; bot already draining when command arrived.
Action: audit event swap_deferred or swap_role_pending carries the reason; manager retries or aborts based on the state machine.
Escalate: not pageable alone; pattern with BOT_SWAP_FAILED is the concern.

BOT_SWAP_COMMAND_MALFORMED

Severity: Warning | Category: Control

Swap command payload missing required swap_id or deploy_id. Command is dropped; the swap_command_malformed audit event is published.

Causes: manager bug emitted a partial payload; schema drift between manager and bot.
Action: investigate the originating manager deploy for the swap envelope shape.
Escalate: indicates a schema-version mismatch; coordinate gordon-contracts + gordon-manager versions.

BOT_SUBSCRIBER_START_FAILED

Severity: Warning | Category: Infra

Failed to open a PgListener subscription for one of the inbound channels (order_events, fill_events, bot_commands). That listener exits; the rest of the bot keeps running.

Causes: Postgres unreachable at listener start; connection-pool saturation; PgListener session limit.
Action: missing bot_commands means operator commands no longer reach the bot — restart the container. Missing fill_events means fills don't update strategy state until restart (reconcile on next start catches up).
Escalate: pattern across services indicates PG connection ceiling hit.

BOT_SUBSCRIBER_COMMIT_FAILED

Severity: Warning | Category: Infra

commit_offset for a consumed NOTIFY row failed. The row will replay on next drain; the LRU dedup suppresses double application of side effects.

Causes: transient Postgres connectivity; cursor row contention.
Action: self-healing via replay + idempotency; no operator action unless sustained.
Escalate: sustained failures imply pipeline_state write path unhealthy.
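
The replay-plus-dedup behaviour can be pictured with a small LRU set. This is a generic Python sketch of the technique, not the bot's implementation; the class name, the capacity, and keying on row id are assumptions.

```python
from collections import OrderedDict


class LruDedup:
    """Remembers the most recently seen row ids, up to a fixed capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, row_id: int) -> bool:
        """True on first sight of row_id; False for a replay still in the cache."""
        if row_id in self._seen:
            self._seen.move_to_end(row_id)   # refresh recency
            return False
        self._seen[row_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)   # evict the least recently seen id
        return True
```

A replayed row is suppressed only while its id is still cached, which is one reason sustained commit failures deserve attention: a long enough backlog could outlive the cache.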

BOT_COMMAND_INVALID

Severity: Warning | Category: Control

trading.bot_commands row malformed: row lookup after NOTIFY failed, the row disappeared between NOTIFY and SELECT, the command column is NULL, or the variant is unknown (forward-compat surface).

Causes: race between NOTIFY and row-cleanup job; operator issued an unsupported variant; manager feature flag left stale.
Action: the specific row is dropped; legitimate commands continue to dispatch. Check trading.bot_commands for orphans.
Escalate: unknown variants across deploys warrant a shared-schema bump.

BOT_CONFIG_RELOAD_FAILED

Severity: Warning | Category: Control

reload_config command failed: the bot_configs row reload errored, or the registry refused to re-instantiate the strategy against the new params. The running instance is preserved.

Causes: operator edited bot_configs.strategy_params with invalid JSON; registry schema validation rejects new params; DB unavailable.
Action: fix the strategy_params JSON per the strategy's schema (GET /strategies/:name/schema). Re-issue reload_config once valid.
Escalate: strategy-schema drift between bot image and operator's config source requires coordinated rollout.

BOT_IPC_PUBLISH_FAILED

Severity: Warning | Category: Infra

Best-effort bot_events publish failed (heartbeat, drain, lease, fallback, strategy events). The DB row is authoritative; audit-bus absence does not roll back the action.

Causes: Postgres unreachable; trading.bot_events RLS rejected the insert (expected benign surface — see op-07e log-level rebalance: the heartbeat publisher stamps the RLS GUC before inserting but a transient restart can lose the GUC).
Action: self-healing; not pageable unless sustained. Correlate with gordon-postgres health.
Escalate: sustained publish miss rate → manager dashboards lose the bot's audit trail; investigate the RLS app.bot_id GUC stamping path.

BOT_ROLE_PROBE_BYPASS_DETECTED

Severity: Warning | Category: Control

Startup permission probe (story 16.9 / op-21) ran a statement that should have been rejected at 42501 (insufficient_privilege) but got past the privilege check. Treated as excess-privilege (fail-safe): the role has more grants than intended.

Causes: gordon_bot Postgres role mis-configured; migration 0044 rolled back; manual GRANT run against production DB.
Action: STOP DEPLOYS. Audit \du gordon_bot + \dp trading.* against migration 0044. Revoke excess grants before any bot is restarted.
Escalate: this is a security-posture event — notify infra channel + file an incident report.

BOT_WARMUP_INCOMPLETE

Severity: Error | Category: Data

Warmup (POST /warmup on gordon-data) returned an incomplete dataset or warnings that failed the strict-mode completeness check. The bot aborts boot.

Causes: gordon-data missing historical rows for the requested lookback; partition out of range; live ingest lagging startup.
Action: check gordon-data /healthz + /readyz. Run make seed-klines + make fill-gaps + make precompute to backfill the lookback window.
Escalate: persistent warmup failures after backfill point to gordon-data-side aggregation or partitioning bugs — check the DATA_AGGREGATION_ERROR log stream.

BOT_CANDLE_WS_INVALID_FRAME

Severity: Warning | Category: Data

Candle WS received a frame the bot cannot use: JSON parse failure, server error frame, or payload failed shared-type validation (e.g. unknown symbol).

Causes: gordon-data WS protocol drift; malformed upstream Binance frame; schema-version mismatch.
Action: driver logs + reconnects. Check gordon-data logs for the corresponding server-side error.
Escalate: sustained bad frames correlate with gordon-data / gordon-contracts WS schema mismatch — pin versions.

BOT_CANDLE_FALLBACK_ENGAGED

Severity: Warning | Category: Data

Candle REST fallback engaged (WS down > 30 s) or escalated to degraded tier (engaged > 10 min). Operator investigation required at the degraded tier.

Causes: gordon-data WS down; network partition between bot and gordon-data; gordon-data OOM/crash.
Action: check gordon-data /healthz + Docker status. Fallback will auto-disengage when WS reconnects; degraded tier means manual investigation is overdue.
Escalate: > 10 min fallback → paging event; prolonged fallback skews candle freshness and may stall strategies.
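
The two thresholds can be expressed as a small classifier. This Python sketch assumes the 10-minute degraded clock starts when fallback engages (that is, 30 s after the WS drops); the tier names and the function are hypothetical.

```python
FALLBACK_AFTER_S = 30        # WS down longer than this engages REST fallback
DEGRADED_AFTER_S = 10 * 60   # fallback engaged longer than this is degraded


def candle_source_tier(ws_down_s: float) -> str:
    """Classify the candle source tier from how long the WS has been down."""
    if ws_down_s <= FALLBACK_AFTER_S:
        return "ws"
    engaged_s = ws_down_s - FALLBACK_AFTER_S  # time spent in fallback so far
    return "degraded" if engaged_s > DEGRADED_AFTER_S else "fallback"
```

Under this reading, a WS outage crosses into the degraded tier a little after 10 minutes and 30 seconds of total downtime.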

BOT_CANDLE_FALLBACK_POLL_FAILED

Severity: Warning | Category: Data

A single fallback-poll attempt failed (HTTP error from gordon-data, or fallback engaged before warmup seeded a cursor). Next 5 s tick retries.

Causes: gordon-data REST unreachable; 5xx response; warmup race.
Action: self-healing on next tick.
Escalate: sustained poll failures imply gordon-data-side degradation.

BOT_CANDLE_REJECTED

Severity: Warning | Category: Data

A scripted candle fixture emitted a candle rejected by the runtime-state cursor (duplicate or out-of-order close_time_ms). Test / fidelity path only; the scripted source never runs in production.

Causes: fixture authored with overlapping rows; replay bug in the scripted driver.
Action: fix the fixture JSON; inspect close_time_ms monotonicity.
Escalate: if seen in production logs, a production container is running the scripted source — misconfiguration, halt deploy.
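
A fixture can be checked for close_time_ms monotonicity before it is replayed. This Python sketch assumes the fixture is a list of objects that each carry a close_time_ms field; the real fixture layout may differ.

```python
def first_non_monotonic(candles):
    """Return the index of the first candle whose close_time_ms does not
    strictly increase over its predecessor, or None if the fixture is clean."""
    for i in range(1, len(candles)):
        if candles[i]["close_time_ms"] <= candles[i - 1]["close_time_ms"]:
            return i
    return None
```

A non-None result points at the first duplicate or out-of-order row, which is the row to fix in the fixture JSON.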

BOT_STRATEGY_LOOP_HALTED

Severity: Warning | Category: Control

Strategy loop halted on a non-evaluation surface: ChannelSink send failed (downstream already dead), lease halt flag tripped mid-candle, persist_state SQL errored on the no-signal path, serialize_state failed, or Postgres NOW() probe failed.

Causes: downstream consumer dropped; halt flag flipped by listener / drain; transient Postgres error on state persistence.
Action: loop exits cleanly; process reaper (graceful_shutdown) handles the rest. Manager respawns if desired_state=running.
Escalate: correlate with BOT_LEASE_LOST / BOT_DRAIN_STEP_FAILED to identify the upstream trigger.

BOT_STRATEGY_EVALUATION_ERROR

Severity: Error | Category: Control

Strategy::evaluate returned a non-panic StrategyError. The loop halts.

Causes: strategy-side invariant violation (e.g. unsupported timeframe, config drift, state corruption).
Action: check error structured field for the StrategyError variant. Fix config or reset trading.bot_strategy_state if corruption is suspected. Manager respawns on desired_state.
Escalate: repeated eval errors on the same strategy point to a code-level bug — file a story against the strategy crate.

BOT_STRATEGY_PANIC

Severity: Critical | Category: Control

Strategy::evaluate panicked inside the spawn_blocking panic guard. Strategy state is discarded; the loop halts and the process exits so manager can respawn with a clean instance.

Causes: strategy code bug (e.g. unwrap on a None, division by zero, array out-of-bounds).
Action: STOP pulling in new strategy-crate versions until the panic source is fixed. Capture the panic payload from the structured log (panic field) and file a critical issue.
Escalate: paging event. A panicking strategy can quarantine a bot and block its run.

BOT_FENCE_MISMATCH

Severity: Warning | Category: Control

Fence gate inside the emission transaction rejected the intent: bot_leases row missing, holder_bot_id mismatch, or fence advanced externally. Covers live + manual emission paths.

Causes: lease lost mid-flight (racing with liveness detection); operator ran a manual fence bump; green/blue swap between read and commit.
Action: self-healing via halt + respawn path. Verify bot_leases state. Check for an in-flight BOT_LEASE_LOST on the same bot.
Escalate: recurring fence mismatch without a paired BOT_LEASE_LOST indicates an unknown mutator on bot_leases.

BOT_INTENT_EMIT_FAILED

Severity: Error | Category: Control

Atomic intent-emission transaction failed at the SQL layer (fence read + intent insert + state upsert). Covers live and manual emission paths.

Causes: Postgres unreachable mid-txn; order_intents unique-constraint violation; pool saturation.
Action: loop halts; manager respawns if desired_state=running. Inspect the underlying error field for the SQL cause.
Escalate: repeated emit failures block signal production — check Postgres health and pool saturation.

BOT_ON_FILL_FAILED

Severity: Warning | Category: Control

Strategy::on_fill returned an error, a post-on_fill state-serialization or -persistence step failed, or a fill_events payload failed to decode. Listener logs + continues (fill already occurred; SL is exchange-resident).

Causes: strategy on_fill invariant; serialize-state failure; payload schema drift; DB transient error on upsert_strategy_state.
Action: not pageable alone — next evaluate tick re-persists state. For decode failures, confirm executor and bot ship matching gordon-contracts versions.
Escalate: sustained BOT_ON_FILL_FAILED with same intent_id means the strategy is silently ignoring a fill — investigate the strategy's on_fill contract.

BOT_ORDER_EVENT_INVALID

Severity: Warning | Category: Control

An order_events row was malformed: payload decode failed or the event tag is not a known variant (submitted / acked / filled_partial / filled_complete / rejected / cancelled). Forward-compat surface — the row is skipped.

Causes: executor / bot shared-schema version skew; forward-compat additive field that fails older validation.
Action: check the event field in the structured log against the supported taxonomy. Pin compatible gordon-contracts versions between executor + bot.
Escalate: if the unknown event is a new executor state transition not yet wired into the bot pending set, schedule a bot upgrade.

DATA codes

DATA_INGEST_GAP_DETECTED

Severity: Warning | Category: Data

A gap in inbound market data exceeds the configured tolerance (e.g. missing 1m candles).

Causes: Binance WebSocket disconnected and reconnect took > tolerance; Binance API outage; gordon-data container restarted mid-stream.
Action: check docker compose logs gordon-data --tail=100 for reconnect events. Run make fill-gaps to backfill synthetic candles for the gap window. Verify market_data.spot_klines has no missing 1m bars before resuming live bots.
Escalate: gaps > 30 min indicate a sustained Binance outage. Monitor status.binance.com and wait for normalisation before running strategies.

DATA_UNKNOWN_SYMBOL

Severity: Warning | Category: Data

symbol query parameter is not in the configured allowlist (GORDON_DATA_SYMBOL_ALLOWLIST).

Causes: caller requests a symbol not in the default 10-pair allowlist; allowlist not extended after adding a new pair.
Action: if the symbol is intentionally new, add it to GORDON_DATA_SYMBOL_ALLOWLIST in compose env and re-seed historical data. If the symbol is a typo, fix the caller.

DATA_LIMIT_EXCEEDED

Severity: Warning | Category: Data

limit query parameter exceeds the maximum allowed value (5000 for klines endpoints).

Causes: client requesting too many candles in one call.
Action: use pagination (from/to window) or reduce the limit parameter.
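
One way to paginate is to split the requested range into from/to windows that each stay within the limit. This Python sketch assumes half-open windows in epoch-milliseconds and a fixed candle width; whether the endpoint treats to as inclusive is not stated here, so adjust the boundaries if needed.

```python
MAX_LIMIT = 5000  # klines limit quoted in this section


def kline_windows(from_ms, to_ms, tf_ms, max_limit=MAX_LIMIT):
    """Split [from_ms, to_ms) into windows spanning at most max_limit candles."""
    if from_ms >= to_ms:
        raise ValueError("from_ms must be strictly less than to_ms")
    span = tf_ms * max_limit  # widest window that still fits the limit
    windows = []
    start = from_ms
    while start < to_ms:
        end = min(start + span, to_ms)
        windows.append((start, end))
        start = end
    return windows
```

For example, 12000 one-minute candles (tf_ms = 60000) split into three windows, each of which can be requested without tripping the limit.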

DATA_INVALID_TIMEFRAME

Severity: Warning | Category: Data

tf query parameter is not a recognised timeframe string.

Causes: caller using uppercase (1H), a non-standard string, or a timeframe not in: 1m 5m 15m 30m 1h 2h 4h 6h 8h 12h 1d 1w.
Action: fix the caller to use lowercase timeframe strings from the above list.

DATA_INVALID_TIMERANGE

Severity: Warning | Category: Data

from/to window is invalid: from must be strictly less than to (epoch-milliseconds).

Causes: caller reversed from/to; from equals to; milliseconds vs seconds confusion (off by 1000×).
Action: verify both values are epoch-milliseconds and from < to. Check for units confusion — Binance timestamps are milliseconds.
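
Both checks can be run on the caller side before the request is sent. This Python sketch is an illustration only: the seconds-versus-milliseconds test is a heuristic assumption (values below 1e11 would fall in early 1973 or before when read as milliseconds), not a rule enforced by gordon-data.

```python
SECONDS_SUSPECT_BELOW = 100_000_000_000  # 1e11 ms is early 1973


def check_timerange(from_ts: int, to_ts: int):
    """Return a list of problems with an epoch-milliseconds from/to window."""
    problems = []
    if from_ts >= to_ts:
        problems.append("from must be strictly less than to")
    # Heuristic: a small positive value is probably epoch-seconds.
    for name, value in (("from", from_ts), ("to", to_ts)):
        if 0 < value < SECONDS_SUSPECT_BELOW:
            problems.append(f"{name} looks like epoch-seconds, not milliseconds")
    return problems
```

An empty list means the window passes both checks; a seconds-valued pair is flagged even when from is less than to, since the server would interpret it as a window in 1970.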

DATA_INVALID_KIND

Severity: Warning | Category: Data

An enum-valued query parameter (kind on /long_short_ratio, side on /liquidations) did not match any accepted variant.

Causes: typo in kind or side value; client code not updated after API change.
Action: fix the caller. Valid kind values: global | top_account | top_position. Valid side values: BUY | SELL.

DATA_INVALID_REQUEST

Severity: Error | Category: Data

POST /warmup request body failed structural or semantic validation before any repository call. Covers: empty dataset list, dataset-count cap (>12) exceeded, unknown kind tag, required field missing or blank, numeric bounds violated (lookback_bars / lookback_count must be 1–5000; lookback_minutes must be 1–10080).

Causes: client bug; stale generated client not matching current API shape.
Action: inspect the message field in the HTTP 400 body for the exact constraint. Regenerate the typed client from the OpenAPI spec if the shape has changed.

DATA_QUERY_FAILED

Severity: Error | Category: Infra

A repository query failed at runtime — the DB returned an error that is neither a constraint violation nor a connectivity probe failure. Typical causes: query timeout, pool exhaustion, temporary connectivity blip.

Causes: Postgres overloaded; pool max_connections exhausted by concurrent warmup requests; transient network partition between gordon-data and Postgres.
Action: check docker compose logs gordon-data --tail=100 for the underlying sqlx error. Check Postgres CPU + connection count (SELECT count(*) FROM pg_stat_activity). The 500 response is returned to the caller — the bot/manager retries the request.
Escalate: if the rate of DATA_QUERY_FAILED is sustained (>1/min over 5 min), Postgres may be saturated. Check market_data.* index health and query plans.
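
The escalation threshold can be evaluated from a list of failure timestamps pulled out of the logs. This Python sketch reads ">1/min over 5 min" as an average rate across the window, which is an interpretation rather than a documented alert rule; the function name is hypothetical.

```python
def sustained_failure(timestamps_s, now_s, window_s=300, per_min=1.0):
    """True if failures inside the last window_s seconds exceed per_min per minute."""
    # Keep only failures that fall inside the trailing window.
    recent = [t for t in timestamps_s if now_s - window_s < t <= now_s]
    return len(recent) / (window_s / 60) > per_min
```

With the defaults, six failures in the last five minutes trip the threshold, while exactly five do not.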

DATA_DB_PROBE_FAILED

Severity: Error | Category: Infra

The DB connectivity probe failed. Fired on /healthz (returns 503) and on the /warmup 503 path. Indicates gordon-data cannot reach Postgres at all.

Causes: Postgres container not running; GORDON_DATABASE_URL misconfigured; network partition between gordon-data and Postgres containers.
Action: docker compose ps postgres — verify Postgres is running. Check GORDON_DATABASE_URL in the service environment. Check container network: both containers must be on the same compose network.
Escalate: if Postgres is running and reachable from the host but not from gordon-data, inspect the compose network config. If the disk is full, Postgres may have shut itself down — check df -h on srv-apps.

DATA_SOURCE_NOT_REGISTERED

Severity: Warning | Category: Data

SourceHealthRegistry::record_success was called with a source ID that was never registered at startup. Indicates a code-level bug — a source emits health ticks without having registered itself in the registry.

Causes: new ingest source added without a register() call in the startup path; source ID mismatch between registration and tick emission.
Action: this is a programming error in gordon-data, not an operator issue. Open a bug report and fix the registration gap. The missing registration means the source will not appear in /sources/health output.

DATA_ROLE_PROBE_ERROR

Severity: Warning | Category: Infra

The startup DB role-probe encountered an unexpected SQL error — neither 42501 (insufficient_privilege) nor 42P01 (undefined_table). The query got past the privilege check, so the service treats this as excess-privilege for fail-safe behaviour and refuses to start.

Causes: DB schema drift; role has unexpected privileges; new Postgres error code not handled by the probe; transient connectivity during the probe query.
Action: check docker compose logs gordon-data --tail=50 for the raw SQL error. Verify the gordon DB role has exactly the privileges in migration 0016 (no INSERT on trading schema; write on market_data). Re-run make db-setup if role permissions have drifted.

DATA_STARTUP_FAILED

Severity: Critical | Category: Infra

Fatal startup failure: configuration load rejected the environment, the serve loop exited with an error, or the backfill-report DB pool could not be opened. Process exits non-zero; orchestrator restarts or pages.

Causes: malformed GORDON_DATA_* env vars; missing GORDON_DATABASE_URL; Binance URL unreachable at startup; read-only key probe discovered can_withdraw=true.
Action: docker compose logs gordon-data --tail=100 and look for the preceding error = ... context on the failed to load configuration / server exited with error line. Fix the config or credential and restart.
Escalate: if the service crash-loops for >5 min, page on-call.

DATA_SHUTDOWN_ERROR

Severity: Warning | Category: Infra

A background task (scheduler, ingest driver, subscriber, serve loop) returned an error or panicked while the shutdown coordinator was awaiting drain. Process is exiting; drain still proceeds — this is an observability signal, not a safety event.

Causes: race at shutdown where a source handle panicked mid-tick; axum serve returned after the broadcast fired; drain budget exceeded.
Action: informational — look at the task = ... / error = ... fields to see which task slipped. Open a follow-up if the same task fails repeatedly across restarts.

DATA_INTERNAL_ERROR

Severity: Error | Category: Infra

Internal invariant violation: OpenAPI render failed, or a file-/stdout-write of the rendered spec failed on the openapi export subcommand path. Should never fire in practice — it signals a code-level bug (utoipa rejected a derived schema, or stdout was redirected to a read-only target).

Causes: a schema derivation broke after a rebase; openapi export - > file ran with file owned by another user.
Action: cargo run --bin gordon-data -- openapi export - locally to reproduce the render error. If it fails, the OpenAPI schema closure is broken — fix the utoipa::ToSchema derive that regressed.

DATA_BACKFILL_CLI_INVALID

Severity: Error | Category: Data

Backfill CLI rejected command-line arguments before any DB work: invalid date range (from >= to), malformed YYYY-MM-DD date, unknown period token, or no symbols supplied. Process exits with code 2 (usage error).

Causes: operator typo; wrong flag order; missing --symbols on a source that requires it.
Action: inspect the raw = ... / error = ... fields to see which arg was rejected, then re-run with the correction. See gordon-data backfill --help.

DATA_BACKFILL_FAILED

Severity: Error | Category: Data

A backfill job failed at runtime: driver task panicked, cancel-awaited task panicked, finalise reported BackfillRunError (non-cancel), or the CLI subcommand exited non-zero.

Causes: upstream API outage mid-run; DB insert failed; cursor advancement logic hit an unreachable branch.
Action: check GET /backfill/jobs/:id for the error field — it carries the underlying BackfillRunError message. Re-run the job after the upstream is healthy.
Escalate: sustained backfill failures (>3 consecutive) indicate an upstream contract regression or a provider rate-limit change.

DATA_BACKFILL_CONFLICT

Severity: Warning | Category: Data

POST /backfill/<source> rejected because a running job already holds the (source, symbol_key) conflict key. Response is 409 with existing_job_id so the caller can poll the in-flight job.

Causes: operator double-clicked "Run backfill" in the console; automation retried without awaiting completion.
Action: poll GET /backfill/jobs/:existing_job_id until terminal, then re-submit if needed.

DATA_JOB_NOT_FOUND

Severity: Warning | Category: Data

DELETE /backfill/jobs/:id or GET /backfill/jobs/:id was given a job id that is not in the in-memory registry. Returns 404.

Causes: stale UUID from a bookmark; entry aged out of the terminal-retention cap (in-memory-only; no DB persistence).
Action: list current jobs at GET /backfill/jobs; the caller should refresh its job id.

DATA_BACKFILL_CURSOR_STUCK

Severity: Warning | Category: Data

A backfill source (spot_klines / perp_klines / funding_rates / open_interest) detected a cursor that did not advance after a page fetch. Driver breaks out of the per-symbol loop to prevent an infinite spin.

Causes: upstream provider returned a page whose last row shares the cursor timestamp (boundary condition); provider pagination regression.
Action: inspect symbol = ... cursor = ... next_cursor = ... in the log. If next_cursor == cursor, the provider bug is confirmed — re-run with a narrowed window avoiding the stuck boundary. A terminal DATA_BACKFILL_FAILED follows if the symbol yields zero rows.
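
The stuck-cursor guard can be pictured with the sketch below. It is an illustration under assumptions, not the driver's code: `Page`, `Outcome` and `drain_symbol` are hypothetical, cursors are plain integers, and the guard also treats a cursor that moves backwards as stuck.

```rust
/// One page of results: rows fetched and the cursor to use next.
struct Page {
    rows: usize,
    next_cursor: i64,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Done { total_rows: usize },
    CursorStuck { cursor: i64, total_rows: usize },
}

/// Page through [cursor, end) and stop as soon as the cursor fails to advance.
fn drain_symbol<F>(mut cursor: i64, end: i64, mut fetch: F) -> Outcome
where
    F: FnMut(i64) -> Page,
{
    let mut total_rows = 0;
    while cursor < end {
        let page = fetch(cursor);
        total_rows += page.rows;
        if page.next_cursor <= cursor {
            // Without this guard the loop would refetch the same page forever.
            return Outcome::CursorStuck { cursor, total_rows };
        }
        cursor = page.next_cursor;
    }
    Outcome::Done { total_rows }
}
```

Returning the stuck cursor value gives the operator the boundary to avoid when re-running with a narrowed window.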

DATA_INGEST_WS_CLOSED

Severity: Warning | Category: Infra

A WebSocket or IPC subscriber stream ended unexpectedly. The combined ingest receiver was closed by the gordon-exchange reconnect loop, or a PostgresSubscriber stream yielded None (upstream dropped). The ingest / subscriber task exits cleanly.

Causes: gordon-exchange's internal reconnect loop gave up; Postgres LISTEN connection dropped; source commands subscriber was cancelled.
Action: informational by itself. Look for preceding DATA_SOURCE_FETCH_FAILED entries. If the ingest does not resume (watch /sources/health), restart the service.

DATA_INGEST_FRAME_DROPPED

Severity: Warning | Category: Data

A WebSocket frame was unusable: missing symbol, invalid symbol / timeframe format, trade_time overflowed i64, or a malformed DataEvent envelope was decoded. The frame is dropped; the driver continues.

Causes: upstream API emitted a protocol variant new to this build; connection bit-flip on a low-quality network path.
Action: look at the symbol = ... error = ... context. If the same malformed shape repeats across many frames from the same symbol, the upstream contract has shifted — update the parser.
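
A rough sketch of the drop conditions follows. `RawFrame`, `FrameError` and `validate_frame` are hypothetical, and the symbol format (uppercase ASCII letters and digits) is an assumption made for illustration; the real parser's rules may differ.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
enum FrameError {
    MissingSymbol,
    InvalidSymbol,
    TradeTimeOverflow,
}

/// Hypothetical raw frame as decoded from the wire.
struct RawFrame<'a> {
    symbol: Option<&'a str>,
    trade_time_ms: u64,
}

/// Return the usable (symbol, trade_time) pair, or the reason to drop the frame.
fn validate_frame(frame: &RawFrame) -> Result<(String, i64), FrameError> {
    let symbol = frame.symbol.ok_or(FrameError::MissingSymbol)?;
    // Assumed format: non-empty, ASCII uppercase letters and digits only.
    let well_formed = !symbol.is_empty()
        && symbol
            .bytes()
            .all(|b| b.is_ascii_uppercase() || b.is_ascii_digit());
    if !well_formed {
        return Err(FrameError::InvalidSymbol);
    }
    // u64 -> i64 fails for values above i64::MAX instead of wrapping negative.
    let trade_time =
        i64::try_from(frame.trade_time_ms).map_err(|_| FrameError::TradeTimeOverflow)?;
    Ok((symbol.to_string(), trade_time))
}
```

On `Err`, the driver would log the reason, drop the frame and continue with the next one.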

DATA_INGEST_WRITE_FAILED

Severity: Error | Category: Infra

A persist path on the WS ingest failed: klines upsert_one_into or the liquidations bulk insert returned a sqlx::Error. The frame is dropped (not retried); the row is recovered only if upstream re-emits it in a later frame.

Causes: Postgres connection pool exhausted; partition add-new failed; constraint violation (data-shape mismatch).
Action: check Postgres health; inspect the error = ... field. For liquidations, a dropped bulk insert loses history — alert on sustained rate.
Escalate: sustained write failure (>1/min over 5 min) → page on-call.
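
One way to read the "more than 1/min over 5 min" threshold is "more than five failures inside a 300-second sliding window". The sketch below implements that reading. `FailureWindow` is hypothetical, timestamps are plain seconds, and in practice this rule would more likely live in the alerting stack than in service code.

```rust
use std::collections::VecDeque;

/// Sliding window over failure timestamps (in seconds).
struct FailureWindow {
    window_secs: u64,
    threshold: usize,
    events: VecDeque<u64>,
}

impl FailureWindow {
    fn new(window_secs: u64, threshold: usize) -> Self {
        Self { window_secs, threshold, events: VecDeque::new() }
    }

    /// Record a failure at `now` and report whether the rate is sustained,
    /// i.e. more than `threshold` failures fall inside the window.
    fn record(&mut self, now: u64) -> bool {
        self.events.push_back(now);
        while let Some(&oldest) = self.events.front() {
            if now.saturating_sub(oldest) >= self.window_secs {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len() > self.threshold
    }
}
```

With `FailureWindow::new(300, 5)`, five failures spread over four minutes stay below the paging threshold, while a sixth inside the same window crosses it.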

DATA_SOURCE_WRITE_FAILED

Severity: Error | Category: Infra

A scheduler source row-write failed at the DB layer. Covers upstream sources (binance_funding, binance_open_interest, binance_long_short_ratio, alternative_fear_greed, defillama_ssr, fred_macro) and derived sources (derived_metrics, derived_vpin, klines_common). Scheduler retry loop applies on the next tick; sustained failure trips the quarantine threshold.

Causes: Postgres pool exhausted; constraint mismatch; partition missing for the target timestamp.
Action: correlate with DATA_INGEST_WRITE_FAILED. If both fire, the DB itself is the likely fault (unreachable or degraded); triage at the DB layer first.

DATA_SOURCE_FETCH_FAILED

Severity: Warning | Category: Infra

A scheduler source fetch failed at the upstream API or during parsing: network error, HTTP non-2xx, response-body parse failure (timestamp, numeric, enum), or a staleness probe hit an error. Retryable; the scheduler's retry-budget + quarantine machinery owns escalation.

Causes: upstream rate limit; provider outage; upstream schema change.
Action: watch for escalation to DATA_SOURCE_QUARANTINED. If the source is Deribit GEX or FRED, a single failure is noisy but not actionable — escalate only on sustained rate.

DATA_SCHEDULER_PANIC

Severity: Error | Category: Infra

The scheduler received SourceError::Panic(msg) on a source fetch — a source invariant was violated inside the fetcher. Distinct from a tokio task panic (which would propagate out of the driver).

Causes: a .unwrap() on a response envelope; a derive regression; an out-of-bounds vector index after an API shape change.
Action: the panicking source is named in source = .... Investigate the invariant. The source is NOT quarantined automatically on panic — operator must decide whether to quarantine manually while the fix ships.

DATA_SOURCE_QUARANTINED

Severity: Warning | Category: Safety

The scheduler flipped a source into the quarantined state after a non-retryable failure or streak-exhaustion. The source is taken out of rotation until an operator issues SourceCommand::Unquarantine via the manager BFF.

Causes: preceding DATA_SOURCE_FETCH_FAILED or DATA_SOURCE_WRITE_FAILED streak.
Action: fix the root cause (upstream or DB); issue an Unquarantine command to resume. Use POST /sources/:name/unquarantine (operator-token-protected).

DATA_IPC_PUBLISH_FAILED

Severity: Warning | Category: Infra

Best-effort publish of a DataEvent (KlineWritten, SourceQuarantined, SourceUnquarantined, SourceFailure, MacroWritten, GexSnapshot) to the trading.data_events Postgres channel failed. The authoritative state lives in market_data.*; a missing audit-bus event does not roll back the write. Not pageable alone.

Causes: Postgres NOTIFY payload size exceeded; connection briefly lost.
Action: informational. Dashboards should alert on sustained miss rate.

DATA_SOURCE_REGISTRATION_FAILED

Severity: Warning | Category: Infra

A scheduler source builder returned an error at startup: HTTP client init failed, or a required constructor argument was rejected. The source is not registered; the rest of the scheduler still starts. The read-only Binance key-probe failure surfaces here too — gordon-data continues with public endpoints only.

Causes: malformed env var for a specific source; missing API key for FRED / Binance; startup probe timeout.
Action: look at the role = ... error = ... pair. Fix the env or endpoint and restart. Until then, the affected source does not emit rows and /sources/health reports it as stale.

DATA_REPORT_COMPUTATION_FAILED

Severity: Error | Category: Infra

A /backfill/<source>/report handler failed at the DB layer: sqlx::Error on the coverage-count query, or the underlying report-computation runner returned an error. Returns 500.

Causes: same classes as DATA_QUERY_FAILED — pool exhausted, timeout, connectivity blip.
Action: correlate with DATA_QUERY_FAILED. Caller retries once DB health is restored.

DATA_SUBSCRIBER_FAILED

Severity: Warning | Category: Control

source_commands subscriber surface — one code covers start failure, stream-end, cursor commit_offset failure, unknown / malformed variant, and Unquarantine targeting an unknown source. gordon-data exposes a single SourceCommand::Unquarantine variant today so the taxonomy collapses what gordon-bot splits across three codes.

Causes: manager or operator sent an unknown SourceCommand variant (forward-compat); subscriber channel dropped; LISTEN cursor commit lost a race with a pool blip.
Action: informational. If a specific unknown variant shows up repeatedly, the manager is newer than this data build — roll the data service.

MANAGER codes

MANAGER_RECONCILER_DRIFT

Severity: Error | Category: Control

Desired-state reconciler found drift between bot_configs and the running container set that it could not resolve.

Causes: docker-socket-proxy returned unexpected state; container was manually killed outside manager; bot_configs.desired_state and actual container state stayed diverged past the max-retry window.
Action: check trading.bot_configs for the affected id. Compare desired_state vs status. Inspect reconciler logs for the specific drift. If the container is zombie, prune it manually then let the reconciler recover.
Escalate: reconciler drift that persists > 5 minutes indicates the reconciler is stuck in backoff — check trading.reconciler_state for the affected bot.

MANAGER_UNAUTHORIZED

Severity: Warning | Category: Control

X-Operator-Token header absent, wrong, or not configured on the manager service.

Causes: GORDON_MANAGER_OPERATOR_TOKEN not set (returns 503); token missing or wrong on request (returns 401 — indistinguishable by design to prevent config side-channel).
Action: verify GORDON_MANAGER_OPERATOR_TOKEN in docker-compose.yml. If the service is returning 503 on all protected routes, the token was not set at startup.

MANAGER_INVALID_IMAGE_TAG

Severity: Error | Category: Control

image_tag on POST /bots or target_image_tag on POST /bots/:id/promote does not match the Docker tag format ^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$.

Causes: tag contains slashes, spaces, or special characters; tag is empty; tag is > 128 characters.
Action: fix the tag to match the format. Valid examples: sha-a1b2c3d, v1.2.3, latest.
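
For callers that want to pre-validate a tag before sending it, the format above can be checked without a regex dependency. This is a sketch of an equivalent check, not the manager's own validator; `is_valid_image_tag` is a hypothetical name.

```rust
/// Mirrors ^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$ using byte checks only.
fn is_valid_image_tag(tag: &str) -> bool {
    let bytes = tag.as_bytes();
    // One leading character plus at most 127 more: 1..=128 bytes in total.
    if bytes.is_empty() || bytes.len() > 128 {
        return false;
    }
    // The first character must be alphanumeric (no leading '.', '_' or '-').
    if !bytes[0].is_ascii_alphanumeric() {
        return false;
    }
    bytes[1..]
        .iter()
        .all(|&b| b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'-'))
}
```

Because every allowed character is ASCII, the byte length equals the character length, so the 128-character limit can be checked on bytes.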

MANAGER_INVALID_CURSOR

Severity: Error | Category: Control

Pagination cursor failed HMAC verification, base64 decoding, or structural validation.

Causes: cursor was tampered with; GORDON_MANAGER_OPERATOR_TOKEN was rotated between when the cursor was issued and when it was presented; cursor from a different environment (staging vs prod).
Action: discard the cursor and restart pagination from the first page. If this fires after a token rotation, all cached cursors are invalid — expected behaviour.

MANAGER_INVALID_STATE_TRANSITION

Severity: Error | Category: Control

Lifecycle action violates the bot state machine.

Causes: attempting to start an already-running bot; pausing a stopped bot; using PATCH /bots/:id to set desired_state directly instead of the lifecycle endpoints.
Action: use the dedicated lifecycle endpoints: /start, /pause, /resume, /stop. Valid transitions: stopped → running (start); running → paused (pause); paused → running (resume); running|paused → stopped (stop).
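
The valid transitions listed above fit in a small lookup. The sketch below is illustrative only: `BotState`, `Action` and `transition` are hypothetical names, and the real manager may model additional states.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BotState {
    Stopped,
    Running,
    Paused,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Start,
    Pause,
    Resume,
    Stop,
}

/// Returns the next state, or None when the action violates the state machine
/// (the case that would surface as MANAGER_INVALID_STATE_TRANSITION).
fn transition(state: BotState, action: Action) -> Option<BotState> {
    use Action::*;
    use BotState::*;
    match (state, action) {
        (Stopped, Start) => Some(Running),
        (Running, Pause) => Some(Paused),
        (Paused, Resume) => Some(Running),
        (Running, Stop) | (Paused, Stop) => Some(Stopped),
        _ => None,
    }
}
```

Starting an already-running bot or pausing a stopped one falls into the catch-all arm and yields `None`.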

MANAGER_BODY_TOO_LARGE

Severity: Error | Category: Control

Request body exceeds the 1 MiB limit.

Causes: oversized bot config payload; accidental binary data sent to a JSON endpoint.
Action: reduce payload size. Bot configs should be < 1 KB in practice — anything approaching 1 MiB is almost certainly a client bug.

MANAGER_INVALID_REQUEST

Severity: Error | Category: Control

Catch-all 4xx on the manager BFF read surface (/runs, /runs/:id, /runs/:id/roundtrips, /runs/:id/equity, /bots/:id/equity). Shape determined by HTTP status:

400 — query-param value rejected: kind not in live|paper|backtest; resolution not in 1m|1h|1d. Body carries a field pointer at the offending parameter.
404 — targeted resource (run by id, bot by id) does not exist. Body carries a field pointer at id.

Causes: console or API caller sent an unsupported enum value; UUID refers to a deleted or never-created row; race with a concurrent delete.
Action: caller fixes the input. Runs are never hard-deleted in normal operation, so a 404 on a UUID the caller held references usually indicates a stale cache on the caller side (console page that predates a cleanup, or a cursor that crossed a retention cutoff). Not pageable — dashboards should track sustained rate as a UX-health signal, not an infrastructure one.

Distinct from the typed invalid-* variants (MANAGER_INVALID_IMAGE_TAG, MANAGER_INVALID_CURSOR, MANAGER_INVALID_STATE_TRANSITION) which each have dedicated shape + remediation; those fire on write-path validation. This variant is for the read-path (BFF) where the input space is a small fixed enum or a UUID lookup.

MANAGER_INVALID_STRATEGY

Severity: Error | Category: Control

POST /runs body carries a strategy_name that is not registered in the StrategyRegistry. Returned as HTTP 400.

Causes: console sent a strategy name that does not match any entry in the server-side registry (typo, stale dropdown, or the server was redeployed with a different build that removed the strategy).
Action: caller should fetch GET /strategies to get the current registered list and surface it to the user. If the strategy name is correct, the manager binary may be out of date — check the deployed image tag.

Distinct from MANAGER_INVALID_REQUEST which is the catch-all for other 4xx validation failures. This variant is dedicated so the console can surface the registered strategy list in its error UI rather than a generic "bad request" message.

MANAGER_STARTUP_FAILED

Severity: Critical | Category: Infra

Manager service failed to start; process exits non-zero.

Causes: Config::from_env rejected an invalid GORDON_MANAGER_* env var; Tokio multi-thread runtime build failed; openapi export pure-render CLI errored; the HTTP serve loop exited with an error before reaching steady state.
Action: inspect the structured error attached to the event. Most often a missing or malformed env var (URL, port, duration) — fix and restart. If the runtime build failed, check container resource limits (nproc/memory).
Escalate: if restart loops > 3 within 5 minutes, pause the deploy and file a postmortem — a Critical startup failure must not be silently tolerated.

MANAGER_SHUTDOWN_ERROR

Severity: Warning | Category: Infra

Non-fatal shutdown-path error.

Causes: SIGTERM handler install failed (Unix only); ctrl_c listener errored; signal task join failed; reconciler drain-await returned an error.
Action: container replacement still proceeds — this is an observability signal, not a safety event. Check for repeated occurrences across restarts; a flapping shutdown-error pattern usually indicates a stuck reconciler task.

MANAGER_BOOT_DEGRADED

Severity: Warning | Category: Control

Manager booted in a degraded mode but is serving.

Causes: GORDON_MANAGER_RECONCILE_INTERVAL_MS set below the safety floor and clamped up; an optional dependency was missing at startup without failing the process.
Action: audit the config vs the floor printed in the log context. If the low value was deliberate (e.g. tuning), set it to the floor so the warning stops firing; if not, fix the config and restart.

MANAGER_DB_TRANSIENT

Severity: Warning | Category: Infra

Transient DB error on a stateless HTTP handler read path.

Causes: Postgres query timeout, pool exhaustion, connection reset; no application-logic fault on the handler side.
Action: returns 500 to the caller; client retries. Confirm Postgres health (pg_stat_activity, pg_stat_replication) if the rate is sustained.
Escalate: a steady rate > 1/s across handlers is a DB incident — escalate to DATA_DB_PROBE_FAILED / Postgres runbook.

MANAGER_RECONCILER_TICK_FAILED

Severity: Warning | Category: Control

Reconciler tick hit a self-healing error and skipped work; next tick retries.

Causes: load_live_configs failed (DB blip); list_bot_containers failed (docker-socket-proxy blip); file-SD write failed (tmp dir eviction); advisory-lock acquire failed (contention); record_success / record_failure / quarantine UPDATE failed (DB blip).
Action: single tick is self-healing — no operator action unless sustained. The noise-floor test asserts zero WARN on an idle reconciler; a flapping signal here means infra instability, not application logic.
Escalate: sustained tick-failure rate rolls up to BOT_QUARANTINED via on_reconcile_error after the per-bot failure threshold.

MANAGER_DEPLOY_STEP_FAILED

Severity: Warning | Category: Control

A single deploy-tick step failed; state machine retries next tick.

Causes: tick_one state-machine step errored for a specific deploy; stop/remove blue on complete_deploy failed (manager must not block completion); stop/remove green on abort_deploy failed.
Action: individual container cleanup fallout is expected to be resolved by the reconciler on the next pass. If a blue container is stuck after a complete, operator can docker rm -f it manually.
Escalate: repeated step failures on the same deploy_id — inspect trading.bot_deploys for the row and review the green/blue state.

MANAGER_DEPLOY_INITIATION_FAILED

Severity: Error | Category: Control

Deploy initiation (kickoff) failed — green/blue flow never started.

Causes: manager could not acquire the shadow lease; bot_deploys insert failed (duplicate in-flight row, FK violation); docker-socket-proxy rejected the green spawn.
Action: for reconciler-initiated kickoffs (auto_deploy=true), on_reconcile_error records + re-attempts. For operator-initiated kickoffs (POST /bots/:id/promote), the HTTP caller sees 500 and can retry after inspecting the structured error.
Escalate: if auto_deploy reconciler-triggered kickoffs fail repeatedly, the bot drifts into quarantine — clear quarantine and investigate the underlying cause.

MANAGER_IPC_PUBLISH_FAILED

Severity: Warning | Category: Infra

Fire-and-forget IPC publish failed; DB row is authoritative.

Causes: ipc_notify_trigger publisher errored on a best-effort path (BotCommand, reconciler event, deploy event, quarantine-cleared, manual deploy-requested).
Action: none — the DB write already committed, and the reconciler / next listener tick converges state regardless. The loss is an audit row in trading.bot_events.
Escalate: sustained publish failure means the notify channel or the publisher is stuck — inspect SELECT * FROM pg_listening_channels() on the listener side.

MANAGER_INTERNAL_ERROR

Severity: Error | Category: Infra

Internal invariant violation; should never fire in practice.

Causes: OpenAPI spec render failed (serde_json::to_string(&*doc) errored on a malformed utoipa::openapi::OpenApi); stdout().write_all failed on backtest summary (closed stdout); role-probe excess-privilege (the startup probe got past the privilege check — fail-safe treats this as internal error).
Action: inspect the structured error. OpenAPI render failures indicate a schema regression — regenerate + retest the spec. Role-probe excess-privilege indicates a privilege-drift on gordon_manager — audit the role grants.
Escalate: file a postmortem. None of these should fire — when one does, it points at a latent bug or a privilege drift.

MANAGER_BACKTEST_FAILED

Severity: Error | Category: Control

Backtest subcommand failed; process exits ExitCode::FAILURE.

Causes: DB pool refused connection (GORDON_DATABASE_URL wrong or DB down); engine returned BacktestError (unknown strategy, invalid params, kline read failed); run row insert failed at sqlx layer.
Action: inspect the structured error for the surface. DB-down → start Postgres. Strategy/param error → fix the CLI invocation. Kline read failed → run make seed to populate market_data.spot_klines for the window.

MANAGER_BACKTEST_ABORTED

Severity: Warning | Category: Control

Backtest aborted cleanly — not a failure of the engine.

Causes: the requested window had no klines in the configured symbol/timeframe; operator hit Ctrl+C (SIGINT) before the engine completed.
Action: for "no klines", verify market_data.spot_klines coverage with a SELECT MIN(ts), MAX(ts) FROM market_data.spot_klines WHERE symbol=... AND timeframe=... query. For SIGINT, no action is needed; the trading.runs row is left with completed_at IS NULL so operators can see it was aborted mid-flight.

MANAGER_UPSTREAM_UNAVAILABLE

Severity: Warning | Category: Infra

BFF pass-through to an upstream service failed.

Causes: gordon-data unreachable (reqwest::Error, connect refused); gordon-data returned non-2xx on /sources/health; response body failed JSON parsing.
Action: manager returns 502 (parse / non-2xx) or 503 (unreachable) to the caller; client retries. Inspect gordon-data's own logs + /healthz to confirm the upstream state.
Escalate: sustained failures on /data/status indicate gordon-data is down or stuck — escalate to DATA_* runbook for that service.
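
The 502 versus 503 split described in the Action line can be summarised as a mapping from failure kind to status code. The enum and function below are hypothetical and only restate that rule.

```rust
/// Ways the pass-through can fail, as listed under Causes.
#[derive(Debug)]
enum UpstreamFailure {
    /// Connection refused or other transport-level error.
    Unreachable,
    /// Upstream answered, but with a non-2xx status.
    NonSuccess(u16),
    /// Upstream answered 2xx, but the body was not valid JSON.
    BodyParse,
}

/// Status the BFF returns to its own caller.
fn bff_status(failure: &UpstreamFailure) -> u16 {
    match failure {
        UpstreamFailure::Unreachable => 503,
        UpstreamFailure::NonSuccess(_) | UpstreamFailure::BodyParse => 502,
    }
}
```

A 503 therefore points at reachability of gordon-data, while a 502 means gordon-data answered but the answer was unusable.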

MANAGER_SOURCE_HEALTH_SUBSCRIBER_START_FAILED

Severity: Warning | Category: Infra

The data_events subscriber in gordon-manager failed to start. Source-health state will not be updated until the subscriber recovers (process restart or reconnect). The GET /source-health endpoint returns stale / empty state in this degraded mode.

Causes: Postgres connectivity blip at startup; PgListener::connect returned a sqlx::Error; advisory-channel registration failed.
Action: confirm Postgres health via the manager /healthz probe; the subscriber's outer supervisor restarts the task on the next reconcile tick. No operator intervention required for transient failures.
Escalate: if /source-health stays empty across multiple manager restarts, the data_events channel name or the trading.data_events table is misconfigured — inspect the migration history + the channel-name constant in gordon-manager.

MANAGER_SOURCE_HEALTH_SUBSCRIBER_COMMIT_FAILED

Severity: Warning | Category: Infra

commit_offset failed for a data_events row. The row will replay on the next reconnect — idempotent fold is safe (state is advance-only).

Causes: Postgres connectivity blip during commit; offset-table write hit a pool blip; transaction was aborted by a concurrent operation.
Action: self-healing — the next NOTIFY tick replays the row and the fold re-applies idempotently. Not pageable in isolation.
Escalate: sustained miss rate (> 1% over 10 min) indicates a persistent commit path bug — inspect the manager data_events consumer logs for the underlying sqlx::Error shape.

MANAGER_SOURCE_HEALTH_EVENT_INVALID

Severity: Warning | Category: Infra

A data_events envelope payload could not be decoded as DataEvent. The row is marked consumed (schema-tolerance path) and skipped.

Causes: gordon-data emitted a DataEvent variant unknown to gordon-manager (version skew between services); a malformed row was inserted into trading.data_events by a non-canonical writer.
Action: check gordon-data's version against manager — schema-tolerance is intentional so a newer producer never breaks an older consumer. The skipped row is a missed source-health update, not a correctness issue.
Escalate: if multiple rows are skipped in succession, gordon-data is emitting a variant manager doesn't recognise yet — coordinate the version bump.

SHARED codes

SHARED_DB_CONSTRAINT_VIOLATION

Severity: Error | Category: Infra

A database write violated a uniqueness or foreign-key constraint.

Causes: duplicate insert on a uniqueness constraint (usually idempotency bug); foreign-key violation (referencing a deleted parent row); stale in-memory state diverged from DB.
Action: check the structured error for the table and constraint name fields. For duplicate-key violations, verify the caller is correctly checking for existing rows before insert. For FK violations, verify the parent row exists.
Escalate: if this fires at high frequency from the same service, there is a systematic idempotency gap — file a bug and review the write path.

SHARED_STRATEGY_CONFIG_PARSE_FAILED

Severity: Warning | Category: Infra

Overlay config on bot_configs.strategy_params.overlay failed to deserialize into gordon_strategy::overlays::OverlayConfig. Emitter: extract_overlay_config helper used by gordon-bot's strategy loop (r-02a.1) and gordon-manager's backtest runner (r-02a.2).

Causes: operator-edited JSON with wrong shape (typo'd field name, bad type, unexpected nesting); migration drift if OverlayConfig gains a required field with no #[serde(default)]; manual DB edit bypassing BFF validation.
Behavior: overlays fail open — the helper returns OverlayConfig::default() (all overlays disabled). Bot/backtest continues without the overlay veto layer; strategy emits intents as if overlays were off.
Action: query SELECT id, strategy_params->'overlay' FROM trading.bot_configs WHERE id = <bot_id>. Validate the JSON against the OverlayConfig struct (see gordon-strategy/src/overlays/mod.rs). Fix via manager BFF PATCH /bots/:id with a valid overlay config. No restart required — the next candle tick re-extracts.
Escalate: if this fires across multiple bots simultaneously after a gordon-strategy release, OverlayConfig shape likely changed — check the release notes + add #[serde(default)] to any new field that should be backward-compatible.

BUS codes

Emitted by the leader-elected outbox drain in gordon-bus::nats::outbox_publisher. Added at DP-06 (backbone-audit 2026-05-16) so every drain-loop warn-level log line carries a stable code + clickable URL — operators chasing a 3 AM drain stall do not have to read source.

BUS_OUTBOX_ADVISORY_LOCK_RELEASE_FAILED

Severity: Warning | Category: Infra

The leader-elected outbox drain failed to explicitly release the Postgres advisory lock (OUTBOX_PUBLISHER_LOCK_ID = 0x0B05_0010_2026_0508) on its way out (cancel, error, or graceful exit).

Causes: pg connection dropped before pg_advisory_unlock could run; transient pg error on the release statement; pool-side bug holding the connection beyond the function scope.
Behavior: the lock is also released automatically when the holder's pg connection closes (session-scoped semantics), so this is a degraded-cleanup warning, not a stuck-leader bug. Another instance will pick up the lock after LOCK_RETRY_INTERVAL (30 s) at worst.
Action: monitor frequency. Single occurrences during pod cycling are expected. A sustained pattern (more than one per pod-restart) indicates a pool connection lifetime issue — review PgPool config and any code path that may be holding PoolConnection references.
Escalate: if observed during steady-state (no deploy, no cancel), file a bug — the lock-release path is supposed to be infallible on a healthy connection.

BUS_OUTBOX_DRAIN_LOOP_EXITED

Severity: Warning | Category: Infra

The outbox drain loop returned an error (sustained NATS failure beyond FAILURE_BUDGET = 5 min, pg query failure, listener fatal). Lock is released; another instance has the chance to take over after LOCK_RETRY_INTERVAL.

Causes: NATS broker unreachable beyond the 5-minute budget; Postgres query failure (timeout, pool exhaustion); listener channel hard error.
Behavior: messages remain in bus.outbox with published_to_nats = FALSE. Another drain instance picks them up on next leader acquisition.
Action: check gordon-bus consumer-lag and outbox-backlog gauges. Verify broker reachability (async_nats::client.events()). Confirm another drain instance is running and has acquired the lock (Loki: code=BUS_OUTBOX_DRAIN_LOOP_EXITED + matching acquired advisory lock info on a different host).
Escalate: if no other drain instance picks up the lock within ~5 minutes, every producer's INSERT into bus.outbox accumulates without forward delivery — declare a partial outage of every downstream NATS consumer.

BUS_OUTBOX_LISTENER_RECV_ERROR

Severity: Warning | Category: Infra

The bus_outbox_appended LISTEN channel recv() returned an error.

Causes: transient pg socket flap (connection blip, network partition, pg restart); PgListener internal reconnect machinery surfaced an in-progress reconnect as a recv error.
Behavior: sqlx::postgres::PgListener auto-reconnects internally. The drain loop treats the error as a wakeup and re-polls — no message is lost (re-poll picks up any rows that arrived during the gap).
Action: monitor frequency. Single occurrences during pg cycling are expected. Persistent occurrences indicate flaky pg connectivity — check pg logs and network metrics.
Escalate: if the rate stays above ~1/min for more than 10 minutes, treat as a pg connectivity incident — drain throughput degrades to the IDLE_POLL_INTERVAL (1 s) fallback.

BUS_OUTBOX_NATS_PUBLISH_FAILED

Severity: Warning | Category: Infra

A single outbox row failed to publish to NATS. The drain loop applies exponential backoff (1 s → 30 s cap) and re-attempts the same row on the next pass.

Causes: NATS broker not reachable (network blip, broker restart); JetStream stream not configured for the subject; broker-side rate limit or quota; oversized payload (caught upstream by the bus_outbox_payload_size_cap CHECK, but a misconfigured broker stream limit could also reject).
Behavior: the row stays in bus.outbox with published_to_nats = FALSE. Backoff applies until either the publish succeeds or FAILURE_BUDGET (5 min) elapses — at which point the drain exits with BUS_OUTBOX_DRAIN_LOOP_EXITED.
Action: check the error context for the underlying NATS error. Verify broker reachability and JetStream stream config for the failing subject. For rate-limited rejections, scale broker limits or shed producer load.
Escalate: if every retry attempt fails for the same row across multiple drain instances, the row is structurally undeliverable — file a bug; the producer likely emitted an invalid subject or oversized payload that slipped past the INSERT-time CHECK.
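
To give a feel for how the backoff and the failure budget interact, here is a sketch that assumes the delay doubles on each attempt. The doubling factor is an assumption, since the text above only states the 1 s start and the 30 s cap; the constant and function names are hypothetical apart from `FAILURE_BUDGET`.

```rust
use std::time::Duration;

const BACKOFF_START: Duration = Duration::from_secs(1);
const BACKOFF_CAP: Duration = Duration::from_secs(30);
const FAILURE_BUDGET: Duration = Duration::from_secs(5 * 60);

/// Delay before retry number `attempt` (0-based): 1 s, 2 s, 4 s, ... capped at 30 s.
/// Doubling is assumed here, not confirmed by the source.
fn backoff_delay(attempt: u32) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    BACKOFF_START.saturating_mul(factor).min(BACKOFF_CAP)
}

/// How many backoff delays fit inside the failure budget if every
/// publish attempt fails immediately.
fn attempts_within_budget() -> u32 {
    let mut elapsed = Duration::ZERO;
    let mut attempt = 0;
    while elapsed + backoff_delay(attempt) <= FAILURE_BUDGET {
        elapsed += backoff_delay(attempt);
        attempt += 1;
    }
    attempt
}
```

Under these assumptions the cap is reached on the sixth attempt, after which retries are spaced 30 s apart until the budget is exhausted.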

STRATEGY codes

Library-only warnings emitted from gordon-strategy math helpers — called from both gordon-bot (live) and gordon-manager (backtest). Added at the DP-06 raw-tracing cleanup follow-up (2026-05-17). Strategy warnings surface input-shape misconfigurations rather than runtime failures; the math returns a "metric unavailable" sentinel and the caller proceeds.

STRATEGY_DEFLATED_SHARPE_NUMERICAL_ISSUE

Severity: Warning | Category: Data

The deflated-Sharpe / PSR routine refused to compute because the sample size fell below MIN_N_OBS.

Cause: caller passed n_obs < MIN_N_OBS. The PSR formula is numerically unstable below this bound (skew / kurtosis sample estimators are too noisy).
Behavior: function returns None; caller treats as "metric unavailable" and either skips it in the summary or surfaces a downstream "insufficient sample" message.
Action: verify the research pipeline is feeding a window long enough to clear the minimum. Walk-forward configs should size their evaluation slices accordingly. If the bound itself is wrong for a new use case, file a story rather than silently lowering MIN_N_OBS — the numerical-stability constraint is the reason it exists.
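
The guard can be illustrated with the z-statistic of the probabilistic Sharpe ratio. This is a sketch and not the gordon-strategy implementation: `psr_z` is a hypothetical helper, the `MIN_N_OBS` value of 30 is a placeholder, and the final normal-CDF step is omitted.

```rust
/// Placeholder value; the real MIN_N_OBS is defined in gordon-strategy.
const MIN_N_OBS: usize = 30;

/// z-statistic of the probabilistic Sharpe ratio, before the normal CDF:
/// (SR - SR*) * sqrt(n - 1) / sqrt(1 - skew * SR + (kurt - 1) / 4 * SR^2).
/// `kurtosis` is raw kurtosis (3.0 for a normal distribution).
fn psr_z(sharpe: f64, benchmark: f64, skew: f64, kurtosis: f64, n_obs: usize) -> Option<f64> {
    if n_obs < MIN_N_OBS {
        // Too few observations: report "metric unavailable".
        return None;
    }
    let variance_term = 1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe * sharpe;
    if !variance_term.is_finite() || variance_term <= 0.0 {
        return None;
    }
    Some((sharpe - benchmark) * ((n_obs - 1) as f64).sqrt() / variance_term.sqrt())
}
```

Callers match on the `Option` and skip the metric on `None` rather than substituting a default value.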

STRATEGY_BACKTEST_NON_FINITE_CANDLES_DROPPED

Severity: Warning | Category: Data

BacktestExecution::new dropped one or more input candles whose OHLC contained a non-finite value (NaN / ±Inf).

Cause: upstream data hygiene miss — a malformed candle propagated through the warmup or backfill path into the backtest input set. Without the filter, Decimal::from_f64_retain(NaN).unwrap_or_default() collapses to zero, which trivially satisfies candle_low <= stop_price and triggers a spurious SL fill.
Behavior: the backtest proceeds with the cleaned (finite-only) set. Total-candle count drops by the reported number; downstream fill / equity math sees a contiguous-by-time gap, not a NaN injection.
Action: trace the symbol back to its source (gordon-data ingest) and find the upstream gap or invalid frame. The warning's structured fields (dropped, original_len, symbol) localise the issue. Fix the ingest path, not the backtest filter.
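
A minimal sketch of the finite-only filter, assuming a simplified `Candle` with f64 OHLC fields. The struct and `drop_non_finite` are hypothetical; the real BacktestExecution::new works on the project's own candle type.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Candle {
    open: f64,
    high: f64,
    low: f64,
    close: f64,
}

impl Candle {
    /// True when none of the OHLC values is NaN or infinite.
    fn is_finite(&self) -> bool {
        [self.open, self.high, self.low, self.close]
            .iter()
            .all(|v| v.is_finite())
    }
}

/// Returns the finite-only candles plus the number that were dropped.
fn drop_non_finite(candles: Vec<Candle>) -> (Vec<Candle>, usize) {
    let original_len = candles.len();
    let kept: Vec<Candle> = candles.into_iter().filter(|c| c.is_finite()).collect();
    let dropped = original_len - kept.len();
    (kept, dropped)
}
```

A non-zero `dropped` count is what the warning reports alongside `original_len` and `symbol`.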

DP-06 follow-up codes (raw-tracing cleanup)

Codes added at the DP-06 follow-up story (plan/active/workspace/raw-tracing-cleanup.md, 2026-05-17) for the 16 pre-existing tracing::warn! sites the original DP-06 story scope deferred. Each variant maps one or more raw-warn call sites to a stable code + clickable URL — operators chasing a degraded surface no longer have to grep source.

DATA_SYMBOL_SUBSCRIPTION_FALLBACK

Severity: Warning | Category: Data

gordon-data symbol-subscription loader found trading.symbol_subscriptions empty at startup and fell back to env-var defaults.

Cause: migration 0022 (which seeds trading.symbol_subscriptions) was not applied, or the seed rows were manually deleted, or this is the no-DB-testability path running against a fresh schema.
Behavior: ingest continues against the fallback set; the persisted enabled = TRUE subscriptions are not honoured for this boot.
Action: verify migration 0022 was applied (SELECT count(*) FROM trading.symbol_subscriptions WHERE enabled = TRUE). If zero, reseed via the migration or gordon-data admin tooling. If this fires on a healthy production stack, the row set has drifted from operator intent.

DATA_BINANCE_TAIL_FALLBACK_FAILED

Severity: Warning | Category: Data

gordon-data /klines handler's Binance tail-fill helper (tail_fill_from_binance) failed to top up an under-filled response window.

Cause: Binance REST unavailable (network blip, upstream rate limit, transient 5xx), a per-symbol gap upstream, or the helper's window math hit an edge case.
Behavior: handler returns whatever DB rows were available (no top-up). The strict-mode warmup gate on the calling bot may reject; lenient consumers see a smaller-than-requested data window.
Action: correlate with Binance status and the per-symbol ingest health in market_data.spot_klines. If sustained for a single symbol, check the upstream ingest source. If sustained across symbols, suspect Binance-side outage or rate-limit budget exhaustion.

EXECUTOR_TEST_REGRESSION_APPLIED

Severity: Warning | Category: Control

gordon-executor applied a test-only intent regression (GORDON_EXECUTOR_TEST_REGRESSION=invert_side) that mutated an inbound intent's side after structural validation.

Cause: the env var is set AND the executor was built with cfg(test) or the test-regressions feature. Production builds do not compile the regression hook at all — this warning cannot fire in a prod container.
Behavior: the inbound intent's side is flipped (Buy ↔ Sell) and submission proceeds with the flipped value. Loud-on-fire so any test fixture leakage into a production-shaped log stream is immediately auditable.
Action: if observed in production logs, the build configuration is wrong — verify the executor was not shipped with the test-regressions feature enabled. Otherwise, no action: this is the e2e harness exercising the regression path.
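
The compile-time gating described above might look roughly like this. The names `Side`, `flipped`, and `apply_test_regression` are hypothetical; only the `cfg(test)` / `test-regressions` condition and the `invert_side` value come from this section. The identity stub for production builds is a choice made for this sketch so that callers still compile; the real executor may omit the call site entirely.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Side {
    Buy,
    Sell,
}

impl Side {
    /// Buy becomes Sell and Sell becomes Buy.
    pub fn flipped(self) -> Side {
        match self {
            Side::Buy => Side::Sell,
            Side::Sell => Side::Buy,
        }
    }
}

/// Compiled only under cfg(test) or the test-regressions feature.
/// `env_value` stands in for the value of GORDON_EXECUTOR_TEST_REGRESSION.
#[cfg(any(test, feature = "test-regressions"))]
pub fn apply_test_regression(side: Side, env_value: Option<&str>) -> Side {
    match env_value {
        Some("invert_side") => {
            // Loud on fire, so leakage into a production-shaped log is auditable.
            eprintln!("WARN EXECUTOR_TEST_REGRESSION_APPLIED original_side={:?}", side);
            side.flipped()
        }
        _ => side,
    }
}

/// Production builds: the hook body above is not compiled at all, so the
/// env var has no effect regardless of its value.
#[cfg(not(any(test, feature = "test-regressions")))]
pub fn apply_test_regression(side: Side, _env_value: Option<&str>) -> Side {
    side
}
```

Because the mutation lives behind `cfg`, seeing this code in a production log points at the build configuration rather than at the environment.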

MANAGER_STACK_HEALTH_UPSERT_FAILED

Severity: Warning | Category: Infra

gordon-manager stack-health aggregator failed to upsert a peer's status row into service_peers after a successful probe tick.

Cause: transient DB error (pool blip, lock contention, timeout). The probe itself succeeded — the warning is on the persistence side.
Behavior: the peer's last_seen_at lags by one tick. The next tick retries the upsert; the /healthz projection observes the older row in the meantime.
Action: monitor sustained miss rate. Single occurrences during DB cycling are expected. Persistent occurrences indicate pool exhaustion or a role-grant drift on gordon_manager — review the pool sizing and migration 0044.
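
One way to separate "single occurrence during DB cycling" from "persistent" is to count consecutive failed upserts per peer. The sketch below is an operator-side illustration only: `UpsertMissTracker`, `UpsertHealth`, and the threshold value are assumptions, not part of gordon-manager.

```rust
/// Classification of the most recent upsert attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum UpsertHealth {
    Ok,
    Transient,
    Sustained,
}

/// Tracks consecutive upsert failures across probe ticks.
pub struct UpsertMissTracker {
    consecutive: u32,
    sustained_after: u32,
}

impl UpsertMissTracker {
    /// `sustained_after` is the number of back-to-back misses treated as
    /// persistent; the right value is deployment-specific.
    pub fn new(sustained_after: u32) -> Self {
        Self { consecutive: 0, sustained_after }
    }

    /// Record one tick's upsert result and classify the current streak.
    pub fn record(&mut self, upsert_ok: bool) -> UpsertHealth {
        if upsert_ok {
            // The next successful tick clears the streak.
            self.consecutive = 0;
            return UpsertHealth::Ok;
        }
        self.consecutive += 1;
        if self.consecutive >= self.sustained_after {
            UpsertHealth::Sustained
        } else {
            UpsertHealth::Transient
        }
    }
}
```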

MANAGER_SERVICE_DEPLOY_NATS_CONNECT_FAILED

Severity: Warning | Category: Infra

gordon-manager service-deploy swap-wiring boot probe could not connect to NATS.

Cause: GORDON_BUS_NATS_URL points at an unreachable broker (network partition, broker not yet up, wrong URL). Distinct from a hot-path NATS failure — this fires once during boot.
Behavior: the swap-wiring is disabled for this boot (SwapPending arms log a warning and then time out). The rest of the manager continues to serve. No green/blue deploy will complete its handshake until the manager is restarted with NATS reachable.
Action: verify broker reachability and GORDON_BUS_NATS_URL correctness. After NATS is up, restart gordon-manager so the swap-wiring is re-attempted at boot.

MANAGER_SERVICE_DEPLOY_SWAP_CONSUMER_SPAWN_FAILED

Severity: Warning | Category: Infra

gordon-manager service-deploy swap-event consumer failed to spawn during boot.

Cause: JetStream consumer creation rejected (stream missing, durable name conflict, broker-side config drift), or the tokio task spawn itself failed.
Behavior: the swap-wiring is partially disabled (publisher + router built but no inbound consumer). The rest of the manager continues to serve. Restart fixes once the underlying cause is resolved.
Action: inspect the error context for the underlying spawn cause. Verify the JetStream stream + durable consumer config; check homelab/ Ansible playbook for any recent NATS topology change.

MANAGER_EXCHANGE_PING_FAILED

Severity: Warning | Category: Infra

gordon-manager /bff/exchange-ping Binance probe HTTP request failed.

Cause: Binance unreachable (network blip, DNS, upstream 5xx, manager-side egress firewall).
Behavior: the handler returns the most recent cached latency (stale) rather than failing the request — the console's status indicator stays live through transient outages. Cached entries TTL out after their configured window.
Action: correlate with Binance status. Sustained failures (cache cold + Binance reachable from other services) indicate the manager-side egress is broken — investigate the manager's outbound HTTP path.
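
The stale-on-failure behavior with a TTL can be sketched as below. `CachedLatency`, `PingResponse`, and `exchange_ping` are hypothetical names. The `Unavailable` result for a cold or expired cache is an assumption of this sketch, since this section does not say what the handler returns in that case.

```rust
use std::time::{Duration, Instant};

/// Last successful probe result (hypothetical type).
pub struct CachedLatency {
    pub latency_ms: u64,
    pub stored_at: Instant,
}

#[derive(Debug, PartialEq)]
pub enum PingResponse {
    Fresh(u64),
    Stale(u64),
    Unavailable,
}

/// Serve a fresh latency when the probe succeeds; otherwise serve the cached
/// value while it is still inside its TTL.
pub fn exchange_ping(
    probe: Result<u64, String>,
    cache: &mut Option<CachedLatency>,
    ttl: Duration,
    now: Instant,
) -> PingResponse {
    match probe {
        Ok(ms) => {
            *cache = Some(CachedLatency { latency_ms: ms, stored_at: now });
            PingResponse::Fresh(ms)
        }
        // Probe failed: this is where MANAGER_EXCHANGE_PING_FAILED would be logged.
        Err(_) => {
            let still_valid = cache
                .as_ref()
                .filter(|entry| now.duration_since(entry.stored_at) <= ttl)
                .map(|entry| entry.latency_ms);
            match still_valid {
                Some(ms) => PingResponse::Stale(ms),
                None => {
                    // Entry expired or never existed: nothing left to serve.
                    *cache = None;
                    PingResponse::Unavailable
                }
            }
        }
    }
}
```

Passing `now` in explicitly keeps the TTL logic deterministic, which makes the expiry path easy to exercise without sleeping.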

MANAGER_SYMBOLS_UPSTREAM_FAILED

Severity: Warning | Category: Infra

gordon-manager /bff/symbols/available upstream call to Binance /fapi/v1/exchangeInfo failed.

Cause: HTTP error from Binance, non-2xx response, or response body failed to deserialise as BinanceExchangeInfo.
Behavior: the handler returns the cached snapshot (stale) rather than failing the request, so the console keeps working through transient Binance outages. If no cache entry exists, a 503 propagates.
Action: correlate with Binance status. Sustained failures indicate either Binance API schema drift (response body parse failures) or sustained Binance outage — escalate to a manual exchangeInfo refresh or reseed of the cached snapshot.
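
A minimal sketch of the degradation path above: serve the cached snapshot on any upstream failure, and let a 503 propagate only when there is no cache entry. `SymbolSnapshot`, `UpstreamError`, and `symbols_available` are hypothetical stand-ins; the real handler works with BinanceExchangeInfo and its own error type.

```rust
/// Simplified stand-in for the cached exchangeInfo projection.
#[derive(Debug, Clone, PartialEq)]
pub struct SymbolSnapshot {
    pub symbols: Vec<String>,
}

/// The three failure classes named in the Cause line above.
#[derive(Debug, PartialEq)]
pub enum UpstreamError {
    Http(String),
    Status(u16),
    Deserialize(String),
}

/// Returns the snapshot to serve, or the HTTP status to propagate.
pub fn symbols_available(
    upstream: Result<SymbolSnapshot, UpstreamError>,
    cache: &mut Option<SymbolSnapshot>,
) -> Result<SymbolSnapshot, u16> {
    match upstream {
        Ok(fresh) => {
            *cache = Some(fresh.clone());
            Ok(fresh)
        }
        Err(err) => {
            eprintln!("WARN MANAGER_SYMBOLS_UPSTREAM_FAILED error={:?}", err);
            // Stale snapshot if one exists, otherwise 503.
            cache.clone().ok_or(503)
        }
    }
}
```

Keeping the error variant in the log line helps tell schema drift (`Deserialize`) apart from an outage (`Http` or `Status`) when the failures are sustained.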

MANAGER_REPLAY_FILTER_INVALID

Severity: Warning | Category: Control

gordon-manager WS replay handler received a SubscribeFilter carrying a bot_id or run_id value that could not be parsed as UUID v7.

Cause: client / fixture bug — a hostile string slipped through upstream input validation, or a stale console build is sending a non-UUID identifier. Fires from runs / roundtrips / equity-points / overlay-decisions replayers.
Behavior: the filter is ignored; replay falls back to an unfiltered query (the safe degradation path). The client receives more rows than requested but no error.
Action: trace the client value in the structured bot_id / run_id field. Fix the console build or test fixture emitting the bad value. Server-side, this is informational — no manager-side fix is appropriate.
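
To illustrate the validation and the safe degradation path, here is a std-only sketch. `is_uuid_v7` and `effective_filter` are hypothetical helpers; the real handler presumably uses a UUID library. The check follows the standard UUID text layout: hyphenated 8-4-4-4-12 hex, version nibble 7, and variant nibble 8, 9, a, or b.

```rust
/// True when `s` is a hyphenated UUID whose version nibble is 7.
pub fn is_uuid_v7(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() != 36 {
        return false;
    }
    for (i, &c) in bytes.iter().enumerate() {
        let ok = match i {
            8 | 13 | 18 | 23 => c == b'-',
            // First character of the third group carries the version.
            14 => c == b'7',
            // First character of the fourth group carries the variant.
            19 => matches!(c, b'8' | b'9' | b'a' | b'b' | b'A' | b'B'),
            _ => c.is_ascii_hexdigit(),
        };
        if !ok {
            return false;
        }
    }
    true
}

/// Keep a valid filter value; drop an invalid one so replay runs unfiltered.
pub fn effective_filter(raw: Option<&str>) -> Option<String> {
    match raw {
        Some(v) if is_uuid_v7(v) => Some(v.to_ascii_lowercase()),
        Some(v) => {
            eprintln!("WARN MANAGER_REPLAY_FILTER_INVALID bot_id={:?}", v);
            None
        }
        None => None,
    }
}
```

Note that a well-formed UUID of another version (for example v4) is dropped as well, which is one way a stale console build could trigger this warning.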

BOT_LEASE_GUARD_DROPPED_WITHOUT_RELEASE

Severity: Warning | Category: Control

gordon-bot LeaseGuard was dropped without an explicit release() call.

Cause: a code path skipped the release call — typically a ?-propagated error before the guard was explicitly released, or a test fixture forcing an abrupt drop. Not a runtime defect when intentional.
Behavior: the Postgres connection close releases the advisory lock server-side a millisecond later (auto-release semantics). The bot_leases row's holder metadata stays stale until the next acquire overwrites it.
Action: if this fires in production, find the dropped path and add explicit release() so server-side bot_leases holder metadata stays accurate. If fixture-only, ignore.
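
The typical cause, a `?`-propagated error that returns before the explicit release, can be reproduced with a small Drop-based sketch. This `LeaseGuard` is a simplified stand-in and not the gordon-bot type: it holds no Postgres connection, and the shared counter exists only so the warning path can be observed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Simplified guard: warns on drop unless release() was called first.
pub struct LeaseGuard {
    released: bool,
    drop_warnings: Arc<AtomicUsize>,
}

impl LeaseGuard {
    pub fn acquire(drop_warnings: Arc<AtomicUsize>) -> Self {
        Self { released: false, drop_warnings }
    }

    /// Explicit release: marks the guard, then drops it silently.
    pub fn release(mut self) {
        self.released = true;
    }
}

impl Drop for LeaseGuard {
    fn drop(&mut self) {
        if !self.released {
            self.drop_warnings.fetch_add(1, Ordering::SeqCst);
            eprintln!("WARN BOT_LEASE_GUARD_DROPPED_WITHOUT_RELEASE");
        }
    }
}

/// A code path with the bug described above: if `step` fails, `?` returns
/// early and the guard is dropped before release() is reached.
pub fn run_step(guard: LeaseGuard, step: Result<(), String>) -> Result<(), String> {
    step?;
    guard.release();
    Ok(())
}
```

The fix for a path like `run_step` is to release on both branches, for example by matching on `step`, calling `guard.release()`, and only then returning the result.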
