Gordon v7 error codes — operator remediation guide
Per-code sections for every ErrorCode variant in gordon-kernel/src/errors/codes.rs. These are the anchor targets linked from remediation_url fields in structured log lines and from the e2e runbook.
Developer reference (when-it-fires, category, severity): gordon-kernel/src/errors/codes.rs.
EXECUTOR codes
EXECUTOR_UNAUTHORIZED
Severity: Warning | Category: Safety
X-Operator-Token header was absent, empty, or incorrect on a protected endpoint (POST /clear-quarantine, POST /flatten).
- Causes: token not set in env (GORDON_EXECUTOR_OPERATOR_TOKEN); misconfigured client; token rotated without redeploying the caller.
- Action: verify GORDON_EXECUTOR_OPERATOR_TOKEN in docker-compose.yml; if unset at startup the service returns 503 on all protected routes.
- Escalate: if happening in production and token is correct, suspect replay or misconfigured proxy stripping headers.
EXECUTOR_CAP_REJECT_PER_ORDER
Severity: Error | Category: Safety
Intent notional exceeds max_notional_per_order. Order was not submitted.
- Causes: strategy sizing math produced an oversized intent; cap configured too low for the current account size.
- Action: first, check GORDON_EXECUTOR_MAX_NOTIONAL_USD_PER_ORDER in compose env. If sizing is correct, raise the cap. If the intent is genuinely oversized, fix the strategy's position-sizing formula.
- Escalate: repeated rejects from the same bot indicate a sizing bug; quarantine the bot and review the strategy's volatility-target calculation.
EXECUTOR_CAP_REJECT_PER_BOT_DAILY
Severity: Error | Category: Safety
Intent would push this bot's UTC-day rolling notional past GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_PER_BOT. The cap resets at UTC midnight.
Reinstated 2026-05-17 (DP-01 reactivation). Optional-first design: when the env var is unset, the DailyNotionalGuard runs in warn-only mode — the gordon_executor_daily_notional_would_reject_total{scope="per_bot"} counter increments but no reject fires. When the env var is set, the invariant returns RejectionReason::DailyNotionalExceeded { scope: PerBot, .. } which stamps this code on the rejected intent row.
- Causes: strategy fired too many entries this UTC day (genuine sizing drift vs original budget); per-bot ceiling configured too low for normal operating cadence; reconciler under-counted on restart.
- Action: check gordon_executor_daily_notional_used_usd{bot_id=...} in Grafana for the bot's current consumption vs ceiling. Verify the day rollover happened correctly (UTC, not local time). If sizing is correct, raise the cap via redeploy with a higher GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_PER_BOT. If sizing is wrong, quarantine the bot and audit strategy params.
- Escalate: if rejects fire within the first hour of UTC day rollover, the per-bot ceiling is structurally wrong: either too small for normal cadence, or the cap-reconcile-on-restart logic mis-counted carried-over fills.
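The warn-only / enforce split and the UTC-midnight reset described above can be sketched as follows. This is a minimal illustration, not the executor's DailyNotionalGuard: the struct, its fields, and the default_cap_usd reference ceiling used for warn-only counting are all assumptions made for the example.

```rust
const SECS_PER_DAY: u64 = 86_400;

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    /// Cap env var unset: only the would-reject counter is bumped.
    WouldReject,
    /// Cap env var set and exceeded: the intent is rejected.
    Reject,
}

/// Hypothetical per-bot guard; not the real DailyNotionalGuard.
struct DailyGuard {
    enforced_cap_usd: Option<f64>, // None models "env var unset" (warn-only)
    default_cap_usd: f64,          // assumption: reference ceiling for warn-only counting
    utc_day: u64,                  // whole days since the Unix epoch (UTC)
    used_usd: f64,
    would_reject_total: u64,
}

impl DailyGuard {
    fn new(enforced_cap_usd: Option<f64>, default_cap_usd: f64) -> Self {
        Self { enforced_cap_usd, default_cap_usd, utc_day: 0, used_usd: 0.0, would_reject_total: 0 }
    }

    fn check(&mut self, now_unix_secs: u64, intent_usd: f64) -> Decision {
        // Unix time divided by 86 400 changes exactly at UTC midnight.
        let day = now_unix_secs / SECS_PER_DAY;
        if day != self.utc_day {
            self.utc_day = day;
            self.used_usd = 0.0;
        }
        let projected = self.used_usd + intent_usd;
        match self.enforced_cap_usd {
            // Enforce mode: reject and do not count the intent as used.
            Some(cap) if projected > cap => Decision::Reject,
            // Warn-only mode: count the breach but let the intent through.
            None if projected > self.default_cap_usd => {
                self.would_reject_total += 1;
                self.used_usd = projected;
                Decision::WouldReject
            }
            _ => {
                self.used_usd = projected;
                Decision::Allow
            }
        }
    }
}
```

Deriving the day from Unix seconds keeps the rollover independent of the host's local time zone, which is the check the Action step asks the operator to make.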
EXECUTOR_CAP_REJECT_GLOBAL_DAILY
Severity: Error | Category: Safety
Intent would push the global UTC-day rolling notional (sum across every bot) past GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_GLOBAL. The cap resets at UTC midnight.
Reinstated 2026-05-17 (DP-01 reactivation). Optional-first design: same warn-only / enforce gate as EXECUTOR_CAP_REJECT_PER_BOT_DAILY but keyed off GORDON_EXECUTOR_MAX_DAILY_NOTIONAL_USD_GLOBAL. When unset, gauges fire and the would-reject counter increments under scope="global" but no reject fires.
- Causes: aggregate strategy capacity exceeded the daily ceiling (multiple bots all firing simultaneously); global ceiling configured too low for the total account size + strategy count.
- Action: check gordon_executor_daily_notional_used_usd_global vs the configured ceiling. Sum across bots: if every bot is within its per-bot cap but they collectively exceed the global cap, either raise the global cap (redeploy) or pause one or more bots to free headroom.
- Escalate: global cap rejects on a single-bot account = misconfiguration (cap should equal the per-bot cap). Multi-bot rejects = the strategy portfolio is over-allocated relative to the operator-declared daily risk budget; pause bots before raising the cap blindly.
EXECUTOR_RECONCILE_DRIFT
Severity: Error | Category: Safety
In-memory order state diverged from trading.orders during reconcile on restart.
- Causes: executor crashed mid-fill update; DB write failure after Binance confirmed the order; concurrent writes (should be impossible — only executor writes this table).
- Action: check trading.orders for rows with inconsistent status / filled_qty. Cross-reference with Binance account history (GET /fapi/v1/allOrders). Manually patch the diverged rows, then restart the executor.
- Escalate: if reconcile drift occurs on every restart, there is a systematic write failure; check Postgres connectivity and disk space.
EXECUTOR_FLATTEN_FAILED
Severity: Critical | Category: Safety
A flatten operation (break-glass, operator, or risk-initiated) failed at the exchange layer.
- Causes: Binance rejected the market order; connectivity loss to exchange; rate limit hit.
- Action: check the error context field for the underlying cause. Verify open positions via GET /positions. Retry the flatten via POST /flatten or POST /executor/break-glass/flatten. Open positions are NOT automatically re-closed after this failure; manual intervention is required.
- Escalate: if flatten fails repeatedly, assume positions are open. Contact exchange support if API keys are the issue.
EXECUTOR_DB_WRITE_FAILED
Severity: Error | Category: Infra
A DB write during order submission or reconcile failed for a transient reason (connectivity, timeout) rather than a constraint violation.
- Causes: Postgres connection loss; disk full on DB host; connection pool exhaustion.
- Action: check Postgres connectivity from the executor container. Review trading.orders for rows that may have been partially written. Cross-reference with Binance to find orders submitted but not recorded.
- Escalate: distinguish from SHARED_DB_CONSTRAINT_VIOLATION (uniqueness/FK errors). If DB writes fail persistently, stop the executor and reconcile manually.
EXECUTOR_FILL_TRACKER_FAILED
Severity: Error | Category: Infra
Fill tracker failed to acquire or maintain a Binance user-data WebSocket listen_key. Fill events are not being received for the affected network.
- Causes: Binance API key revoked or expired; network connectivity to Binance WS endpoint; rate limit hit on POST /api/v3/userDataStream.
- Action: check Binance API key validity. Review Binance status page for WS outages. Restart the executor to trigger a fresh listen_key acquisition. Fills that arrived during the outage will be reconciled on restart.
EXECUTOR_STARTUP_FAILED
Severity: Critical | Category: Safety
Fatal startup failure: configuration invalid, required env vars absent or unparseable, or an internal assertion failed during process initialisation.
- Causes: missing GORDON_EXECUTOR_* env vars; invalid Binance credentials format; DB connection string malformed; port binding conflict.
- Action: check container logs for the specific error. Validate env vars against .env.example. Verify DB is reachable from the executor container. Fix configuration and restart.
EXECUTOR_INTENT_REJECTED
Severity: Warning | Category: Safety
An intent was rejected by the invariant pipeline. Umbrella code — the reason_code structured field carries the finer identifier (max_notional_exceeded, funding_guard_exceeded, margin_sanity_exceeded, sl_missing_or_invalid, exchange_data_unavailable, missing_strategy, stale_fence, no_lease).
- Causes: bot emitted an intent that exceeds a cap, missed a mandatory SL, raced the fence (stale bot), or the exchange-context lookup (mark price / funding / margin / leverage) failed just before submission.
- Action: individual rejects are expected pipeline surface; alert on sustained rate per reason_code. exchange_data_unavailable spikes point at gordon-data degradation; other reasons point at the emitting bot's config or exchange state.
EXECUTOR_SUBMIT_FAILED
Severity: Warning | Category: Safety
Submitter composite outcome: one or both legs rejected at the exchange. Cancel-on-fail choreography recovered (SL was cancelled after entry fail, or entry cancelled after SL fail), or the orphan is left for reconcile to rescue. Distinct from EXECUTOR_EXCHANGE_REJECT which documents the raw -XXXX reject surface.
- Causes: hard reject from Binance (invalid symbol, minimum notional violated, reduceOnly sizing wrong); network_not_configured on the intent's network; parallel submit raced with a reconfigure.
- Action: check the reason field (entry_submission_failed, sl_submission_failed, sl_submission_failed_cancel_failed, network_not_configured). sl_submission_failed_cancel_failed leaves an orphan entry; verify reconcile on the next restart picks it up.
EXECUTOR_DB_TRANSIENT
Severity: Warning | Category: Infra
Transient DB error on a read / listen path: initial catch-up drain failed, PgListener reconnect, post-reconnect drain, or fence lookup failed. The consumer self-heals on the next NOTIFY; not pageable alone. Distinct from EXECUTOR_DB_WRITE_FAILED (write path, order did not get recorded).
- Causes: Postgres restart, network blip between executor and DB, pool exhaustion.
- Action: alert only on sustained rate (> 5 per minute). Check Postgres health + executor → DB connectivity.
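The "alert only on sustained rate" rule can be approximated with a sliding-window counter. The sketch below is illustrative only: RateWindow is a hypothetical helper, and in practice the rule would more likely live in the alerting stack than in service code.

```rust
use std::collections::VecDeque;

/// Hypothetical sliding-window counter over event timestamps (seconds).
struct RateWindow {
    window_secs: u64,
    threshold: usize,
    events: VecDeque<u64>,
}

impl RateWindow {
    fn new(window_secs: u64, threshold: usize) -> Self {
        Self { window_secs, threshold, events: VecDeque::new() }
    }

    /// Record one occurrence at `now_secs`; returns true when the number of
    /// occurrences inside the trailing window exceeds the threshold.
    fn record(&mut self, now_secs: u64) -> bool {
        self.events.push_back(now_secs);
        // Evict timestamps that have aged out of the window.
        while let Some(&oldest) = self.events.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len() > self.threshold
    }
}
```

With RateWindow::new(60, 5), isolated EXECUTOR_DB_TRANSIENT lines never trip the alert; only a sixth occurrence inside one minute does.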
EXECUTOR_INTERNAL_ERROR
Severity: Error | Category: Infra
Internal invariant violation: serde serialisation failure, OpenAPI render failure, or a non-startup assertion failed. Should never fire in practice — when it does, the operator must investigate immediately.
- Causes: a value that should always serialise (redacted config, compiled-in spec) failed; a type change broke a contract; process memory corruption.
- Action: page immediately; restart the executor; open a bug.
EXECUTOR_BOOT_DEGRADED
Severity: Warning | Category: Safety
Executor booted into degraded mode: reconcile quarantined one or more networks so the intent consumer + fill tracker were NOT spawned, or a subsystem (bot-command consumer) failed to spawn. /readyz stays 503 until the operator clears the underlying condition.
- Causes: startup reconcile exceeded the critical-anomaly ceiling (GORDON_EXECUTOR_MAX_CRITICAL_ANOMALIES); bot-command NOTIFY subscription handshake failed against Postgres.
- Action: inspect trading.reconcile_runs for the anomaly breakdown. Operator investigates and, if safe, calls POST /clear-quarantine with a fresh confirm_token. For bot-command spawn failure, verify the DB pool.
EXECUTOR_SHUTDOWN_ERROR
Severity: Warning | Category: Infra
Abnormal shutdown / serve path: axum serve returned an error, drain budget exceeded, or a signal handler install failed at startup.
- Causes: port already in use when binding (rare — caught earlier); OS-level signal delivery failure; background task hang exceeding the 30 s drain budget.
- Action: inspect the preceding log lines for the underlying cause. Process is exiting; subsequent boot should come up clean unless the root cause (port conflict, stuck task) recurs.
EXECUTOR_BOT_COMMAND_FAILED
Severity: Warning | Category: Control
trading.bot_commands consumer failed to process a flatten command targeted at the executor, or the cursor commit after processing failed. Delivery is at-least-once — a failed commit redelivers the same row on the next NOTIFY; a failed process is absorbed by the idempotent flatten runner.
- Causes: transient DB error while committing the cursor; flatten runner returned an error (which itself is logged as EXECUTOR_FLATTEN_FAILED).
- Action: verify the flatten eventually completed (look for BotEvent::FlattenStepComplete on the bot_events stream for the targeted network). If the command redelivers indefinitely, investigate the cursor commit path.
EXECUTOR_BREAK_GLASS_DENIED
Severity: Warning | Category: Safety
Break-glass endpoint rejected the request or the dispatched task failed after auth: auth fail, stale confirm timestamp (±60 s window), audit publish failure, or the background flatten task errored. Every variant is audited via BotEvent::BreakGlassInvoked.
- Causes: bearer token mismatch or wrong Authorization header shape; operator's clock skew exceeded 60 s; IPC publish hiccup; ladder errored mid-flatten.
- Action: correlate with the BreakGlassOutcome on the audit event (AuthFail, StaleConfirm, MalformedRequest, Accepted). A series of AuthFail + StaleConfirm is an intrusion signal; rotate the break-glass token immediately.
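The ±60 s confirm-timestamp window can be illustrated with a small freshness check. This is a sketch under assumptions: check_confirm and ConfirmCheck are hypothetical names, and treating a skew of exactly 60 s as fresh is a guess about the boundary, not something the guide states.

```rust
/// Maximum accepted distance between server clock and confirm timestamp.
const CONFIRM_WINDOW_SECS: u64 = 60;

#[derive(Debug, PartialEq)]
enum ConfirmCheck {
    Fresh,
    Stale { skew_secs: u64 },
}

/// Compare the operator-supplied confirm timestamp with the server clock.
/// The window is symmetric: a timestamp may lag or lead by up to 60 s.
fn check_confirm(server_now_secs: u64, confirm_ts_secs: u64) -> ConfirmCheck {
    let skew = server_now_secs.abs_diff(confirm_ts_secs);
    if skew <= CONFIRM_WINDOW_SECS {
        ConfirmCheck::Fresh
    } else {
        ConfirmCheck::Stale { skew_secs: skew }
    }
}
```

Logging the measured skew alongside a stale outcome helps separate an operator workstation with a drifting clock from a replayed request.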
EXECUTOR_RECONCILE_FIX_FAILED
Severity: Warning | Category: Safety
A single reconcile fix attempt failed (non-contract violation). The mismatch is recorded in trading.reconcile_runs; subsequent reconcile passes on restart, or the fill tracker's replay on WS reconnect, will bridge the gap. Contract violations escalate to EXECUTOR_RECONCILE_DRIFT instead.
- Causes: placing an orphan SL failed at the exchange (insufficient margin, symbol delisted); synthesised-roundtrip INSERT violated a constraint; state mismatch left for operator review.
- Action: inspect the anomaly field for the class + the error field for the underlying cause. If orphan-SL placement keeps failing, the operator must manually attach a SL on the exchange before clearing quarantine.
EXECUTOR_IPC_PUBLISH_FAILED
Severity: Warning | Category: Infra
Best-effort IPC publish failed: BotEvent::ReconcileComplete, BotEvent::ReconcileQuarantine, break-glass audit, or flatten step / completion event. The authoritative state lives in the DB row — audit bus absence does NOT roll back the action.
- Causes: Postgres transient error; NOTIFY payload exceeded the Postgres limit (~8 kB); schema-mismatch between the emitted event and a consumer's decoder.
- Action: cross-check against trading.reconcile_runs / trading.orders to confirm the action persisted. Alert on sustained rate (> 10 per minute); that signals a durable IPC problem.
EXECUTOR_INVALID_REQUEST
Severity: Error | Category: Safety
HTTP request failed input validation on an operator endpoint: missing body, unknown network_scope, malformed confirm_token UUID, or a required field was absent. Distinct from EXECUTOR_UNAUTHORIZED (auth failure) — this fires after auth passes but the request body or query is malformed.
- Causes: operator / automation sent a malformed payload; schema drift between console client and executor; typo in a CLI call to /flatten or /clear-quarantine.
- Action: consult the field pointer in the ErrorResponse body + the OpenAPI spec at /docs. Fix the caller; retry.
EXECUTOR_FLATTEN_STEP_FAILED
Severity: Warning | Category: Safety
A single step inside the flatten driver produced a recoverable failure: per-symbol reduce-only limit submit/cancel errored, book fetch failed, position poll hiccupped, or the targeted network was not configured on this executor. The ladder retries on the next step; only a whole-flatten failure (driver-level abort) surfaces as EXECUTOR_FLATTEN_FAILED.
- Causes: transient exchange error; rate-limit hit; symbol not tradable mid-flatten; operator invoked flatten with a network scope this executor has no keys for.
- Action: inspect reason_code to localise (network_not_configured, book_fetch_failed, aggressive_limit_failed, market_fallback_failed, symbol_new_rejected, position_poll_failed, step_zero_cancel_failed). Isolated events self-heal; sustained per-symbol streams indicate an exchange issue.
EXECUTOR_FLATTEN_NO_TARGETS
Severity: Warning | Category: Safety / Observability
The flatten ladder dispatched but observed zero non-zero positions across every targeted network — the per-call list_positions() snapshot was empty, so the per-symbol loop ran zero iterations and no flatten trade was produced. Always operationally meaningful: either upstream (gordon-risk / trading.positions) is lying about exposure, the exchange snapshot is out of sync with what the operator expected, or a real flatten was wasted on an already-flat book.
- Causes: stale trading.positions ghost row (cluster-A trigger-skip bug); drill harness dispatched without a fresh setup position; mock-binance race between fill broadcast and HTTP positionRisk read; operator invoked flatten on a network scope where no positions were open.
- Action: cross-check the operator's expected position state against the exchange snapshot at the dispatch timestamp (Loki: EXECUTOR_FLATTEN_NO_TARGETS network_scope + trace_id). Compare trading.positions WHERE qty != 0 vs /fapi/v2/positionRisk for the same (network, symbol) set. If they disagree, the trigger or producer is stale; if they agree but the operator expected exposure, the dispatch was misrouted (wrong network scope).
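The comparison in the Action step amounts to a keyed diff of two position snapshots. A rough sketch, assuming both sides have already been loaded into maps keyed by (network, symbol); the function name is hypothetical, and f64 quantities stand in for whatever decimal type the real tables use.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (network, symbol)
type Key = (String, String);

/// Return every key whose quantity differs between the DB view and the
/// exchange view. A key missing on one side is treated as quantity 0.
fn disagreements(db: &BTreeMap<Key, f64>, exchange: &BTreeMap<Key, f64>) -> Vec<Key> {
    // Union of keys from both snapshots, in sorted order.
    let keys: BTreeSet<&Key> = db.keys().chain(exchange.keys()).collect();
    keys.into_iter()
        .filter(|k| {
            let a = db.get(*k).copied().unwrap_or(0.0);
            let b = exchange.get(*k).copied().unwrap_or(0.0);
            a != b
        })
        .cloned()
        .collect()
}
```

An empty result with an empty exchange snapshot corresponds to the "they agree" branch, where the remaining suspect is the network scope of the dispatch.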
RISK codes
RISK_BREAKER_TRIPPED
Severity: Critical | Category: Safety
A circuit breaker tripped. Context field breaker names the specific breaker (e.g. DrawdownBreaker, VPINBreaker).
- Causes: portfolio drawdown exceeded threshold (DrawdownBreaker); flash-crash VPIN spike (VPINBreaker); correlation density too high (CorrelationBreaker); macro event (MacroBreaker); connectivity loss (ConnectivityBreaker).
- Action: first, check trading.risk_events for the breaker variant + timestamp. Second, verify the triggering metric has normalized. Then: POST /risk/resume with a reason field describing the investigation.
- Escalate: if a breaker trips repeatedly within hours, the threshold may need tuning; file a plan entry before adjusting. Never raise thresholds under live stress.
RISK_FLATTEN_REQUESTED
Severity: Critical | Category: Safety
Risk service issued an emergency-flatten instruction to gordon-executor.
- Causes: manual POST /emergency-flatten call; automatic escalation after a circuit breaker remained tripped past the escalation window.
- Action: check trading.risk_audit_log for the flatten scope and reason. Verify all positions are closed on exchange (GET /fapi/v2/positionRisk). Do not resume until the triggering condition is understood and resolved.
- Escalate: unexpected automatic flattens (no manual trigger) indicate an escalation state machine bug; check the EscalationManager state in risk service logs.
RISK_PAUSED
Severity: Warning | Category: Safety
Risk service paused one or more bots.
- Causes: circuit breaker tripped and the breaker outcome is PauseBots (not flatten); manual POST /bots/:id/pause call.
- Action: check trading.risk_events + bot_configs.status. Resume via POST /risk/resume once the triggering metric normalises.
- Escalate: bots that stay paused > 24h are likely stuck in escalation; check trading.risk_audit_log and the EscalationManager state.
RISK_HALTED
Severity: Error | Category: Safety
The executor rejected a fresh order intent because the risk-halt latch (trading.risk_state.halted = TRUE) is engaged. The latch flips TRUE on POST /risk/emergency-flatten and on any circuit-breaker trip whose outcome is Flatten; it clears only on POST /risk/resume. Rejected intents carry order_intents.outcome = 'rejected' with outcome_reason = 'risk_halted'.
Full kill-switch contract (Parts 1, 2, 3 all live)
- Operator (or breaker) triggers POST /emergency-flatten.
- gordon-risk flips trading.risk_state.halted = TRUE atomically with the bot_commands + risk_events audit writes. The halted-column transition fires pg_notify('risk_halt_changed', ...).
- gordon-executor's RiskHaltState watcher (LISTEN on risk_halt_changed plus a 5-second polling fallback) picks the flip up and updates its in-memory snapshot.
- Every subsequent intent reaching the intent-consumer hits the halt gate before fence / invariant / submitter checks. The row is marked rejected with outcome_reason = 'risk_halted'; a BotEvent::IntentRejected is published; no exchange submission occurs.
- Operator decides the system is safe to resume → POST /risk/resume. gordon-risk flips halted = FALSE; the trigger fires a risk_halt_changed NOTIFY; the executor watcher adopts the new value.
- Fresh intents flow through the normal path again.
The executor fails closed: if its DB read against trading.risk_state errors (missing grant, table missing, transient pool failure), the snapshot flips back to halted = TRUE and retries every 5 seconds. An executor that cannot verify the halt state is safer halted than submitting blind.
- Causes: operator hit the kill switch (POST /emergency-flatten) or a circuit breaker tripped with a flatten outcome; the latch has not been cleared since.
- Action: verify open positions are flat (GET /positions on executor). Check trading.risk_audit_log for the halt trace_id and the triggering cause. Once the condition is understood and safe to resume, POST /risk/resume with a reason to clear the latch. Until resume succeeds, every fresh intent will be rejected.
- Escalate: if the latch re-engages immediately after resume, a breaker is stuck in a trip loop; check trading.risk_events for the triggering metric and pause the offending bot(s) before resuming. Never bypass the latch.
- Diagnose: every executor rejection logs code="RISK_HALTED" with halt_trace_id equal to the trading.risk_audit_log row that engaged the latch; grep Loki for the trace id to correlate.
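The fail-closed behaviour of the halt gate can be sketched as a snapshot that treats any failed read as halted. RiskHaltState is named in this guide, but the fields and methods below are assumptions for illustration; in particular, starting in the halted state before the first successful read is a guess, not a documented property.

```rust
/// Hypothetical in-memory view of trading.risk_state.halted.
struct RiskHaltState {
    halted: bool,
}

impl RiskHaltState {
    /// Assumption: stay halted until the first successful read.
    fn new() -> Self {
        Self { halted: true }
    }

    /// Apply the result of one poll or NOTIFY-triggered read.
    /// Any error (missing grant, missing table, pool failure) fails closed.
    fn apply_read(&mut self, read: Result<bool, String>) {
        self.halted = match read {
            Ok(value) => value,
            Err(_) => true,
        };
    }
}

#[derive(Debug, PartialEq)]
enum IntentOutcome {
    /// Carries the outcome_reason recorded on the intent row.
    Rejected(&'static str),
    /// Handed on to the fence / invariant / submitter checks.
    PassedToPipeline,
}

/// The gate runs before any other check on an incoming intent.
fn gate(state: &RiskHaltState) -> IntentOutcome {
    if state.halted {
        IntentOutcome::Rejected("risk_halted")
    } else {
        IntentOutcome::PassedToPipeline
    }
}
```

Mapping an error to halted = true rather than keeping the last known value is what makes an unreadable risk_state indistinguishable from an engaged latch, which matches the "safer halted than submitting blind" rule.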
RISK_REASON_REQUIRED
Severity: Error | Category: Safety
reason field absent or blank on a POST /emergency-flatten, POST /risk/resume, or POST /bots/:id/pause request.
- Causes: API caller omitted the reason field; sent an empty string.
- Action: add a non-empty reason string to the request body. See the risk service API for the expected schema.
- Escalate: if this fires from an automated caller, fix the caller to always include a reason describing the automated context.
RISK_REASON_TOO_LONG
Severity: Error | Category: Safety
reason field exceeds the 500-character limit on an operator risk endpoint request.
- Causes: automated caller concatenating unbounded log context into the reason field.
- Action: truncate the reason to ≤ 500 characters.
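A caller can pre-validate the reason field before sending it, covering both RISK_REASON_REQUIRED and RISK_REASON_TOO_LONG. The sketch below is a client-side illustration with hypothetical names; whether the service counts Unicode characters or bytes is not stated here, so counting characters is an assumption.

```rust
const MAX_REASON_CHARS: usize = 500;

#[derive(Debug, PartialEq)]
enum ReasonError {
    /// Absent or blank.
    Required,
    /// Longer than the limit (length measured in characters).
    TooLong { chars: usize },
}

/// Validate a reason before putting it in the request body.
fn validate_reason(reason: Option<&str>) -> Result<&str, ReasonError> {
    let trimmed = reason.unwrap_or("").trim();
    if trimmed.is_empty() {
        return Err(ReasonError::Required);
    }
    let chars = trimmed.chars().count();
    if chars > MAX_REASON_CHARS {
        return Err(ReasonError::TooLong { chars });
    }
    Ok(trimmed)
}

/// Cut an over-long reason to the limit without splitting a character.
fn truncate_reason(reason: &str) -> String {
    reason.chars().take(MAX_REASON_CHARS).collect()
}
```

Truncating by characters rather than by byte index avoids a panic on multi-byte text, which matters when the reason is assembled from log context.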
RISK_INVALID_SCOPE
Severity: Error | Category: Safety
scope on POST /emergency-flatten is not one of the accepted variants ("all", "bot:<uuid>", "symbol:<SYMBOL>", "cluster:<id>").
- Causes: typo in scope string; unsupported variant attempted.
- Action: use one of the four accepted formats exactly. For bot-scoped flatten, use the UUID v7 from bot_configs.id.
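The four accepted scope formats can be checked on the caller side before the request is sent. This is a sketch, not the risk service's parser: Scope, parse_scope, and is_uuid_v7 are hypothetical names, the UUID check is purely structural (canonical 8-4-4-4-12 hex form with version nibble 7), and rejecting empty symbol or cluster values is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Scope {
    All,
    Bot(String),
    Symbol(String),
    Cluster(String),
}

/// Structural check for a UUID v7 in canonical hyphenated form.
fn is_uuid_v7(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() != 36 {
        return false;
    }
    bytes.iter().enumerate().all(|(i, &c)| match i {
        8 | 13 | 18 | 23 => c == b'-',
        14 => c == b'7', // version nibble
        19 => matches!(c, b'8' | b'9' | b'a' | b'b' | b'A' | b'B'), // variant bits
        _ => c.is_ascii_hexdigit(),
    })
}

/// Parse "all", "bot:<uuid>", "symbol:<SYMBOL>" or "cluster:<id>".
fn parse_scope(raw: &str) -> Result<Scope, String> {
    if raw == "all" {
        return Ok(Scope::All);
    }
    match raw.split_once(':') {
        Some(("bot", id)) if is_uuid_v7(id) => Ok(Scope::Bot(id.to_string())),
        Some(("symbol", sym)) if !sym.is_empty() => Ok(Scope::Symbol(sym.to_string())),
        Some(("cluster", id)) if !id.is_empty() => Ok(Scope::Cluster(id.to_string())),
        _ => Err(format!("invalid scope: {raw}")),
    }
}
```

The match is case-sensitive on purpose: "ALL" or "Bot:..." is rejected, in line with the instruction to use the accepted formats exactly.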
RISK_UNAUTHORIZED
Severity: Warning | Category: Safety
X-Operator-Token header absent, wrong, or token not configured on the risk service.
- Causes: GORDON_RISK_OPERATOR_TOKEN not set; client sending wrong token; token rotated without updating the caller.
- Action: verify GORDON_RISK_OPERATOR_TOKEN in docker-compose.yml. If token is not configured at startup, all protected endpoints return 503.
RISK_INVALID_BOT_ID
Severity: Error | Category: Safety
id path parameter on POST /bots/:id/pause is not a valid UUID v7.
- Causes: caller sending a non-UUID id (integer, slug, truncated UUID).
- Action: use the UUID v7 from bot_configs.id as the path parameter.
RISK_STARTUP_FAILED
Severity: Critical | Category: Infra
Fatal startup failure — Config::from_env rejected an env var, the Postgres pool could not be opened, or the serve loop returned an error. Process exits with ExitCode::FAILURE (or ExitCode::from(2) for config errors).
- Causes: missing / malformed env var (e.g. GORDON_RISK_BIND_ADDR, GORDON_DATABASE_URL); Postgres unreachable; a migration / role-probe failed.
- Action: read the structured error field in the log line; it carries the underlying anyhow::Error / sqlx::Error. Fix the config or the DB surface and restart. For DB errors, confirm gordon-migrate has run.
- Escalate: a startup that never succeeds is a deployment blocker; the container orchestrator will crash-loop. Investigate immediately.
RISK_SHUTDOWN_ERROR
Severity: Warning | Category: Infra
Non-fatal shutdown-path surface: axum serve returned an error, the drain budget was exceeded, a signal-handler install failed, or the scheduler task join returned an error on teardown. Process is exiting anyway; log fidelity matters for the postmortem.
- Causes: scheduler task panicked (code bug); in-flight request took longer than the drain budget; signal handler could not be installed (OS limits).
- Action: capture the structured error field, cross-reference with any scheduler-panic stack trace printed earlier. Kept at WARN per op-07c; dashboards should alert only on sustained rate.
- Escalate: scheduler-panic-at-shutdown is a code bug; file an issue with the stack trace and the commit SHA.
RISK_BOOT_DEGRADED
Severity: Warning | Category: Infra
Risk booted into a degraded mode. Two main surfaces:
- Empty data_freshness (ConnectivityBreaker): no bot has ever posted a trading.bot_events row. The breaker returns Noop so cold-start does not fire an emergency, but this is unexpected if bots are supposed to be running.
- FRED macro data absent (MacroBreaker): market_data.macro_data has no DXY / VIX rows. The breaker returns Noop; gordon-data's FRED fetch has not populated yet.
- Action: confirm that gordon-bot instances are running (docker ps) and that trading.bot_events is non-empty. For macro data, check gordon-data's FRED ingestor (POST /warmup and the macro tables).
- Escalate: > 10 min of RISK_BOOT_DEGRADED after all bots should be online means the data pipeline has a wiring gap.
RISK_DB_TRANSIENT
Severity: Warning | Category: Infra
Transient DB error on a read / listen path: PgListener connect failed, the LISTEN statement failed, recv returned an error, a bot_events row lookup hit a pool blip, or the scheduler snapshot query returned a sqlx::Error.
- Causes: brief Postgres restart, pool exhaustion, TCP reset during a lingering connection. Self-heals — the listener reconnects after a 5s sleep, the scheduler retries on the next tick.
- Action: check Postgres logs for a restart or a FATAL line around the timestamp. If the pattern persists, raise the pool size or investigate the network between risk and postgres.
- Escalate: sustained RISK_DB_TRANSIENT bursts rolling every 5s indicate Postgres is not healthy; page the DB on-call.
RISK_SCHEDULER_TICK_FAILED
Severity: Error | Category: Safety
Scheduler decided to act (a breaker fired) but the risk_events + bot_commands transaction failed to commit. The commanded action did NOT reach executor or bots — they will not flatten / pause until the next cycle retries.
- Causes: Postgres outage mid-transaction; constraint violation on the bot_commands row (should not happen with the current schema); a new breaker was added that produces an event_type the scheduler's match arm doesn't know.
- Action: pageable. Read the error + breaker fields, confirm the commit did NOT land (SELECT * FROM trading.bot_commands WHERE trace_id = …). If positions are exposed and the breaker's intent was to flatten, trigger POST /emergency-flatten manually while you diagnose.
- Escalate: immediate. A safety-critical write that didn't land is a protection gap.
RISK_SNAPSHOT_MISSING_VPIN
Severity: Error | Category: Data
Scheduler found an active position on a symbol for which market_data.metrics.vpin returns zero rows. The VPIN breaker (flash-crash kill switch) cannot evaluate blind, so the whole cycle is skipped — the portfolio is unprotected against VPIN-grade events until the gap closes.
- Causes: gordon-data's derived_vpin.rs source is not running, is lagging more than one hour, or the symbol is genuinely new and VPIN has not been derived yet.
- Action: pageable. Check gordon-data's /readyz and the VPIN source status (/data/sources). Verify the affected symbol appears in SELECT DISTINCT symbol FROM market_data.metrics WHERE vpin IS NOT NULL.
- Escalate: every minute of missing VPIN on an active position is a minute of unprotected exposure. Do not widen this gap; pause the bot with the missing-VPIN symbol while the data source recovers.
RISK_ESCALATION_STEP_FAILED
Severity: Warning | Category: Safety
Best-effort audit row write in the escalation state machine failed. Affects three specific rows:
- flatten_requested: logged when a flatten watcher starts.
- flatten_completed: logged when FlattenStepComplete arrives on time.
- lease_revoked: logged after the 30s timeout clears holder_bot_id.
The commanded action (flatten command on bot_commands; lease clear on bot_leases) is committed in a separate transaction and already succeeded — this log means the audit trail has a gap, not that the action rolled back.
- Action: cross-reference trading.risk_audit_log with the trace_id field. If the row is truly missing, file a backfill ticket; the action side is already recorded in trading.bot_commands / trading.risk_events.
- Escalate: sustained rate on this code degrades auditability. Check the error field for pattern (FK violation vs pool exhaustion).
RISK_ESCALATION_SUPPRESSED
Severity: Warning | Category: Safety
Escalation manager rejected a new flatten registration for one of two reasons:
- In-flight: a watcher is already active for the same scope, so a second registration would collide. Retry-storm guard.
- Cooldown: scope is within the 60-second post-completion cooldown; a fresh flatten right after a successful one is vacuous (portfolio already flat).
The caller's bot_commands row was already committed before the rejection — only the escalation tracking is a no-op.
- Action: this is an expected operational surface during operator drill retries or breaker oscillation. If the rate is higher than expected, inspect trading.risk_audit_log for the firing cadence.
- Escalate: a cooldown-guarded scope that fires repeatedly after 60s hints that positions are being re-opened faster than the flatten can close them; investigate the bot side.
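The two suppression reasons map onto a small per-scope state machine. The sketch below is not the EscalationManager itself: every name is hypothetical, and accepting a registration at exactly 60 s after completion is an assumption about the boundary.

```rust
use std::collections::HashMap;

const COOLDOWN_SECS: u64 = 60;

#[derive(Debug, PartialEq)]
enum Registration {
    Accepted,
    SuppressedInFlight,
    SuppressedCooldown,
}

enum ScopeState {
    /// A watcher is currently tracking this scope.
    InFlight,
    /// The last flatten for this scope completed at this time (seconds).
    CompletedAt(u64),
}

/// Hypothetical tracker keyed by scope string.
struct EscalationTracker {
    scopes: HashMap<String, ScopeState>,
}

impl EscalationTracker {
    fn new() -> Self {
        Self { scopes: HashMap::new() }
    }

    /// Try to register a new flatten watcher for `scope`.
    fn register(&mut self, scope: &str, now_secs: u64) -> Registration {
        match self.scopes.get(scope) {
            Some(ScopeState::InFlight) => return Registration::SuppressedInFlight,
            Some(ScopeState::CompletedAt(done))
                if now_secs.saturating_sub(*done) < COOLDOWN_SECS =>
            {
                return Registration::SuppressedCooldown
            }
            _ => {}
        }
        self.scopes.insert(scope.to_string(), ScopeState::InFlight);
        Registration::Accepted
    }

    /// Mark the flatten for `scope` as completed, starting the cooldown.
    fn complete(&mut self, scope: &str, now_secs: u64) {
        self.scopes.insert(scope.to_string(), ScopeState::CompletedAt(now_secs));
    }
}
```

Note that suppression only affects tracking: as stated above, the caller's bot_commands row is already committed by the time a registration is refused.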
RISK_FLATTEN_TIMEOUT
Severity: Error | Category: Safety
Escalation watcher timed out: no FlattenStepComplete event arrived on the bot_events channel within the 30-second window. Risk cleared holder_bot_id on every active bot_leases row (UPDATE … SET holder_bot_id = NULL) so bots cannot re-acquire the lease and keep trading.
Note: risk does NOT bump the fence — that is executor's job when it processes the flatten command. This log fires when the executor is either dead or unable to reach the exchange; the safe action is to revoke leases on the risk side.
- Action: pageable. Confirm executor is alive (GET /healthz on executor). Check executor logs for EXECUTOR_FLATTEN_FAILED / EXECUTOR_FLATTEN_STEP_FAILED at the same trace_id. Manually verify positions are flat on the exchange (GET /fapi/v2/positionRisk). If positions remain open, trigger a manual flatten via the Binance UI / API.
- Escalate: immediate. Risk has given up and revoked leases; executor must be restored before any resume.
RISK_CONFIG_PARSE_FAILED
Severity: Warning | Category: Infra
A trading.risk_config row value could not be parsed as the expected type (decimal, integer, float, array, or JSON object). The breaker falls back to the compiled-in default so evaluation never stalls; the operator should fix the row so the intended threshold is honoured.
- Causes: operator wrote a typo into value (e.g. "0.10" as string instead of 0.10 numeric); JSON schema changed without updating seeds; fresh install with a missing row.
- Action: SELECT key, value FROM trading.risk_config WHERE key = '<key>'; fix the row to the expected JSON-native type. Also covers the defensive "unknown breaker name" surface in the scheduler where a new breaker was added without updating the event-type match arm.
- Escalate: sustained unknown-breaker warnings on the scheduler mean the breaker taxonomy is drifting; update the match arm in scheduler.rs.
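The fall-back-to-default behaviour for a numeric threshold can be sketched as below. This is an approximation under assumptions: threshold_or_default is a hypothetical helper, it works on the raw JSON text of the value column, and it uses Rust's f64 parser in place of a real JSON parser, so it accepts a few spellings (such as a leading "+") that strict JSON would not.

```rust
/// Parse the raw JSON text of a risk_config value as a number.
/// Returns the value to use and whether the compiled-in default was applied
/// (the case that would be logged as RISK_CONFIG_PARSE_FAILED).
fn threshold_or_default(raw_json: &str, default: f64) -> (f64, bool) {
    match raw_json.trim().parse::<f64>() {
        // Reject inf / NaN, which the f64 parser accepts but JSON does not.
        Ok(value) if value.is_finite() => (value, false),
        // Quoted strings, empty text and other non-numeric input fall back.
        _ => (default, true),
    }
}
```

The quoted form "0.10" fails here for the same reason it fails in the service: the stored JSON value is a string, not a number, even though its contents look numeric.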
RISK_INTERNAL_ERROR
Severity: Error | Category: Infra
Internal invariant violation — should never fire in practice. Four surfaces:
- OpenAPI render failed (openapi export or the /openapi.json handler).
- stdout write failed during openapi export -.
- Redacted-config serialiser returned a serde_json::Error on the /config handler.
- ConnectivityBreaker received a non-empty data_freshness but .values().copied().max() returned None (defensive unreachable arm).
- Action: capture the error field. For OpenAPI / serialise surfaces, this typically points to a type that is not Serialize / ToSchema; a code change caused the regression. For the breaker defensive arm, verify PortfolioState::snapshot invariants with a debug build.
- Escalate: file a bug; every fire here is a code-level regression.
RISK_BOT_EVENT_INVALID
Severity: Warning | Category: Control
The trading.bot_events NOTIFY listener received a payload it cannot act on:
- The payload is not a parseable i64 row id.
- A flatten_step_complete event row has a trace_id column that is missing or not a valid UUID.
The row is skipped. For flatten_step_complete specifically, the escalation watcher for the corresponding scope will time out after 30s (RISK_FLATTEN_TIMEOUT) and revoke leases — the system stays safe.
- Causes: another service wrote a row with a malformed trace_id (producer bug); NOTIFY payload format drift (schema-version mismatch).
- Action: inspect the payload / row_id / trace_id fields and locate the producing service. The row is recorded in trading.bot_events; fix the producer.
- Escalate: if the pattern repeats, it indicates a serialisation bug in the executor (producer of flatten_step_complete).
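When triaging a suspect row by hand, the two rejection conditions can be approximated offline. The sketch below is an assumption-laden stand-in for the listener's checks: parse_row_id and is_valid_trace_id are hypothetical names, and Python's uuid module is somewhat more lenient than a strict UUID parser (it also accepts braces and URN prefixes).

```python
import re
import uuid
from typing import Optional

I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def parse_row_id(payload: str) -> Optional[int]:
    """Return the row id if payload is a decimal integer within i64 range."""
    if not re.fullmatch(r"[+-]?[0-9]+", payload):
        return None
    value = int(payload)
    return value if I64_MIN <= value <= I64_MAX else None


def is_valid_trace_id(trace_id: Optional[str]) -> bool:
    """Return True if trace_id is present and parses as a UUID."""
    try:
        uuid.UUID(trace_id)
    except (ValueError, TypeError, AttributeError):
        return False
    return True
```

A payload for which parse_row_id returns None, or a flatten_step_complete row for which is_valid_trace_id returns False, matches the skip conditions listed above.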
RISK_SUBSCRIBER_START_FAILED
Severity: Critical | Category: Control
The escalation watcher's shared PostgresSubscriber failed to start its background catch-up + LISTEN loop on bot_events. Without this loop, risk cannot react to flatten_step_complete / flatten_step_failed signals — every flatten escalates by default at the 30s RISK_FLATTEN_TIMEOUT even when the bot completed flatten in seconds.
- Causes: Postgres unreachable at startup, role grants regressed (gordon_risk lost SELECT/UPDATE on pipeline_state or bot_events), pipeline_state table missing, schema drift on the cursor row shape.
- Action: check Postgres connectivity from gordon-risk; verify gordon_risk grants via cargo test -p gordon-migrate --test grant_matrix; inspect gordon-risk startup logs for the underlying sqlx error.
- Escalate: if the failure persists after restart, the entire flatten escalation pipeline degrades to the default 30 s timeout behaviour. Treat as P0.
RISK_SUBSCRIBER_COMMIT_FAILED
Severity: Warning | Category: Control
The escalation watcher consumed a bot_events row but failed to commit the cursor offset back to pipeline_state for the risk-escalation consumer. The side-effect already fired (escalation registered, lease revoked, etc.) — the risk is at-least-once delivery: on next risk restart, the same row will be replayed and the side-effect will run twice.
- Causes: Postgres transient (connection drop, pool exhaustion); gordon_risk lost UPDATE on pipeline_state.consumed_at; long write contention on the cursor row.
- Action: confirm replay tolerance by verifying that the escalation handler is idempotent (the same trace_id should produce the same outcome). Check Postgres health + connection pool metrics. Inspect logs for the underlying sqlx error.
- Escalate: if commits fail repeatedly, replays could amplify side-effects (multiple lease revocations, double-counted breaker trips). Page on-call if rate exceeds 1/min.
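The idempotency property asked for in the Action bullet can be sketched as a handler that keys its side effect on trace_id. This is a toy in-memory model under stated assumptions, not the escalation watcher's code; the class name and fields are hypothetical.

```python
class EscalationHandler:
    """Toy handler that applies each trace_id's side effect at most once."""

    def __init__(self):
        self.revoked = {}      # trace_id -> scope already handled
        self.side_effects = 0  # how many times the side effect actually ran

    def handle(self, trace_id, scope):
        """Return True if the side effect ran, False if this was a replay."""
        if trace_id in self.revoked:
            return False
        self.revoked[trace_id] = scope
        self.side_effects += 1
        return True


handler = EscalationHandler()
handler.handle("trace-1", "bot-a")  # first delivery: side effect runs
handler.handle("trace-1", "bot-a")  # replay after a failed commit: no-op
```

With a handler shaped like this, a replayed row after a failed cursor commit leaves the outcome unchanged.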
BOT codes
BOT_LEASE_LOST
Severity: Error | Category: Control
Bot's advisory lease expired before renewal; bot must pause and re-acquire.
- Causes: DB connectivity interruption preventing lease refresh; lease-refresh task panicked; system clock skew between container and DB host.
- Action: check DB connectivity. The bot should pause and attempt lease re-acquisition automatically. If the bot does not recover within 60 s, the manager reconciler will restart it.
- Escalate: repeated lease loss from the same bot indicates a systematic connectivity or clock issue.
BOT_QUARANTINED
Severity: Critical | Category: Control
Manager placed the bot in quarantine due to repeated failures.
- Causes: bot exceeded the reconciler's failure threshold (consecutive restart failures); signal-emit failures not self-healing; lease loss not self-healing.
- Action: inspect logs from the last N bot restarts: docker compose logs gordon-bot --tail=200. Fix the root cause (DB connectivity, strategy bug, config error). Then: POST /bots/:id/clear-quarantine?confirm=YES-<iso-ts>.
- Escalate: if the bot quarantines immediately after clear, fix the root cause before attempting another clear. A bot that quarantines within 5 minutes of clear indicates an unresolved code or config bug.
BOT_INVALID_INTENT
Severity: Error | Category: Control
POST /test/emit-intent body failed validation: unknown fields, unparseable JSON, body > 4 KiB, or qty/sl/tp semantic constraints violated.
- Causes: test endpoint called with malformed JSON; numeric fields are negative or zero; body is oversized.
- Action: fix the request body. This endpoint is only available when GORDON_BOT_STRATEGY=manual; it is not registered in production strategy mode.
- Escalate: if this fires in a non-manual deployment, a misconfigured GORDON_BOT_STRATEGY var may have unintentionally exposed the test endpoint.
BOT_STARTUP_FAILED
Severity: Critical | Category: Infra
Bot failed pre-serve startup: missing --bot-id / GORDON_BOT_ID, config-loader rejection, injected failure gate (GORDON_BOT_FAIL_STARTUP=true, exit code 73), or openapi export CLI render / write failure.
- Causes: bot spawned without a bot-id arg or env; bot_configs row missing; env var type mismatch; injected-failure flag set (fidelity 06); filesystem / stdout write failure during spec export.
- Action: check container env (docker compose config gordon-bot) for GORDON_BOT_ID and strategy / candle-source vars. Verify the trading.bot_configs row exists for the id. For injected failures, unset GORDON_BOT_FAIL_STARTUP and restart.
- Escalate: manager spawned the bot with missing env (see gordon-manager logs for the deploy driver).
BOT_SERVE_ERROR
Severity: Error | Category: Infra
Axum serve returned an error after the server accepted traffic — abnormal exit distinct from a clean drain.
- Causes: kernel TCP listener error; runtime panic in a request handler; OOM kill mid-serve.
- Action: check docker compose logs gordon-bot --tail=200 for a trailing panic or socket error. Manager will respawn on desired_state=running.
- Escalate: repeated BOT_SERVE_ERROR without a preceding explanatory log indicates kernel or runtime-level instability on the host.
BOT_SHUTDOWN_ERROR
Severity: Warning | Category: Infra
Abnormal shutdown: signal handler install failed at startup, or the drain deadline (30 s total) was reached before the drain coordinator advanced past step 6.
- Causes: rare kernel signal-registration failure; a drain step stalled and tripped the futures_pending timeout.
- Action: confirm process exited; advisory lock will auto-release on connection drop. Manager reconciler takes over.
- Escalate: correlate with BOT_DRAIN_STEP_FAILED / BOT_DRAIN_BUDGET_EXCEEDED on the same container to identify the stalled step.
BOT_DRAIN_STEP_FAILED
Severity: Warning | Category: Infra
A single drain step (finish_candle, flush_state, close_listeners, release_lease, emit_drained) failed or timed out against its sub-budget. The step field names which step slipped. Drain keeps moving forward.
- Causes: strategy loop wedged; Postgres UPDATE bot_strategy_state slow; listener join hung; lease release sqlx timeout.
- Action: usually self-healing: the advisory lock auto-releases and the reconciler handles residue. Review the preceding log line for the specific step error.
- Escalate: sustained per-step failures across deploys point at a systemic Postgres latency or network issue.
BOT_DRAIN_BUDGET_EXCEEDED
Severity: Warning | Category: Infra
Drain elapsed past the 25 s slow threshold or exceeded the 30 s total budget. Process still exits 0 per story 16.8 AC. The stalled_step field names the step that crossed the threshold.
- Causes: cascaded step failures, Postgres overload, wedged liveness task.
- Action: manager reconciler will respawn the bot if still desired; the advisory lock auto-releases on the connection drop.
- Escalate: chronic drain slowness correlates with upstream Postgres degradation; check pg_stat_activity for long-held locks or connection saturation.
BOT_LEASE_ACQUIRE_TIMEOUT
Severity: Error | Category: Control
LeaseGuard::acquire / acquire_shadow did not obtain the advisory lock within the startup budget (30 s, 100 ms → 2 s backoff).
- Causes: another bot container holds the lease for the same (symbol, strategy); stale advisory lock on a disconnected backend.
- Action: inspect trading.bot_leases + pg_locks to identify the holder. If the holder is stale, kill the owning Postgres session with pg_terminate_backend to drop the advisory lock.
- Escalate: repeated timeouts across restarts indicate a rogue bot instance running elsewhere; check Ansible inventory for duplicate container provisioning.
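For a feel of how many acquire attempts fit in the startup budget, the sketch below enumerates a backoff schedule from 100 ms up to a 2 s cap inside 30 s. The doubling growth factor is an assumption (the guide only states the endpoints), the function name is hypothetical, and time spent inside each lock attempt is ignored.

```python
def backoff_delays_ms(budget_ms=30_000, initial_ms=100, cap_ms=2_000, factor=2):
    """Yield sleep intervals until the next one would exceed the budget."""
    elapsed_ms, delay_ms = 0, initial_ms
    while elapsed_ms + delay_ms <= budget_ms:
        yield delay_ms
        elapsed_ms += delay_ms
        # Grow the delay geometrically, but never past the cap.
        delay_ms = min(delay_ms * factor, cap_ms)


delays = list(backoff_delays_ms())
print(delays[:6])               # [100, 200, 400, 800, 1600, 2000]
print(len(delays), sum(delays)) # 18 29100
```

Under these assumptions a bot sleeps 18 times before the budget runs out, so a holder that stays stale for the full window is almost certainly not going to release on its own.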
BOT_LEASE_RELEASE_FAILED
Severity: Warning | Category: Control
LeaseGuard::release failed on the graceful-exit path. Not fatal — the advisory lock auto-releases a moment later on connection drop.
- Causes: Postgres unreachable during graceful exit; network partition at shutdown.
- Action: verify the lock auto-released via pg_locks; the next process start reacquires.
- Escalate: pattern across several bots hints at a database-level issue.
BOT_LEASE_LIVENESS_FAILED
Severity: Warning | Category: Control
Lease liveness probe returned a sqlx::Error during the renewal cadence. The loop trips halt, exits cleanly, and a fresh acquire path runs on the next process start.
- Causes: transient Postgres connectivity blip; advisory lock connection closed unexpectedly; statement timeout.
- Action: self-healing on next start. If repeating, check gordon-postgres health.
- Escalate: when paired with BOT_LEASE_LOST (Critical), treat it as a lock-loss event and investigate PG session tracking immediately.
BOT_SWAP_FAILED
Severity: Error | Category: Control
Green/blue swap handshake failed: AcquireActive timed out (blue didn't release in time), upgrade_shadow_to_active returned a non-timeout SQL error, or the blue PrepareSwap reply indicated failure. Manager aborts the deploy.
- Causes: blue didn't receive prepare_swap; blue exited between receiving and replying; DB unavailable during swap; fence contention.
- Action: green exits non-zero, manager aborts the deploy, blue remains active. Check manager deploy-driver logs for the matching deploy_aborted event.
- Escalate: repeated swap failures block deploys; review the bot_deploys state machine in gordon-manager.
BOT_SWAP_IGNORED
Severity: Warning | Category: Control
Swap command deferred / ignored: channel not wired (legacy test path), wrong mode (active bot received acquire_active or shadow received prepare_swap), swap channel closed because the liveness loop already exited, or the reply channel was dropped.
- Causes: manager issued a swap command to a bot in the wrong mode; test harness path without swap-channel wiring; bot already draining when command arrived.
- Action: audit event swap_deferred or swap_role_pending carries the reason; manager retries or aborts based on the state machine.
- Escalate: not pageable alone; pattern with BOT_SWAP_FAILED is the concern.
BOT_SWAP_COMMAND_MALFORMED
Severity: Warning | Category: Control
Swap command payload missing required swap_id or deploy_id. Command is dropped; the swap_command_malformed audit event is published.
- Causes: manager bug emitted a partial payload; schema drift between manager and bot.
- Action: investigate the originating manager deploy for the swap envelope shape.
- Escalate: pins a schema-version mismatch — coordinate gordon-contracts + gordon-manager versions.
BOT_SUBSCRIBER_START_FAILED
Severity: Warning | Category: Infra
Failed to open a PgListener subscription for one of the inbound channels (order_events, fill_events, bot_commands). That listener exits; the rest of the bot keeps running.
- Causes: Postgres unreachable at listener start; connection-pool saturation; PgListener session limit.
- Action: missing bot_commands means operator commands no longer reach the bot; restart the container. Missing fill_events means fills don't update strategy state until restart (reconcile on next start catches up).
- Escalate: pattern across services indicates PG connection ceiling hit.
BOT_SUBSCRIBER_COMMIT_FAILED
Severity: Warning | Category: Infra
commit_offset for a consumed NOTIFY row failed. The row will replay on next drain; the LRU dedup suppresses double application of side effects.
- Causes: transient Postgres connectivity; cursor row contention.
- Action: self-healing via replay + idempotency; no operator action unless sustained.
- Escalate: sustained failures imply pipeline_state write path unhealthy.
BOT_COMMAND_INVALID
Severity: Warning | Category: Control
trading.bot_commands row malformed: row lookup after NOTIFY failed, the row disappeared between NOTIFY and SELECT, the command column is NULL, or the variant is unknown (forward-compat surface).
- Causes: race between NOTIFY and row-cleanup job; operator issued an unsupported variant; manager feature flag left stale.
- Action: the specific row is dropped; legitimate commands continue to dispatch. Check trading.bot_commands for orphans.
- Escalate: unknown variants across deploys warrant a shared-schema bump.
BOT_CONFIG_RELOAD_FAILED
Severity: Warning | Category: Control
reload_config command failed: the bot_configs row reload errored, or the registry refused to re-instantiate the strategy against the new params. The running instance is preserved.
- Causes: operator edited bot_configs.strategy_params with invalid JSON; registry schema validation rejects new params; DB unavailable.
- Action: fix the strategy_params JSON per the strategy's schema (GET /strategies/:name/schema). Re-issue reload_config once valid.
- Escalate: strategy-schema drift between bot image and operator's config source requires coordinated rollout.
BOT_IPC_PUBLISH_FAILED
Severity: Warning | Category: Infra
Best-effort bot_events publish failed (heartbeat, drain, lease, fallback, strategy events). The DB row is authoritative; audit-bus absence does not roll back the action.
- Causes: Postgres unreachable; trading.bot_events RLS rejected the insert (expected benign surface; see op-07e log-level rebalance: the heartbeat publisher stamps the RLS GUC before inserting but a transient restart can lose the GUC).
- Action: self-healing; not pageable unless sustained. Correlate with gordon-postgres health.
- Escalate: sustained publish miss rate → manager dashboards lose the bot's audit trail; investigate the RLS app.bot_id GUC stamping path.
BOT_ROLE_PROBE_BYPASS_DETECTED
Severity: Warning | Category: Control
Startup permission probe (story 16.9 / op-21) ran a statement that should have been rejected at 42501 (insufficient_privilege) but got past the privilege check. Treated as excess-privilege (fail-safe): the role has more grants than intended.
- Causes: gordon_bot Postgres role mis-configured; migration 0044 rolled back; manual GRANT run against production DB.
- Action: STOP DEPLOYS. Audit \du gordon_bot + \dp trading.* against migration 0044. Revoke excess grants before any bot is restarted.
- Escalate: this is a security-posture event; notify infra channel + file an incident report.
BOT_WARMUP_INCOMPLETE
Severity: Error | Category: Data
Warmup (POST /warmup on gordon-data) returned an incomplete dataset or warnings that failed the strict-mode completeness check. The bot aborts boot.
- Causes: gordon-data missing historical rows for the requested lookback; partition out of range; live ingest lagging startup.
- Action: check gordon-data /healthz + /readyz. Run make seed-klines + make fill-gaps + make precompute to backfill the lookback window.
- Escalate: persistent warmup failures after backfill point to gordon-data-side aggregation or partitioning bugs; check the DATA_AGGREGATION_ERROR log stream.
BOT_CANDLE_WS_INVALID_FRAME
Severity: Warning | Category: Data
Candle WS received a frame the bot cannot use: JSON parse failure, server error frame, or payload failed shared-type validation (e.g. unknown symbol).
- Causes: gordon-data WS protocol drift; malformed upstream Binance frame; schema-version mismatch.
- Action: driver logs + reconnects. Check gordon-data logs for the corresponding server-side error.
- Escalate: sustained bad frames correlate with gordon-data / gordon-contracts WS schema mismatch — pin versions.
BOT_CANDLE_FALLBACK_ENGAGED
Severity: Warning | Category: Data
Candle REST fallback engaged (WS down > 30 s) or escalated to degraded tier (engaged > 10 min). Operator investigation required at the degraded tier.
- Causes: gordon-data WS down; network partition between bot and gordon-data; gordon-data OOM/crash.
- Action: check gordon-data /healthz + Docker status. Fallback will auto-disengage when WS reconnects; degraded tier means manual investigation is overdue.
- Escalate: > 10 min fallback → paging event; prolonged fallback skews candle freshness and may stall strategies.
BOT_CANDLE_FALLBACK_POLL_FAILED
Severity: Warning | Category: Data
A single fallback-poll attempt failed (HTTP error from gordon-data, or fallback engaged before warmup seeded a cursor). Next 5 s tick retries.
- Causes: gordon-data REST unreachable; 5xx response; warmup race.
- Action: self-healing on next tick.
- Escalate: sustained poll failures imply gordon-data-side degradation.
BOT_CANDLE_REJECTED
Severity: Warning | Category: Data
A scripted candle fixture emitted a candle rejected by the runtime-state cursor (duplicate or out-of-order close_time_ms). Test / fidelity path only; the scripted source never runs in production.
- Causes: fixture authored with overlapping rows; replay bug in the scripted driver.
- Action: fix the fixture JSON; inspect close_time_ms monotonicity.
- Escalate: if seen in production logs, a production container is running the scripted source. This is a misconfiguration; halt the deploy.
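A fixture can be checked for monotonicity before it is replayed. The sketch below assumes the fixture has already been reduced to a list of close_time_ms integers (the fixture's real JSON layout is not described here) and mimics a cursor that only accepts strictly increasing values; find_cursor_violations is a hypothetical helper.

```python
def find_cursor_violations(close_times_ms):
    """Return (index, value) pairs that do not strictly advance the cursor."""
    violations = []
    cursor = None
    for index, close_time_ms in enumerate(close_times_ms):
        if cursor is not None and close_time_ms <= cursor:
            # Duplicate or out-of-order candle: the cursor stays where it is.
            violations.append((index, close_time_ms))
        else:
            cursor = close_time_ms
    return violations


print(find_cursor_violations([60_000, 120_000, 120_000, 90_000, 180_000]))
# [(2, 120000), (3, 90000)]
```

An empty result means every row advances the cursor; any reported index points at an overlapping or misordered fixture row.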
BOT_STRATEGY_LOOP_HALTED
Severity: Warning | Category: Control
Strategy loop halted on a non-evaluation surface: ChannelSink send failed (downstream already dead), lease halt flag tripped mid-candle, persist_state SQL errored on the no-signal path, serialize_state failed, or Postgres NOW() probe failed.
- Causes: downstream consumer dropped; halt flag flipped by listener / drain; transient Postgres error on state persistence.
- Action: loop exits cleanly; process reaper (graceful_shutdown) handles the rest. Manager respawns if desired_state=running.
- Escalate: correlate with BOT_LEASE_LOST / BOT_DRAIN_STEP_FAILED to identify the upstream trigger.
BOT_STRATEGY_EVALUATION_ERROR
Severity: Error | Category: Control
Strategy::evaluate returned a non-panic StrategyError. The loop halts.
- Causes: strategy-side invariant violation (e.g. unsupported timeframe, config drift, state corruption).
- Action: check the error structured field for the StrategyError variant. Fix config or reset trading.bot_strategy_state if corruption is suspected. Manager respawns on desired_state.
- Escalate: repeated eval errors on the same strategy point to a code-level bug; file a story against the strategy crate.
BOT_STRATEGY_PANIC
Severity: Critical | Category: Control
Strategy::evaluate panicked inside the spawn_blocking panic guard. Strategy state is discarded; the loop halts and the process exits so manager can respawn with a clean instance.
- Causes: strategy code bug (e.g. unwrap on a None, division by zero, array out-of-bounds).
- Action: STOP pulling in new strategy-crate versions until the panic source is fixed. Capture the panic payload from the structured log (panic field) and file a critical issue.
- Escalate: paging event. A panicking strategy can quarantine a bot and block its run.
BOT_FENCE_MISMATCH
Severity: Warning | Category: Control
Fence gate inside the emission transaction rejected the intent: bot_leases row missing, holder_bot_id mismatch, or fence advanced externally. Covers live + manual emission paths.
- Causes: lease lost mid-flight (racing with liveness detection); operator ran a manual fence bump; green/blue swap between read and commit.
- Action: self-healing via halt + respawn path. Verify bot_leases state. Check for an in-flight BOT_LEASE_LOST on the same bot.
- Escalate: recurring fence mismatch without a paired BOT_LEASE_LOST indicates an unknown mutator on bot_leases.
BOT_INTENT_EMIT_FAILED
Severity: Error | Category: Control
Atomic intent-emission transaction failed at the SQL layer (fence read + intent insert + state upsert). Covers live and manual emission paths.
- Causes: Postgres unreachable mid-txn; order_intents unique-constraint violation; pool saturation.
- Action: loop halts; manager respawns if desired_state=running. Inspect the underlying error field for the SQL cause.
- Escalate: repeated emit failures block signal production; check Postgres health and pool saturation.
BOT_ON_FILL_FAILED
Severity: Warning | Category: Control
Strategy::on_fill returned an error, a post-on_fill state-serialization or -persistence step failed, or a fill_events payload failed to decode. Listener logs + continues (fill already occurred; SL is exchange-resident).
- Causes: strategy on_fill invariant; serialize-state failure; payload schema drift; DB transient error on upsert_strategy_state.
- Action: not pageable alone; the next evaluate tick re-persists state. For decode failures, confirm executor and bot ship matching gordon-contracts versions.
- Escalate: sustained BOT_ON_FILL_FAILED with the same intent_id means the strategy is silently ignoring a fill; investigate the strategy's on_fill contract.
BOT_ORDER_EVENT_INVALID
Severity: Warning | Category: Control
An order_events row was malformed: payload decode failed or the event tag is not a known variant (submitted / acked / filled_partial / filled_complete / rejected / cancelled). Forward-compat surface — the row is skipped.
- Causes: executor / bot shared-schema version skew; forward-compat additive field that fails older validation.
- Action: check the event field in the structured log against the supported taxonomy. Pin compatible gordon-contracts versions between executor + bot.
- Escalate: if the unknown event is a new executor state transition not yet wired into the bot pending set, schedule a bot upgrade.
DATA codes
DATA_INGEST_GAP_DETECTED
Severity: Warning | Category: Data
A gap in inbound market data exceeds the configured tolerance (e.g. missing 1m candles).
- Causes: Binance WebSocket disconnected and reconnect took > tolerance; Binance API outage; gordon-data container restarted mid-stream.
- Action: check docker compose logs gordon-data --tail=100 for reconnect events. Run make fill-gaps to backfill synthetic candles for the gap window. Verify market_data.spot_klines has no missing 1m bars before resuming live bots.
- Escalate: gaps > 30 min indicate a sustained Binance outage. Monitor status.binance.com and wait for normalisation before running strategies.
DATA_UNKNOWN_SYMBOL
Severity: Warning | Category: Data
symbol query parameter is not in the configured allowlist (GORDON_DATA_SYMBOL_ALLOWLIST).
- Causes: caller requests a symbol not in the default 10-pair allowlist; allowlist not extended after adding a new pair.
- Action: if the symbol is intentionally new, add it to GORDON_DATA_SYMBOL_ALLOWLIST in compose env and re-seed historical data. If the symbol is a typo, fix the caller.
DATA_LIMIT_EXCEEDED
Severity: Warning | Category: Data
limit query parameter exceeds the maximum allowed value (5000 for klines endpoints).
- Causes: client requesting too many candles in one call.
- Action: use pagination (from/to window) or reduce the limit parameter.
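One way a caller might paginate is to split the requested range into windows that each hold at most 5000 candles. The sketch assumes half-open [from, to) windows in epoch-milliseconds and a fixed candle width tf_ms; both are assumptions about the caller's convention, not documented endpoint semantics, and paginate is a hypothetical helper.

```python
MAX_LIMIT = 5000  # klines cap stated in this section


def paginate(from_ms, to_ms, tf_ms, limit=MAX_LIMIT):
    """Split [from_ms, to_ms) into windows of at most `limit` candles each."""
    if not 0 < limit <= MAX_LIMIT:
        raise ValueError("limit must be between 1 and 5000")
    if from_ms >= to_ms:
        raise ValueError("from must be strictly less than to")
    step = limit * tf_ms
    windows = []
    start = from_ms
    while start < to_ms:
        end = min(start + step, to_ms)
        windows.append((start, end))
        start = end
    return windows


# 12,000 one-minute candles need three requests.
print(paginate(0, 12_000 * 60_000, 60_000))
# [(0, 300000000), (300000000, 600000000), (600000000, 720000000)]
```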
DATA_INVALID_TIMEFRAME
Severity: Warning | Category: Data
tf query parameter is not a recognised timeframe string.
- Causes: caller using uppercase (1H), a non-standard string, or a timeframe not in: 1m 5m 15m 30m 1h 2h 4h 6h 8h 12h 1d 1w.
- Action: fix the caller to use lowercase timeframe strings from the above list.
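A caller-side guard for this check might look like the following sketch. The accepted set is copied from the list above; the helper names are hypothetical, and the lowercase suggestion is a convenience for the caller, not something the service is stated to do.

```python
VALID_TIMEFRAMES = frozenset("1m 5m 15m 30m 1h 2h 4h 6h 8h 12h 1d 1w".split())


def is_valid_tf(tf):
    """Exact, case-sensitive membership test."""
    return tf in VALID_TIMEFRAMES


def suggest_tf(tf):
    """Return the lowercase form if that would be accepted, else None."""
    lowered = tf.strip().lower()
    return lowered if lowered in VALID_TIMEFRAMES else None


print(is_valid_tf("1H"))  # False
print(suggest_tf("1H"))   # 1h
print(suggest_tf("3h"))   # None
```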
DATA_INVALID_TIMERANGE
Severity: Warning | Category: Data
from/to window is invalid: from must be strictly less than to (epoch-milliseconds).
- Causes: caller reversed from/to; from equals to; milliseconds vs seconds confusion (off by 1000×).
- Action: verify both values are epoch-milliseconds and from < to. Check for units confusion; Binance timestamps are milliseconds.
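Both checks can be run on the caller side before sending the request. In the sketch below the ordering rule comes from this section, while the seconds-detection cutoff of 10**11 is purely a heuristic assumption (present-day epoch-seconds are around 1.7e9 and epoch-milliseconds around 1.7e12); check_timerange is a hypothetical helper.

```python
SECONDS_CUTOFF = 10**11  # heuristic: smaller positive values look like seconds


def check_timerange(from_ms, to_ms):
    """Return a list of human-readable problems; empty means the window is fine."""
    problems = []
    if from_ms >= to_ms:
        problems.append("from must be strictly less than to")
    for name, value in (("from", from_ms), ("to", to_ms)):
        if 0 < value < SECONDS_CUTOFF:
            problems.append(f"{name} looks like epoch-seconds, not epoch-milliseconds")
    return problems


print(check_timerange(1_700_000_000, 1_700_003_600_000))
# ['from looks like epoch-seconds, not epoch-milliseconds']
```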
DATA_INVALID_KIND
Severity: Warning | Category: Data
An enum-valued query parameter (kind on /long_short_ratio, side on /liquidations) did not match any accepted variant.
- Causes: typo in kind or side value; client code not updated after API change.
- Action: fix the caller. Valid kind values: global | top_account | top_position. Valid side values: BUY | SELL.
DATA_INVALID_REQUEST
Severity: Error | Category: Data
POST /warmup request body failed structural or semantic validation before any repository call. Covers: empty dataset list, dataset-count cap (>12) exceeded, unknown kind tag, required field missing or blank, numeric bounds violated (lookback_bars / lookback_count must be 1–5000; lookback_minutes must be 1–10080).
- Causes: client bug; stale generated client not matching current API shape.
- Action: inspect the message field in the HTTP 400 body for the exact constraint. Regenerate the typed client from the OpenAPI spec if the shape has changed.
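The numeric bounds and the dataset-count cap listed above can be pre-checked by a client. The sketch below assumes the request has been reduced to a list of dataset dicts; that shape, the error wording, and the name validate_warmup are illustrative assumptions, and only the limits themselves come from this section.

```python
MAX_DATASETS = 12
BOUNDS = {
    "lookback_bars": (1, 5000),
    "lookback_count": (1, 5000),
    "lookback_minutes": (1, 10080),
}


def validate_warmup(datasets):
    """Return a list of constraint violations; empty means the bounds hold."""
    errors = []
    if not datasets:
        errors.append("dataset list is empty")
    if len(datasets) > MAX_DATASETS:
        errors.append(f"dataset count {len(datasets)} exceeds cap of {MAX_DATASETS}")
    for index, dataset in enumerate(datasets):
        for field, (low, high) in BOUNDS.items():
            # Only fields that are present are range-checked in this sketch.
            if field in dataset and not low <= dataset[field] <= high:
                errors.append(f"datasets[{index}].{field} must be within {low}..{high}")
    return errors


print(validate_warmup([{"lookback_bars": 5001}]))
# ['datasets[0].lookback_bars must be within 1..5000']
```

The sketch does not cover the unknown kind tag or required-field checks, whose accepted values are not listed here.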
DATA_QUERY_FAILED
Severity: Error | Category: Infra
A repository query failed at runtime — the DB returned an error that is neither a constraint violation nor a connectivity probe failure. Typical causes: query timeout, pool exhaustion, temporary connectivity blip.
- Causes: Postgres overloaded; pool max_connections exhausted by concurrent warmup requests; transient network partition between gordon-data and Postgres.
- Action: check docker compose logs gordon-data --tail=100 for the underlying sqlx error. Check Postgres CPU + connection count (SELECT count(*) FROM pg_stat_activity). The 500 response is returned to the caller; the bot/manager retries the request.
- Escalate: if the rate of DATA_QUERY_FAILED is sustained (>1/min over 5 min), Postgres may be saturated. Check market_data.* index health and query plans.
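The ">1/min over 5 min" threshold can be read in more than one way. The sketch below takes one interpretation as an assumption: more than five occurrences inside a trailing 300-second window. The class name and parameters are hypothetical and this is not the alerting rule actually deployed.

```python
from collections import deque


class SustainedRate:
    """Trailing-window counter for 'more than N per minute over W seconds'."""

    def __init__(self, window_s=300, max_per_min=1.0):
        self.window_s = window_s
        self.threshold = max_per_min * window_s / 60
        self.events = deque()

    def record(self, ts_s):
        """Record an event at ts_s (seconds); return True if the rate is sustained."""
        self.events.append(ts_s)
        # Drop events that fell out of the trailing window.
        while self.events and self.events[0] <= ts_s - self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold


rate = SustainedRate()
print([rate.record(t) for t in (0, 50, 100, 150, 200, 250)])
# [False, False, False, False, False, True]
```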
DATA_DB_PROBE_FAILED
Severity: Error | Category: Infra
The DB connectivity probe failed. Fired on /healthz (returns 503) and on the /warmup 503 path. Indicates gordon-data cannot reach Postgres at all.
- Causes: Postgres container not running; GORDON_DATABASE_URL misconfigured; network partition between gordon-data and Postgres containers.
- Action: run docker compose ps postgres to verify Postgres is running. Check GORDON_DATABASE_URL in the service environment. Check container network: both containers must be on the same compose network.
- Escalate: if Postgres is running and reachable from the host but not from gordon-data, inspect the compose network config. If the disk is full, Postgres may have shut itself down; check df -h on srv-apps.
DATA_SOURCE_NOT_REGISTERED
Severity: Warning | Category: Data
SourceHealthRegistry::record_success was called with a source ID that was never registered at startup. Indicates a code-level bug — a source emits health ticks without having registered itself in the registry.
- Causes: new ingest source added without a register() call in the startup path; source ID mismatch between registration and tick emission.
- Action: this is a programming error in gordon-data, not an operator issue. Open a bug report and fix the registration gap. The missing registration means the source will not appear in /sources/health output.
DATA_ROLE_PROBE_ERROR
Severity: Warning | Category: Infra
The startup DB role-probe encountered an unexpected SQL error — neither 42501 (insufficient_privilege) nor 42P01 (undefined_table). The query got past the privilege check, so the service treats this as excess-privilege for fail-safe behaviour and refuses to start.
- Causes: DB schema drift; role has unexpected privileges; new Postgres error code not handled by the probe; transient connectivity during the probe query.
- Action: check docker compose logs gordon-data --tail=50 for the raw SQL error. Verify the gordon DB role has exactly the privileges in migration 0016 (no INSERT on trading schema; write on market_data). Re-run make db-setup if role permissions have drifted.
DATA_STARTUP_FAILED
Severity: Critical | Category: Infra
Fatal startup failure: configuration load rejected the environment, the serve loop exited with an error, or the backfill-report DB pool could not be opened. Process exits non-zero; orchestrator restarts or pages.
- Causes: malformed GORDON_DATA_* env vars; missing DATABASE_URL; Binance URL unreachable at startup; read-only key probe discovered can_withdraw=true.
- Action: run docker compose logs gordon-data --tail=100 and look for the preceding error = ... context on the "failed to load configuration" / "server exited with error" line. Fix the config or credential and restart.
- Escalate: if the service crash-loops for >5 min, page on-call.
DATA_SHUTDOWN_ERROR
Severity: Warning | Category: Infra
A background task (scheduler, ingest driver, subscriber, serve loop) returned an error or panicked while the shutdown coordinator was awaiting drain. Process is exiting; drain still proceeds — this is an observability signal, not a safety event.
- Causes: race at shutdown where a source handle panicked mid-tick; axum serve returned after the broadcast fired; drain budget exceeded.
- Action: informational; look at the task = ... / error = ... fields to see which task slipped. Open a follow-up if the same task fails repeatedly across restarts.
DATA_INTERNAL_ERROR
Severity: Error | Category: Infra
Internal invariant violation: OpenAPI render failed, or a file-/stdout-write of the rendered spec failed on the openapi export subcommand path. Should never fire in practice — it signals a code-level bug (utoipa rejected a derived schema, or stdout was redirected to a read-only target).
- Causes: a schema derivation broke after a rebase; openapi export - > file ran with file owned by another user.
- Action: run cargo run --bin gordon-data -- openapi export - locally to reproduce the render error. If it fails, the OpenAPI schema closure is broken; fix the utoipa::ToSchema derive that regressed.
DATA_BACKFILL_CLI_INVALID
Severity: Error | Category: Data
Backfill CLI rejected command-line arguments before any DB work: invalid date range (from >= to), malformed YYYY-MM-DD date, unknown period token, or no symbols supplied. Process exits with code 2 (usage error).
- Causes: operator typo; wrong flag order; missing --symbols on a source that requires it.
- Action: inspect the raw = ... / error = ... fields to see which arg was rejected, then re-run with the correction. See gordon-data backfill --help.
DATA_BACKFILL_FAILED
Severity: Error | Category: Data
A backfill job failed at runtime: driver task panicked, cancel-awaited task panicked, finalise reported BackfillRunError (non-cancel), or the CLI subcommand exited non-zero.
- Causes: upstream API outage mid-run; DB insert failed; cursor advancement logic hit an unreachable branch.
- Action: check GET /backfill/jobs/:id for the error field; it carries the underlying BackfillRunError message. Re-run the job after the upstream is healthy.
- Escalate: sustained backfill failures (>3 consecutive) indicate an upstream contract regression or a provider rate-limit change.
DATA_BACKFILL_CONFLICT
Severity: Warning | Category: Data
POST /backfill/<source> rejected because a running job already holds the (source, symbol_key) conflict key. Response is 409 with existing_job_id so the caller can poll the in-flight job.
- Causes: operator double-clicked "Run backfill" in the console; automation retried without awaiting completion.
- Action: poll GET /backfill/jobs/:existing_job_id until terminal, then re-submit if needed.
DATA_JOB_NOT_FOUND
Severity: Warning | Category: Data
DELETE /backfill/jobs/:id or GET /backfill/jobs/:id was given a job id that is not in the in-memory registry. Returns 404.
- Causes: stale UUID from a bookmark; entry aged out of the terminal-retention cap (in-memory-only; no DB persistence).
- Action: list current jobs at GET /backfill/jobs; the caller should refresh its job id.
DATA_BACKFILL_CURSOR_STUCK
Severity: Warning | Category: Data
A backfill source (spot_klines / perp_klines / funding_rates / open_interest) detected a cursor that did not advance after a page fetch. Driver breaks out of the per-symbol loop to prevent an infinite spin.
- Causes: upstream provider returned a page whose last row shares the cursor timestamp (boundary condition); provider pagination regression.
- Action: inspect symbol = ... cursor = ... next_cursor = ... in the log. If next_cursor == cursor, the provider bug is confirmed; re-run with a narrowed window avoiding the stuck boundary. A terminal DATA_BACKFILL_FAILED follows if the symbol yields zero rows.
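The stuck-cursor guard can be illustrated with a short pagination loop. This is a sketch under assumptions: the row shape, the fetch_page callable, and the rule that the next cursor is the last row's timestamp are hypothetical stand-ins for the real driver.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, float]  # (timestamp_ms, value): hypothetical row shape


def drain_symbol(
    fetch_page: Callable[[int], List[Row]],
    start_cursor: int,
    end_ts: int,
) -> Tuple[List[Row], bool]:
    """Page through one symbol; return (rows, stuck).

    stuck is True when a page fails to advance the cursor, which is the
    condition this code reports.
    """
    rows: List[Row] = []
    cursor = start_cursor
    while cursor < end_ts:
        page = fetch_page(cursor)
        if not page:
            break  # provider has no more data for this window
        next_cursor = page[-1][0]  # assumed: cursor is the last row's timestamp
        if next_cursor <= cursor:
            # Cursor did not advance: break out instead of spinning forever.
            return rows, True
        rows.extend(page)
        cursor = next_cursor
    return rows, False
```

If the very first page is stuck, the function returns zero rows, which matches the case where a terminal DATA_BACKFILL_FAILED follows.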
DATA_INGEST_WS_CLOSED
Severity: Warning | Category: Infra
A WebSocket or IPC subscriber stream ended unexpectedly. The combined ingest receiver was closed by the gordon-exchange reconnect loop, or a PostgresSubscriber stream yielded None (upstream dropped). The ingest / subscriber task exits cleanly.
- Causes: gordon-exchange's internal reconnect loop gave up; Postgres LISTEN connection dropped; source commands subscriber was cancelled.
- Action: informational by itself. Look for preceding DATA_SOURCE_FETCH_FAILED entries. If the ingest does not resume (watch /sources/health), restart the service.
DATA_INGEST_FRAME_DROPPED
Severity: Warning | Category: Data
A WebSocket frame was unusable: missing symbol, invalid symbol / timeframe format, trade_time overflowed i64, or a malformed DataEvent envelope was decoded. The frame is dropped; the driver continues.
- Causes: upstream API emitted a protocol variant new to this build; connection bit-flip on a low-quality network path.
- Action: look at the symbol = ... error = ... context. If the same malformed shape repeats across many frames from the same symbol, the upstream contract has shifted; update the parser.
DATA_INGEST_WRITE_FAILED
Severity: Error | Category: Infra
A persist path on the WS ingest failed: klines upsert_one_into or the liquidations bulk insert returned a sqlx::Error. The frame is dropped (not retried); if upstream re-emits the row the next frame replays.
- Causes: Postgres connection pool exhausted; partition add-new failed; constraint violation (data-shape mismatch).
- Action: check Postgres health; inspect the error = ... field. For liquidations, a dropped bulk insert loses history, so alert on sustained rate.
- Escalate: sustained write failure (>1/min over 5 min) → page on-call.
DATA_SOURCE_WRITE_FAILED
Severity: Error | Category: Infra
A scheduler source row-write failed at the DB layer. Covers upstream sources (binance_funding, binance_open_interest, binance_long_short_ratio, alternative_fear_greed, defillama_ssr, fred_macro) and derived sources (derived_metrics, derived_vpin, klines_common). Scheduler retry loop applies on the next tick; sustained failure trips the quarantine threshold.
- Causes: Postgres pool exhausted; constraint mismatch; partition missing for the target timestamp.
- Action: correlate with DATA_INGEST_WRITE_FAILED. If both fire, the DB is unreachable; triage at the DB layer first.
DATA_SOURCE_FETCH_FAILED
Severity: Warning | Category: Infra
A scheduler source fetch failed at the upstream API or during parsing: network error, HTTP non-2xx, response-body parse failure (timestamp, numeric, enum), or a staleness probe hit an error. Retryable; the scheduler's retry-budget + quarantine machinery owns escalation.
- Causes: upstream rate limit; provider outage; upstream schema change.
- Action: watch for escalation to DATA_SOURCE_QUARANTINED. If the source is Deribit GEX or FRED, a single failure is noisy but not actionable; escalate only on sustained rate.
DATA_SCHEDULER_PANIC
Severity: Error | Category: Infra
The scheduler received SourceError::Panic(msg) on a source fetch — a source invariant was violated inside the fetcher. Distinct from a tokio task panic (which would propagate out of the driver).
- Causes: a .unwrap() on a response envelope; a derive regression; an out-of-bounds vector index after an API shape change.
- Action: the panicking source is named in source = .... Investigate the invariant. The source is NOT quarantined automatically on panic; the operator must decide whether to quarantine manually while the fix ships.
DATA_SOURCE_QUARANTINED
Severity: Warning | Category: Safety
The scheduler flipped a source into the quarantined state after a non-retryable failure or streak-exhaustion. The source is taken out of rotation until an operator issues SourceCommand::Unquarantine via the manager BFF.
- Causes: preceding DATA_SOURCE_FETCH_FAILED or DATA_SOURCE_WRITE_FAILED streak.
- Action: fix the root cause (upstream or DB); issue an Unquarantine command to resume. Use POST /sources/:name/unquarantine (operator-token-protected).
DATA_IPC_PUBLISH_FAILED
Severity: Warning | Category: Infra
Best-effort DataEvent publish to the trading.data_events Postgres channel failed: KlineWritten, SourceQuarantined, SourceUnquarantined, SourceFailure, MacroWritten, GexSnapshot. The authoritative state lives in market_data.* — audit-bus absence does not roll back the write. Not pageable alone.
- Causes: Postgres NOTIFY payload size exceeded; connection briefly lost.
- Action: informational. Dashboards should alert on sustained miss rate.
DATA_SOURCE_REGISTRATION_FAILED
Severity: Warning | Category: Infra
A scheduler source builder returned an error at startup: HTTP client init failed, or a required constructor argument was rejected. The source is not registered; the rest of the scheduler still starts. The read-only Binance key-probe failure surfaces here too — gordon-data continues with public endpoints only.
- Causes: malformed env var for a specific source; missing API key for FRED / Binance; startup probe timeout.
- Action: look at the role = ... error = ... pair. Fix the env or endpoint and restart. Until then, the affected source does not emit rows and /sources/health reports it as stale.
DATA_REPORT_COMPUTATION_FAILED
Severity: Error | Category: Infra
A /backfill/<source>/report handler failed at the DB layer: sqlx::Error on the coverage-count query, or the underlying report-computation runner returned an error. Returns 500.
- Causes: same classes as DATA_QUERY_FAILED (pool exhausted, timeout, connectivity blip).
- Action: correlate with DATA_QUERY_FAILED. Caller retries once DB health is restored.
DATA_SUBSCRIBER_FAILED
Severity: Warning | Category: Control
source_commands subscriber surface — one code covers start failure, stream-end, cursor commit_offset failure, unknown / malformed variant, and Unquarantine targeting an unknown source. gordon-data exposes a single SourceCommand::Unquarantine variant today so the taxonomy collapses what gordon-bot splits across three codes.
- Causes: manager or operator sent an unknown SourceCommand variant (forward-compat); subscriber channel dropped; LISTEN cursor commit lost a race with a pool blip.
- Action: informational. If a specific unknown variant shows up repeatedly, the manager is newer than this data build; roll the data service.
MANAGER codes
MANAGER_RECONCILER_DRIFT
Severity: Error | Category: Control
Desired-state reconciler found drift between bot_configs and the running container set that it could not resolve.
- Causes: docker-socket-proxy returned unexpected state; container was manually killed outside manager; bot_configs.desired_state and actual container state diverged beyond the max-retry window.
- Action: check trading.bot_configs for the affected id. Compare desired_state vs status. Inspect reconciler logs for the specific drift. If the container is a zombie, prune it manually, then let the reconciler recover.
- Escalate: reconciler drift that persists > 5 minutes indicates the reconciler is stuck in backoff; check trading.reconciler_state for the affected bot.
MANAGER_UNAUTHORIZED
Severity: Warning | Category: Control
X-Operator-Token header absent, wrong, or not configured on the manager service.
- Causes: GORDON_MANAGER_OPERATOR_TOKEN not set (returns 503); token missing or wrong on request (returns 401; the two cases are indistinguishable by design to prevent a config side-channel).
- Action: verify GORDON_MANAGER_OPERATOR_TOKEN in docker-compose.yml. If the service is returning 503 on all protected routes, the token was not set at startup.
MANAGER_INVALID_IMAGE_TAG
Severity: Error | Category: Control
image_tag on POST /bots or target_image_tag on POST /bots/:id/promote does not match the Docker tag format ^[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}$.
- Causes: tag contains slashes, spaces, or special characters; tag is empty; tag is > 128 characters.
- Action: fix the tag to match the format. Valid examples: sha-a1b2c3d, v1.2.3, latest.
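A client can pre-check a tag against the same pattern before calling the manager. The sketch below reuses the regex quoted above; the function name is illustrative, and fullmatch is used so that a trailing newline cannot slip past the end anchor.

```python
import re

# Pattern from this section, without the explicit ^...$ anchors because
# fullmatch already requires the whole string to match.
IMAGE_TAG_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9._-]{0,127}")


def is_valid_image_tag(tag: str) -> bool:
    """Return True when tag is 1 to 128 characters of the allowed alphabet."""
    return IMAGE_TAG_RE.fullmatch(tag) is not None
```

Tags with slashes or spaces, an empty tag, and a 129-character tag all return False, which covers the causes listed above.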
MANAGER_INVALID_CURSOR
Severity: Error | Category: Control
Pagination cursor failed HMAC verification, base64 decoding, or structural validation.
- Causes: cursor was tampered with; GORDON_MANAGER_OPERATOR_TOKEN was rotated between when the cursor was issued and when it was presented; cursor from a different environment (staging vs prod).
- Action: discard the cursor and restart pagination from the first page. If this fires after a token rotation, all cached cursors are invalid, which is expected behaviour.
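Why a token rotation invalidates every cached cursor is easier to see in code. The sketch below is an assumed wire format (32-byte HMAC-SHA256 tag followed by a JSON body, URL-safe base64 encoded), not the manager's actual encoding; it only shows the three failure classes named above.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def encode_cursor(payload: dict, key: bytes) -> str:
    """Serialise payload and prepend an HMAC-SHA256 tag (hypothetical format)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    mac = hmac.new(key, body, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(mac + body).decode()


def decode_cursor(token: str, key: bytes) -> Optional[dict]:
    """Return the payload, or None on any decode, MAC, or structure failure."""
    try:
        raw = base64.urlsafe_b64decode(token.encode())
    except ValueError:  # binascii.Error is a ValueError subclass
        return None
    if len(raw) < 32:
        return None
    mac, body = raw[:32], raw[32:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None  # tampered, or signed under a different key
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    return payload if isinstance(payload, dict) else None
```

A cursor issued under the old key fails the MAC comparison under the new one, so the caller sees the same rejection as for a tampered cursor and must restart from the first page.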
MANAGER_INVALID_STATE_TRANSITION
Severity: Error | Category: Control
Lifecycle action violates the bot state machine.
- Causes: attempting to start an already-running bot; pausing a stopped bot; using PATCH /bots/:id to set desired_state directly instead of the lifecycle endpoints.
- Action: use the dedicated lifecycle endpoints: /start, /pause, /resume, /stop. Valid transitions: stopped → running (start); running → paused (pause); paused → running (resume); running|paused → stopped (stop).
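The valid transitions listed above fit in a lookup table. This is a minimal sketch for client-side pre-checks; the function name and the error text are illustrative, and the manager remains the authority on the state machine.

```python
# (current_state, action) -> next_state, taken from the transitions listed above.
VALID_TRANSITIONS = {
    ("stopped", "start"): "running",
    ("running", "pause"): "paused",
    ("paused", "resume"): "running",
    ("running", "stop"): "stopped",
    ("paused", "stop"): "stopped",
}


def apply_action(state: str, action: str) -> str:
    """Return the next state, or raise ValueError for an invalid transition."""
    try:
        return VALID_TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"MANAGER_INVALID_STATE_TRANSITION: cannot {action} a {state} bot"
        ) from None
```

Both example causes above, starting a running bot and pausing a stopped bot, are absent from the table and therefore raise.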
MANAGER_BODY_TOO_LARGE
Severity: Error | Category: Control
Request body exceeds the 1 MiB limit.
- Causes: oversized bot config payload; accidental binary data sent to a JSON endpoint.
- Action: reduce payload size. Bot configs should be < 1 KB in practice — anything approaching 1 MiB is almost certainly a client bug.
MANAGER_INVALID_REQUEST
Severity: Error | Category: Control
Catch-all 4xx on the manager BFF read surface (/runs, /runs/:id, /runs/:id/roundtrips, /runs/:id/equity, /bots/:id/equity). Shape determined by HTTP status:
- 400: query-param value rejected (kind not in live|paper|backtest; resolution not in 1m|1h|1d). Body carries a field pointer at the offending parameter.
- 404: targeted resource (run by id, bot by id) does not exist. Body carries a field pointer at id.
- Causes: console or API caller sent an unsupported enum value; UUID refers to a deleted / never-existed row; race with a concurrent delete.
- Action: caller fixes the input. Runs are never hard-deleted in normal operation, so a 404 on a UUID the caller held references usually indicates a stale cache on the caller side (console page that predates a cleanup, or a cursor that crossed a retention cutoff). Not pageable; dashboards should track sustained rate as a UX-health signal, not an infrastructure one.
Distinct from the typed invalid-* variants (MANAGER_INVALID_IMAGE_TAG, MANAGER_INVALID_CURSOR, MANAGER_INVALID_STATE_TRANSITION) which each have dedicated shape + remediation; those fire on write-path validation. This variant is for the read-path (BFF) where the input space is a small fixed enum or a UUID lookup.
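The 400 branch of this code amounts to an enum check that names the offending parameter. The sketch below uses the allowed values quoted in this section; the error-body shape (status, code, field keys) is an assumption for illustration, not the BFF's documented schema.

```python
from typing import Dict, Optional

# Allowed values as listed in this section.
ALLOWED_VALUES = {
    "kind": {"live", "paper", "backtest"},
    "resolution": {"1m", "1h", "1d"},
}


def validate_query(params: Dict[str, str]) -> Optional[dict]:
    """Return an error body for the first rejected parameter, else None."""
    for field, allowed in ALLOWED_VALUES.items():
        value = params.get(field)
        if value is not None and value not in allowed:
            # Hypothetical body shape: the field key points at the bad parameter.
            return {"status": 400, "code": "MANAGER_INVALID_REQUEST", "field": field}
    return None
```

A console could run the same check before sending the request and highlight the field named in the result.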
MANAGER_INVALID_STRATEGY
Severity: Error | Category: Control
POST /runs body carries a strategy_name that is not registered in the StrategyRegistry. Returned as HTTP 400.
- Causes: console sent a strategy name that does not match any entry in the server-side registry (typo, stale dropdown, or the server was redeployed with a different build that removed the strategy).
- Action: caller should fetch GET /strategies to get the current registered list and surface it to the user. If the strategy name is correct, the manager binary may be out of date; check the deployed image tag.
Distinct from MANAGER_INVALID_REQUEST which is the catch-all for other 4xx validation failures. This variant is dedicated so the console can surface the registered strategy list in its error UI rather than a generic "bad request" message.
MANAGER_STARTUP_FAILED
Severity: Critical | Category: Infra
Manager service failed to start; process exits non-zero.
- Causes: Config::from_env rejected an invalid GORDON_MANAGER_* env var; Tokio multi-thread runtime build failed; openapi export pure-render CLI errored; the HTTP serve loop exited with an error before reaching steady state.
- Action: inspect the structured error attached to the event. Most often a missing or malformed env var (URL, port, duration); fix and restart. If the runtime build failed, check container resource limits (nproc/memory).
- Escalate: if restart loops > 3 within 5 minutes, pause the deploy and file a postmortem — a Critical startup failure must not be silently tolerated.
MANAGER_SHUTDOWN_ERROR
Severity: Warning | Category: Infra
Non-fatal shutdown-path error.
- Causes: SIGTERM handler install failed (Unix only); ctrl_c listener errored; signal task join failed; reconciler drain-await returned an error.
- Action: container replacement still proceeds; this is an observability signal, not a safety event. Check for repeated occurrences across restarts; a flapping shutdown-error pattern usually indicates a stuck reconciler task.
MANAGER_BOOT_DEGRADED
Severity: Warning | Category: Control
Manager booted in a degraded mode but is serving.
- Causes: GORDON_MANAGER_RECONCILE_INTERVAL_MS set below the safety floor and clamped up; an optional dependency was missing at startup without failing the process.
- Action: audit the config vs the floor printed in the log context. If the clamp is intentional (e.g. tuning), silence the warn via config alignment; if not, fix the config and restart.
MANAGER_DB_TRANSIENT
Severity: Warning | Category: Infra
Transient DB error on a stateless HTTP handler read path.
- Causes: Postgres query timeout, pool exhaustion, connection reset; no application-logic fault on the handler side.
- Action: returns 500 to the caller; client retries. Confirm Postgres health (pg_stat_activity, pg_stat_replication) if the rate is sustained.
- Escalate: a steady rate > 1/s across handlers is a DB incident; escalate to DATA_DB_PROBE_FAILED / Postgres runbook.
MANAGER_RECONCILER_TICK_FAILED
Severity: Warning | Category: Control
Reconciler tick hit a self-healing error and skipped work; next tick retries.
- Causes: load_live_configs failed (DB blip); list_bot_containers failed (docker-socket-proxy blip); file-SD write failed (tmp dir eviction); advisory-lock acquire failed (contention); record_success / record_failure / quarantine UPDATE failed (DB blip).
- Action: single tick is self-healing; no operator action unless sustained. The noise-floor test asserts zero WARN on an idle reconciler; a flapping signal here means infra instability, not application logic.
- Escalate: sustained tick-failure rate rolls up to BOT_QUARANTINED via on_reconcile_error after the per-bot failure threshold.
MANAGER_DEPLOY_STEP_FAILED
Severity: Warning | Category: Control
A single deploy-tick step failed; state machine retries next tick.
- Causes: tick_one state-machine step errored for a specific deploy; stop/remove blue on complete_deploy failed (manager must not block completion); stop/remove green on abort_deploy failed.
- Action: individual container cleanup fallout is expected to be resolved by the reconciler on the next pass. If a blue container is stuck after a complete, the operator can docker rm -f it manually.
- Escalate: on repeated step failures for the same deploy_id, inspect trading.bot_deploys for the row and review the green/blue state.
MANAGER_DEPLOY_INITIATION_FAILED
Severity: Error | Category: Control
Deploy initiation (kickoff) failed — green/blue flow never started.
- Causes: manager could not acquire the shadow lease; bot_deploys insert failed (duplicate in-flight row, FK violation); docker-socket-proxy rejected the green spawn.
- Action: for reconciler-initiated kickoffs (auto_deploy=true), on_reconcile_error records + re-attempts. For operator-initiated kickoffs (POST /bots/:id/promote), the HTTP caller sees 500 and can retry after inspecting the structured error.
- Escalate: if auto_deploy reconciler-triggered kickoffs fail repeatedly, the bot drifts into quarantine; clear quarantine and investigate the underlying cause.
MANAGER_IPC_PUBLISH_FAILED
Severity: Warning | Category: Infra
Fire-and-forget IPC publish failed; DB row is authoritative.
- Causes: ipc_notify_trigger publisher errored on a best-effort path (BotCommand, reconciler event, deploy event, quarantine-cleared, manual deploy-requested).
- Action: none. The DB write already committed, and the reconciler / next listener tick converges state regardless. The loss is an audit row in trading.bot_events.
- Escalate: sustained publish failure means the notify channel or the publisher is stuck; inspect SELECT * FROM pg_listening_channels() on the listener side.
MANAGER_INTERNAL_ERROR
Severity: Error | Category: Infra
Internal invariant violation; should never fire in practice.
- Causes: OpenAPI spec render failed (serde_json::to_string(&*doc) errored on a malformed utoipa::openapi::OpenApi); stdout().write_all failed on backtest summary (closed stdout); role-probe excess-privilege (the startup probe got past the privilege check; fail-safe treats this as internal error).
- Action: inspect the structured error. OpenAPI render failures indicate a schema regression; regenerate + retest the spec. Role-probe excess-privilege indicates a privilege-drift on gordon_manager; audit the role grants.
- Escalate: file a postmortem. None of these should fire; when one does, it points at a latent bug or a privilege drift.
MANAGER_BACKTEST_FAILED
Severity: Error | Category: Control
Backtest subcommand failed; process exits ExitCode::FAILURE.
- Causes: DB pool refused connection (GORDON_DATABASE_URL wrong or DB down); engine returned BacktestError (unknown strategy, invalid params, kline read failed); run row insert failed at sqlx layer.
- Action: inspect the structured error for the surface. DB-down → start Postgres. Strategy/param error → fix the CLI invocation. Kline read failed → run make seed to populate market_data.spot_klines for the window.
MANAGER_BACKTEST_ABORTED
Severity: Warning | Category: Control
Backtest aborted cleanly — not a failure of the engine.
- Causes: the requested window had no klines in the configured symbol/timeframe; operator hit Ctrl+C (SIGINT) before the engine completed.
- Action: for "no klines", verify market_data.spot_klines coverage with a SELECT MIN(ts), MAX(ts) FROM klines WHERE symbol=... AND timeframe=... query. For SIGINT, no action; the trading.runs row is left with completed_at IS NULL so operators can see it was aborted mid-flight.
MANAGER_UPSTREAM_UNAVAILABLE
Severity: Warning | Category: Infra
BFF pass-through to an upstream service failed.
- Causes: gordon-data unreachable (reqwest::Error, connect refused); gordon-data returned non-2xx on /sources/health; response body failed JSON parsing.
- Action: manager returns 502 (parse / non-2xx) or 503 (unreachable) to the caller; client retries. Inspect gordon-data's own logs + /healthz to confirm the upstream state.
- Escalate: sustained failures on /data/status indicate gordon-data is down or stuck; escalate to the DATA_* runbook for that service.
MANAGER_SOURCE_HEALTH_SUBSCRIBER_START_FAILED
Severity: Warning | Category: Infra
The data_events subscriber in gordon-manager failed to start. Source-health state will not be updated until the subscriber recovers (process restart or reconnect). The GET /source-health endpoint returns stale / empty state in this degraded mode.
- Causes: Postgres connectivity blip at startup; PgListener::connect returned a sqlx::Error; advisory-channel registration failed.
- Action: confirm Postgres health via the manager /healthz probe; the subscriber's outer supervisor restarts the task on the next reconcile tick. No operator intervention required for transient failures.
- Escalate: if /source-health stays empty across multiple manager restarts, the data_events channel name or the trading.data_events table is misconfigured; inspect the migration history + the channel-name constant in gordon-manager.
MANAGER_SOURCE_HEALTH_SUBSCRIBER_COMMIT_FAILED
Severity: Warning | Category: Infra
commit_offset failed for a data_events row. The row will replay on the next reconnect — idempotent fold is safe (state is advance-only).
- Causes: Postgres connectivity blip during commit; offset-table write hit a pool blip; transaction was aborted by a concurrent operation.
- Action: self-healing — the next NOTIFY tick replays the row and the fold re-applies idempotently. Not pageable in isolation.
- Escalate: sustained miss rate (> 1% over 10 min) indicates a persistent commit path bug; inspect the manager data_events consumer logs for the underlying sqlx::Error shape.
MANAGER_SOURCE_HEALTH_EVENT_INVALID
Severity: Warning | Category: Infra
A data_events envelope payload could not be decoded as DataEvent. The row is marked consumed (schema-tolerance path) and skipped.
- Causes: gordon-data emitted a DataEvent variant unknown to gordon-manager (version skew between services); a malformed row was inserted into trading.data_events by a non-canonical writer.
- Action: check gordon-data's version against manager; schema-tolerance is intentional so a newer producer never breaks an older consumer. The skipped row is a missed source-health update, not a correctness issue.
- Escalate: if multiple rows are skipped in succession, gordon-data is emitting a variant manager doesn't recognise yet — coordinate the version bump.
SHARED codes
SHARED_DB_CONSTRAINT_VIOLATION
Severity: Error | Category: Infra
A database write violated a uniqueness or foreign-key constraint.
- Causes: duplicate insert on a uniqueness constraint (usually idempotency bug); foreign-key violation (referencing a deleted parent row); stale in-memory state diverged from DB.
- Action: check the structured error for the table and constraint name fields. For duplicate-key violations, verify the caller is correctly checking for existing rows before insert. For FK violations, verify the parent row exists.
- Escalate: if this fires at high frequency from the same service, there is a systematic idempotency gap; file a bug and review the write path.
SHARED_STRATEGY_CONFIG_PARSE_FAILED
Severity: Warning | Category: Infra
Overlay config on bot_configs.strategy_params.overlay failed to deserialize into gordon_strategy::overlays::OverlayConfig. Emitter: extract_overlay_config helper used by gordon-bot's strategy loop (r-02a.1) and gordon-manager's backtest runner (r-02a.2).
- Causes: operator-edited JSON with wrong shape (typo'd field name, bad type, unexpected nesting); migration drift if OverlayConfig gains a required field with no #[serde(default)]; manual DB edit bypassing BFF validation.
- Behavior: overlays fail open. The helper returns OverlayConfig::default() (all overlays disabled). Bot/backtest continues without the overlay veto layer; strategy emits intents as if overlays were off.
- Action: query SELECT id, strategy_params->'overlay' FROM trading.bot_configs WHERE id = <bot_id>. Validate the JSON against the OverlayConfig struct (see gordon-strategy/src/overlays/mod.rs). Fix via manager BFF PATCH /bots/:id with a valid overlay config. No restart required; the next candle tick re-extracts.
- Escalate: if this fires across multiple bots simultaneously after a gordon-strategy release, the OverlayConfig shape likely changed; check the release notes + add #[serde(default)] to any new field that should be backward-compatible.
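The fail-open behaviour is worth seeing in miniature, because it means a bad overlay config silently removes the veto layer rather than stopping the bot. The sketch below is a Python analogue, not the Rust helper: the two overlay field names are hypothetical, and rejecting unknown keys is an assumption chosen to mirror the typo'd-field cause.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class OverlayConfig:
    # Hypothetical overlay switches; the real struct's fields are not listed here.
    regime_filter: bool = False
    funding_veto: bool = False


def extract_overlay_config(strategy_params: Dict[str, Any]) -> OverlayConfig:
    """Parse strategy_params['overlay']; fall back to defaults on any bad shape."""
    raw = strategy_params.get("overlay")
    if raw is None:
        return OverlayConfig()  # assumed: no overlay key means plain defaults
    known = {f.name for f in fields(OverlayConfig)}
    shape_ok = (
        isinstance(raw, dict)
        and set(raw) <= known
        and all(isinstance(v, bool) for v in raw.values())
    )
    if not shape_ok:
        # The real helper would log SHARED_STRATEGY_CONFIG_PARSE_FAILED here.
        return OverlayConfig()  # fail open: every overlay disabled
    return OverlayConfig(**raw)
```

A typo such as regime_fliter yields the all-disabled default, so an operator who sees this code should assume the bot is currently trading without its overlays.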
BUS codes
Emitted by the leader-elected outbox drain in gordon-bus::nats::outbox_publisher. Added at DP-06 (backbone-audit 2026-05-16) so every drain-loop warn-level log line carries a stable code + clickable URL — operators chasing a 3 AM drain stall do not have to read source.
BUS_OUTBOX_ADVISORY_LOCK_RELEASE_FAILED
Severity: Warning | Category: Infra
The leader-elected outbox drain failed to explicitly release the Postgres advisory lock (OUTBOX_PUBLISHER_LOCK_ID = 0x0B05_0010_2026_0508) on its way out (cancel, error, or graceful exit).
- Causes: pg connection dropped before pg_advisory_unlock could run; transient pg error on the release statement; pool-side bug holding the connection beyond the function scope.
- Behavior: the lock is also released automatically when the holder's pg connection closes (session-scoped semantics), so this is a degraded-cleanup warning, not a stuck-leader bug. Another instance will pick up the lock after LOCK_RETRY_INTERVAL (30 s) at worst.
- Action: monitor frequency. Single occurrences during pod cycling are expected. A sustained pattern (more than one per pod-restart) indicates a pool connection lifetime issue; review PgPool config and any code path that may be holding PoolConnection references.
- Escalate: if observed during steady-state (no deploy, no cancel), file a bug; the lock-release path is supposed to be infallible on a healthy connection.
BUS_OUTBOX_DRAIN_LOOP_EXITED
Severity: Warning | Category: Infra
The outbox drain loop returned an error (sustained NATS failure beyond FAILURE_BUDGET = 5 min, pg query failure, listener fatal). Lock is released; another instance has the chance to take over after LOCK_RETRY_INTERVAL.
- Causes: NATS broker unreachable beyond the 5-minute budget; Postgres query failure (timeout, pool exhaustion); listener channel hard error.
- Behavior: messages remain in bus.outbox with published_to_nats = FALSE. Another drain instance picks them up on next leader acquisition.
- Action: check gordon-bus consumer-lag and outbox-backlog gauges. Verify broker reachability (async_nats::client.events()). Confirm another drain instance is running and has acquired the lock (Loki: code=BUS_OUTBOX_DRAIN_LOOP_EXITED + matching acquired advisory lock info on a different host).
- Escalate: if no other drain instance picks up the lock within ~5 minutes, every producer's INSERT into bus.outbox accumulates without forward delivery; declare a partial outage of every downstream NATS consumer.
BUS_OUTBOX_LISTENER_RECV_ERROR
Severity: Warning | Category: Infra
The bus_outbox_appended LISTEN channel recv() returned an error.
- Causes: transient pg socket flap (connection blip, network partition, pg restart); PgListener internal reconnect machinery surfaced an in-progress reconnect as a recv error.
- Behavior: sqlx::postgres::PgListener auto-reconnects internally. The drain loop treats the error as a wakeup and re-polls; no message is lost (re-poll picks up any rows that arrived during the gap).
- Action: monitor frequency. Single occurrences during pg cycling are expected. Persistent occurrences indicate flaky pg connectivity; check pg logs and network metrics.
- Escalate: if the rate stays above ~1/min for more than 10 minutes, treat as a pg connectivity incident; drain throughput degrades to the IDLE_POLL_INTERVAL (1 s) fallback.
BUS_OUTBOX_NATS_PUBLISH_FAILED
Severity: Warning | Category: Infra
A single outbox row failed to publish to NATS. The drain loop applies exponential backoff (1 s → 30 s cap) and re-attempts the same row on the next pass.
- Causes: NATS broker not reachable (network blip, broker restart); JetStream stream not configured for the subject; broker-side rate limit or quota; oversized payload (caught upstream by the bus_outbox_payload_size_cap CHECK, but a misconfigured broker stream limit could also reject).
- Behavior: the row stays in bus.outbox with published_to_nats = FALSE. Backoff applies until either the publish succeeds or FAILURE_BUDGET (5 min) elapses, at which point the drain exits with BUS_OUTBOX_DRAIN_LOOP_EXITED.
- Action: check the error context for the underlying NATS error. Verify broker reachability and JetStream stream config for the failing subject. For rate-limited rejections, scale broker limits or shed producer load.
- Escalate: if every retry attempt fails for the same row across multiple drain instances, the row is structurally undeliverable; file a bug. The producer likely emitted an invalid subject or oversized payload that slipped past the INSERT-time CHECK.
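To get a feel for how many retries fit inside the failure budget, the backoff can be tabulated. This sketch assumes a doubling factor, which the section does not state (it only says exponential, 1 s to a 30 s cap), and it counts sleep time only, ignoring the time each publish attempt takes.

```python
from typing import List


def backoff_schedule(
    initial_s: float = 1.0,
    cap_s: float = 30.0,
    budget_s: float = 300.0,
    factor: float = 2.0,  # assumed growth factor
) -> List[float]:
    """List the sleep intervals that fit inside the failure budget."""
    delays: List[float] = []
    elapsed = 0.0
    delay = initial_s
    while elapsed + delay <= budget_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap_s)
    return delays
```

Under these assumptions the schedule reaches the 30 s cap on the sixth retry and allows 13 sleeps in total before the 5-minute budget is exhausted, so a row that is still failing at that point has been attempted roughly a dozen times.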
STRATEGY codes
Library-only warnings emitted from gordon-strategy math helpers — called from both gordon-bot (live) and gordon-manager (backtest). Added at the DP-06 raw-tracing cleanup follow-up (2026-05-17). Strategy warnings surface input-shape misconfigurations rather than runtime failures; the math returns a "metric unavailable" sentinel and the caller proceeds.
STRATEGY_DEFLATED_SHARPE_NUMERICAL_ISSUE
Severity: Warning | Category: Data
The deflated-Sharpe / PSR routine refused to compute because the sample size fell below MIN_N_OBS.
- Cause: caller passed n_obs < MIN_N_OBS. The PSR formula is numerically unstable below this bound (skew / kurtosis sample estimators are too noisy).
- Behavior: function returns None; caller treats as "metric unavailable" and either skips it in the summary or surfaces a downstream "insufficient sample" message.
- Action: verify the research pipeline is feeding a window long enough to clear the minimum. Walk-forward configs should size their evaluation slices accordingly. If the bound itself is wrong for a new use case, file a story rather than silently lowering MIN_N_OBS; the numerical-stability constraint is the reason it exists.
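The guard-then-compute shape can be sketched with the standard probabilistic Sharpe ratio formula. This is not the gordon-strategy implementation: the MIN_N_OBS value of 30 is a placeholder, the function name and signature are illustrative, and the second None branch (non-positive variance term) is an extra guard added for this example.

```python
import math
from statistics import NormalDist
from typing import Optional

MIN_N_OBS = 30  # placeholder; the real constant lives in gordon-strategy


def probabilistic_sharpe(
    sr: float,
    benchmark_sr: float,
    n_obs: int,
    skew: float = 0.0,
    kurtosis: float = 3.0,
) -> Optional[float]:
    """PSR = Phi((SR - SR*) * sqrt(n - 1) / sqrt(1 - skew*SR + (kurt - 1)/4 * SR^2))."""
    if n_obs < MIN_N_OBS:
        return None  # sample too small: report "metric unavailable"
    variance_term = 1.0 - skew * sr + (kurtosis - 1.0) / 4.0 * sr * sr
    if variance_term <= 0.0:
        return None  # extra guard for this sketch: the square root would be undefined
    z = (sr - benchmark_sr) * math.sqrt(n_obs - 1) / math.sqrt(variance_term)
    return NormalDist().cdf(z)
```

Callers should branch on None explicitly rather than coercing it to zero, since a zero PSR would read as a confidently bad strategy instead of an unavailable metric.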
STRATEGY_BACKTEST_NON_FINITE_CANDLES_DROPPED
Severity: Warning | Category: Data
BacktestExecution::new dropped one or more input candles whose OHLC contained a non-finite value (NaN / ±Inf).
- Cause: upstream data hygiene miss. A malformed candle propagated through the warmup or backfill path into the backtest input set. Without the filter, Decimal::from_f64_retain(NaN).unwrap_or_default() collapses to zero, which trivially satisfies candle_low <= stop_price and triggers a spurious SL fill.
- Behavior: the backtest proceeds with the cleaned (finite-only) set. Total-candle count drops by the reported number; downstream fill / equity math sees a contiguous-by-time gap, not a NaN injection.
- Action: trace the symbol back to its source (gordon-data ingest) and find the upstream gap or invalid frame. The warning's structured fields (dropped, original_len, symbol) localise the issue. Fix the ingest path, not the backtest filter.
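The filter itself is simple; the sketch below shows the finite-only check and the dropped count that the warning reports. The candle tuple shape and the function name are assumptions for the example, not the BacktestExecution API.

```python
import math
from typing import List, Tuple

Candle = Tuple[float, float, float, float]  # (open, high, low, close), assumed shape


def drop_non_finite(candles: List[Candle]) -> Tuple[List[Candle], int]:
    """Return (finite-only candles, number dropped)."""
    clean = [c for c in candles if all(math.isfinite(x) for x in c)]
    return clean, len(candles) - len(clean)
```

A quick way to reproduce the warning's dropped / original_len fields on an exported window is to run this over the raw rows and compare the count with the log line.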
DP-06 follow-up codes (raw-tracing cleanup)
Codes added at the DP-06 follow-up story (plan/active/workspace/raw-tracing-cleanup.md, 2026-05-17) for the 16 pre-existing tracing::warn! sites the original DP-06 story scope deferred. Each variant maps one or more raw-warn call sites to a stable code + clickable URL — operators chasing a degraded surface no longer have to grep source.
DATA_SYMBOL_SUBSCRIPTION_FALLBACK
Severity: Warning | Category: Data
gordon-data symbol-subscription loader returned an empty trading.symbol_subscriptions table at startup and fell back to env-var defaults.
- Cause: migration 0022 (which seeds trading.symbol_subscriptions) was not applied, or the seed rows were manually deleted, or this is the no-DB-testability path running against a fresh schema.
- Behavior: ingest continues against the fallback set; the persisted enabled = TRUE subscriptions are not honoured for this boot.
- Action: verify migration 0022 was applied (SELECT count(*) FROM trading.symbol_subscriptions WHERE enabled = TRUE). If zero, reseed via the migration or gordon-data admin tooling. If this fires on a healthy production stack, the row set has drifted from operator intent.
DATA_BINANCE_TAIL_FALLBACK_FAILED
Severity: Warning | Category: Data
gordon-data /klines handler's Binance tail-fill helper (tail_fill_from_binance) failed to top up an under-filled response window.
- Cause: Binance REST unavailable (network blip, upstream rate limit, transient 5xx), a per-symbol gap upstream, or the helper's window math hit an edge case.
- Behavior: handler returns whatever DB rows were available (no top-up). The strict-mode warmup gate on the calling bot may reject; lenient consumers see a smaller-than-requested data window.
- Action: correlate with Binance status and the per-symbol ingest health in market_data.spot_klines. If sustained for a single symbol, check the upstream ingest source. If sustained across symbols, suspect Binance-side outage or rate-limit budget exhaustion.
EXECUTOR_TEST_REGRESSION_APPLIED
Severity: Warning | Category: Control
gordon-executor applied a test-only intent regression (GORDON_EXECUTOR_TEST_REGRESSION=invert_side) that mutated an inbound intent's side after structural validation.
- Cause: the env var is set AND the executor was built with cfg(test) or the test-regressions feature. Production builds do not compile the regression hook at all; this warning cannot fire in a prod container.
- Behavior: the inbound intent's side is flipped (Buy ↔ Sell) and submission proceeds with the flipped value. Loud-on-fire so any test fixture leakage into a production-shaped log stream is immediately auditable.
- Action: if observed in production logs, the build configuration is wrong; verify the executor was not shipped with the test-regressions feature enabled. Otherwise, no action: this is the e2e harness exercising the regression path.
MANAGER_STACK_HEALTH_UPSERT_FAILED
Severity: Warning | Category: Infra
gordon-manager stack-health aggregator failed to upsert a peer's status row into service_peers after a successful probe tick.
- Cause: transient DB error (pool blip, lock contention, timeout). The probe itself succeeded — the warning is on the persistence side.
- Behavior: the peer's last_seen_at lags by one tick. The next tick retries the upsert; the /healthz projection observes the older row in the meantime.
- Action: monitor sustained miss rate. Single occurrences during DB cycling are expected. Persistent occurrences indicate pool exhaustion or a role-grant drift on gordon_manager; review the pool sizing and migration 0044.
MANAGER_SERVICE_DEPLOY_NATS_CONNECT_FAILED
Severity: Warning | Category: Infra
gordon-manager service-deploy swap-wiring boot probe could not connect to NATS.
- Cause: GORDON_BUS_NATS_URL points at an unreachable broker (network partition, broker not yet up, wrong URL). Distinct from a hot-path NATS failure: this fires once during boot.
- Behavior: the swap-wiring is disabled for this boot (SwapPending arms log a warn + time out). The rest of the manager continues to serve. No green/blue deploy will complete handshake until the manager is restarted with NATS reachable.
- Action: verify broker reachability and GORDON_BUS_NATS_URL correctness. After NATS is up, restart gordon-manager so the swap-wiring is re-attempted at boot.
MANAGER_SERVICE_DEPLOY_SWAP_CONSUMER_SPAWN_FAILED
Severity: Warning | Category: Infra
gordon-manager service-deploy swap-event consumer failed to spawn during boot.
- Cause: JetStream consumer creation rejected (stream missing, durable name conflict, broker-side config drift), or the tokio task spawn itself failed.
- Behavior: the swap-wiring is partially disabled (publisher + router built but no inbound consumer). The rest of the manager continues to serve. Restart fixes once the underlying cause is resolved.
- Action: inspect the error context for the underlying spawn cause. Verify the JetStream stream + durable consumer config; check the homelab/Ansible playbook for any recent NATS topology change.
MANAGER_EXCHANGE_PING_FAILED
Severity: Warning | Category: Infra
gordon-manager /bff/exchange-ping Binance probe HTTP request failed.
- Cause: Binance unreachable (network blip, DNS, upstream 5xx, manager-side egress firewall).
- Behavior: the handler returns the most recent cached latency (stale) rather than failing the request — the console's status indicator stays live through transient outages. Cached entries TTL out after their configured window.
- Action: correlate with Binance status. Sustained failures (cache cold + Binance reachable from other services) indicate the manager-side egress is broken — investigate the manager's outbound HTTP path.
MANAGER_SYMBOLS_UPSTREAM_FAILED
Severity: Warning | Category: Infra
gordon-manager /bff/symbols/available upstream call to Binance /fapi/v1/exchangeInfo failed.
- Cause: HTTP error from Binance, non-2xx response, or response body failed to deserialise as BinanceExchangeInfo.
- Behavior: the handler returns the cached snapshot (stale) rather than failing the request, so the console keeps working through transient Binance outages. If no cache entry exists, a 503 propagates.
- Action: correlate with Binance status. Sustained failures indicate either Binance API schema drift (response body parse failures) or sustained Binance outage; escalate to a manual exchangeInfo refresh or reseed of the cached snapshot.
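The stale-on-failure behavior above can be sketched as a small decision function. This is an assumption-laden illustration rather than the manager's handler: Snapshot, SymbolsResponse, and serve_symbols are hypothetical names, and the upstream call is reduced to a Result passed in by the caller.

```rust
/// Stand-in for the parsed exchangeInfo payload (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Snapshot {
    symbols: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum SymbolsResponse {
    /// Upstream succeeded; the cache was refreshed.
    Fresh(Snapshot),
    /// Upstream failed; the cached snapshot is served instead (the warning case).
    Stale(Snapshot),
    /// Upstream failed and the cache is cold; corresponds to the 503.
    Unavailable,
}

/// Decide what to serve given the upstream outcome and the current cache.
fn serve_symbols(
    upstream: Result<Snapshot, String>,
    cache: &mut Option<Snapshot>,
) -> SymbolsResponse {
    match upstream {
        Ok(snapshot) => {
            *cache = Some(snapshot.clone());
            SymbolsResponse::Fresh(snapshot)
        }
        Err(_) => match cache {
            Some(cached) => SymbolsResponse::Stale(cached.clone()),
            None => SymbolsResponse::Unavailable,
        },
    }
}
```

The cold-cache branch is the one to watch after a manager restart during a Binance outage, since no earlier snapshot exists to fall back on.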
MANAGER_REPLAY_FILTER_INVALID
Severity: Warning | Category: Control
gordon-manager WS replay handler received a SubscribeFilter carrying a bot_id or run_id value that could not be parsed as UUID v7.
- Cause: client / fixture bug — a hostile string slipped through upstream input validation, or a stale console build is sending a non-UUID identifier. Fires from runs / roundtrips / equity-points / overlay-decisions replayers.
- Behavior: the filter is ignored; replay falls back to an unfiltered query (the safe degradation path). The client receives more rows than requested but no error.
- Action: trace the client value in the structured bot_id/run_id field. Fix the console build or test fixture emitting the bad value. Server-side, this is informational; no manager-side fix is appropriate.
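To make the degradation path concrete, here is a rough sketch of a filter that is dropped when its identifier fails a UUID v7 shape check. The check is hand-rolled for self-containment (the manager presumably uses a UUID library), and is_uuid_v7 and effective_filter are hypothetical names.

```rust
/// True when `s` is a hyphenated UUID whose version nibble is 7 and whose
/// variant nibble is 8, 9, a, or b (the RFC 9562 variant).
fn is_uuid_v7(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() != 36 {
        return false;
    }
    bytes.iter().enumerate().all(|(i, &c)| match i {
        8 | 13 | 18 | 23 => c == b'-',
        14 => c == b'7',
        19 => matches!(c, b'8' | b'9' | b'a' | b'b' | b'A' | b'B'),
        _ => c.is_ascii_hexdigit(),
    })
}

/// Returns the filter to apply (None means an unfiltered query) plus a flag
/// saying whether MANAGER_REPLAY_FILTER_INVALID would be logged.
fn effective_filter(raw: Option<&str>) -> (Option<String>, bool) {
    match raw {
        None => (None, false),
        Some(id) if is_uuid_v7(id) => (Some(id.to_owned()), false),
        // Invalid identifier: ignore the filter and warn.
        Some(_) => (None, true),
    }
}
```

Note that an absent filter and an invalid filter both produce an unfiltered query here; only the warning flag distinguishes them, which matches the "more rows than requested but no error" behavior.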
BOT_LEASE_GUARD_DROPPED_WITHOUT_RELEASE
Severity: Warning | Category: Control
gordon-bot LeaseGuard was dropped without an explicit release() call.
- Cause: a code path skipped the release call, typically a ?-propagated error before the guard was explicitly released, or a test fixture forcing an abrupt drop. Not a runtime defect when intentional.
- Behavior: the Postgres connection close releases the advisory lock server-side a millisecond later (auto-release semantics). The bot_leases row's holder metadata stays stale until the next acquire overwrites it.
- Action: if this fires in production, find the dropped path and add explicit release() so server-side bot_leases holder metadata stays accurate. If fixture-only, ignore.
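The ?-propagation cause is easiest to see in a small model of the guard. The sketch below is not the gordon-bot LeaseGuard: it has no Postgres connection, and it counts drops on a shared counter where the real guard would log the warning. The field names, acquire, and run_step are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Simplified stand-in for a lease guard.
struct LeaseGuard {
    released: bool,
    /// Incremented where the real guard would emit the warning.
    dropped_without_release: Arc<AtomicUsize>,
}

impl LeaseGuard {
    fn acquire(counter: Arc<AtomicUsize>) -> Self {
        LeaseGuard { released: false, dropped_without_release: counter }
    }

    /// Explicit release: marks the guard so that Drop stays silent.
    fn release(mut self) {
        self.released = true;
        // `self` is dropped here with released == true.
    }
}

impl Drop for LeaseGuard {
    fn drop(&mut self) {
        if !self.released {
            self.dropped_without_release.fetch_add(1, Ordering::Relaxed);
        }
    }
}

/// A step that holds the lease; an early `?` return skips release().
fn run_step(counter: Arc<AtomicUsize>, step: Result<(), String>) -> Result<(), String> {
    let guard = LeaseGuard::acquire(counter);
    step?; // on Err, `guard` is dropped right here without release()
    guard.release();
    Ok(())
}
```

In this model the fix is to call release() on the error path as well, for example by matching on the step result before returning instead of using ? directly.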