
Incident response

When to use: a circuit breaker has tripped, positions need emergency liquidation, a bot is quarantined, or operator tokens need rotation.

Halt-latch triggered

The halt latch lives in trading.risk_state.halted. It is set by any of the five circuit breakers and requires explicit operator action to clear. No new orders are submitted while the latch is active.

1. Confirm the halt

bash
curl -fsS http://srv-apps:8082/healthz | jq .
# Look for: "halted": true

Or in Grafana: gordon_risk_halt_latch == 1 in the Gordon Risk dashboard.

2. Identify which breaker tripped

bash
docker compose exec postgres psql -U gordon -d gordon -c "
  SET search_path = trading;
  SELECT breaker, event, created_at, metadata
  FROM risk_events
  ORDER BY created_at DESC LIMIT 10;
"

| Breaker | What tripped it | Investigation |
| --- | --- | --- |
| drawdown | Peak-to-trough equity exceeded threshold | Review trading.equity_points for the drawdown curve |
| connectivity | gordon-data source freshness exceeded budget | Check GET /sources/health on gordon-data:8081 |
| vpin | VPIN score above toxic-flow threshold | Check gordon_risk_vpin_value in Grafana |
| macro | FRED macro regime triggered | Check market_data.metrics for recent FRED values |
| correlation | Cross-asset correlation density spiked | Check recent correlation data in metrics |
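
For scripted triage, the breaker-to-investigation mapping can be sketched as a small shell helper. This is only an illustration: breaker_hint is a hypothetical name, and it assumes the breaker column in risk_events holds exactly the five names listed for the breakers.

```bash
# breaker_hint <breaker>: print the first investigation step for a breaker.
# Hypothetical helper; assumes breaker names match the list exactly.
breaker_hint() {
  case "$1" in
    drawdown)     echo "Review trading.equity_points for the drawdown curve" ;;
    connectivity) echo "Check GET /sources/health on gordon-data:8081" ;;
    vpin)         echo "Check gordon_risk_vpin_value in Grafana" ;;
    macro)        echo "Check market_data.metrics for recent FRED values" ;;
    correlation)  echo "Check recent correlation data in metrics" ;;
    *)            echo "Unknown breaker: $1" >&2; return 1 ;;
  esac
}
```

Calling breaker_hint vpin prints the Grafana check; an unrecognised name returns a non-zero status so a wrapper script can stop instead of guessing.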

3. Fix the root cause

Do not resume until the root cause is resolved. Resuming a halted system into the same condition that tripped the breaker will trigger another halt immediately.

  • For drawdown: review position sizing; verify stop-loss (SL) distances are correct.
  • For connectivity: restore gordon-data ingest; verify source freshness is within budget.
  • For vpin: wait for market conditions to normalise; review the VPIN threshold.
  • For macro: review FRED data; adjust threshold if market regime change is not a danger signal.
  • For correlation: review correlation density; adjust threshold if appropriate.

4. Clear the halt

Current routing (DP-12 in flight): POST /risk/resume goes directly to gordon-risk on port 8082. Once DP-12 ships, the call will route through the manager BFF at POST /bff/risk/resume.

bash
# Current (DP-12 not yet shipped)
curl -X POST http://srv-apps:8082/risk/resume \
  -H "x-operator-token: <operator-token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "<description of root cause and resolution>"}'

Expected response:

json
{ "status": "resumed", "resumed_bots": ["<bot_id_1>", "<bot_id_2>"] }

A resumed_bots: [] result means no bots were paused by the risk engine (the halt was set by a breaker trip, not by a per-bot pause command). The system is still resumed — the halt latch is cleared.

5. Verify

bash
curl -fsS http://srv-apps:8082/healthz | jq .halted
# Expected: false

Monitor for re-trip over the next 10–15 minutes. If the same breaker trips again, the root cause is not resolved.
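
The monitoring window can be automated with a polling loop. The following is a minimal sketch, not part of the Gordon tooling: watch_for_retrip and fetch_halted are hypothetical names, and the script assumes /healthz keeps returning a boolean halted field as in step 1.

```bash
HEALTH_URL="${HEALTH_URL:-http://srv-apps:8082/healthz}"

# Print the current value of .halted ("true" or "false").
fetch_halted() {
  curl -fsS "$HEALTH_URL" | jq -r '.halted'
}

# watch_for_retrip <duration_secs> <interval_secs>
# Returns 0 if the latch stayed clear for the whole window,
# 1 as soon as a re-trip is seen, 2 if the probe gives an unexpected answer.
watch_for_retrip() {
  local duration="$1" interval="$2" elapsed=0 state
  while (( elapsed < duration )); do
    state="$(fetch_halted)" || state=""
    case "$state" in
      false) ;;
      true)  echo "re-trip detected after ${elapsed}s" >&2; return 1 ;;
      *)     echo "unexpected healthz answer: '${state}'" >&2; return 2 ;;
    esac
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 0
}
```

For a 15-minute window with a 30-second poll, run watch_for_retrip 900 30. A return status of 1 means the breaker tripped again and step 2 should be repeated.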

Emergency flatten

Use when you need to close all open positions immediately, regardless of breaker state.

POST /risk/emergency-flatten closes all positions and halts the system. It requires an x-operator-token header. This is an operator-level destructive action.

bash
# Via manager BFF (when DP-12 ships — preferred path):
curl -X POST http://srv-apps:8083/bff/risk/emergency-flatten \
  -H "x-operator-token: <operator-token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "<why you are flattening>"}'

# Direct to gordon-risk (current — DP-12 drift):
curl -X POST http://srv-apps:8082/risk/emergency-flatten \
  -H "x-operator-token: <operator-token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "<why you are flattening>"}'

Expected response: {"status": "flatten_initiated", "positions_targeted": N}.

Monitor trading.orders for flatten orders appearing with order_type='flatten'. Verify positions close in trading.positions.
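
One way to watch for the flatten orders is a query along these lines. It is a sketch under assumptions: flatten_orders_sql is a hypothetical helper, SELECT * is used because the column list of trading.orders is not documented here, and a created_at column on that table is assumed by analogy with risk_events and bot_events.

```bash
# flatten_orders_sql [limit]: print a query for the most recent flatten orders.
# Assumption: trading.orders has a created_at column; adjust if it does not.
flatten_orders_sql() {
  local limit="${1:-20}"
  cat <<SQL
SET search_path = trading;
SELECT *
FROM orders
WHERE order_type = 'flatten'
ORDER BY created_at DESC
LIMIT ${limit};
SQL
}

# Usage, with the same psql invocation used elsewhere in this runbook:
#   docker compose exec postgres psql -U gordon -d gordon -c "$(flatten_orders_sql 20)"
```

Re-run the query until the number of flatten orders matches positions_targeted from the response, then confirm in trading.positions that the positions have closed.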

Emergency flatten leaves the halt latch set. Investigate and resolve the underlying problem, then clear the halt as described under "Halt-latch triggered" before resuming.

Bot quarantine

A bot enters quarantine when it fails to start max_reconcile_failures times within reconcile_failure_window_secs. Quarantined bots do not receive new bot commands; the reconciler skips them until an operator clears the quarantine.

Diagnose

bash
docker compose exec postgres psql -U gordon -d gordon -c "
  SET search_path = trading;
  SELECT id, bot_name, quarantined_at, updated_at
  FROM bot_configs
  WHERE quarantined_at IS NOT NULL;
"

# Check recent bot events for the quarantined bot
docker compose exec postgres psql -U gordon -d gordon -c "
  SET search_path = trading;
  SELECT event_type, created_at, metadata
  FROM bot_events
  WHERE bot_id = '<bot_id>'
  ORDER BY created_at DESC LIMIT 20;
"

Common quarantine causes:

| Event sequence | Cause |
| --- | --- |
| Repeated container_started + reconcile_error | Bot exits non-zero on startup; check bot logs |
| BOT_LEASE_ACQUIRE_TIMEOUT | Another bot holds the same strategy+symbol lease; check for zombie container |
| BOT_ROLE_PROBE_BYPASS_DETECTED | DB role probe failed; check gordon_bot password rotation |
| BOT_WARMUP_INCOMPLETE | gordon-data unreachable or stale; see bot-warmup-fallback |

Clear quarantine

After investigating and resolving the root cause:

bash
curl -X POST "http://srv-apps:8083/bff/bots/<bot_id>/clear-quarantine?confirm=YES-$(date +%s)" \
  -H "x-operator-token: <operator-token>"

Expected response: {"status": "quarantine_cleared"}.

The manager reconciler resumes watching the bot and will attempt to restart it.

Operator tokens

Operator tokens authenticate requests to destructive endpoints (/risk/emergency-flatten, /risk/resume, /bff/risk/*); the token comparison is constant-time. Tokens are injected via Ansible vault and are never committed to any repo.

Production tokens live in homelab/group_vars/srv-apps/vault.yml under gordon_operator_token.

To rotate operator tokens:

  1. Generate a new token: openssl rand -hex 32.
  2. Update the vault: ansible-vault edit homelab/group_vars/srv-apps/vault.yml.
  3. Re-deploy via Ansible to inject the new token into gordon-manager and gordon-risk.
  4. Verify the new token works: curl -X POST http://srv-apps:8083/bff/risk/resume -H "x-operator-token: <new-token>". It should return a structured response, not 401. Until DP-12 ships, send the same request to http://srv-apps:8082/risk/resume instead.

Old tokens are invalid immediately after the service restarts with the new value.
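
Before writing a new token into the vault, a quick format check can catch copy-paste truncation. This is a hedged sketch: generate_operator_token and is_valid_operator_token are hypothetical helpers, and the 64-character lowercase hex format is only what openssl rand -hex 32 produces, not a rule the services are known to enforce.

```bash
# Generate a token as in rotation step 1 (32 random bytes, hex-encoded).
generate_operator_token() {
  openssl rand -hex 32
}

# Return 0 only for a 64-character lowercase hex string.
# Assumption: tokens are always produced by "openssl rand -hex 32".
is_valid_operator_token() {
  [[ "$1" =~ ^[0-9a-f]{64}$ ]]
}
```

For example, new_token="$(generate_operator_token)" followed by is_valid_operator_token "$new_token" confirms the value is intact before it is pasted into ansible-vault edit.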
