Incident response
When to use: a circuit breaker has tripped, positions need emergency liquidation, a bot is quarantined, or operator tokens need rotation.
Halt-latch triggered
The halt latch lives in trading.risk_state.halted. It is set by any of the five circuit breakers and requires explicit operator action to clear. No new orders are submitted while the latch is active.
1. Confirm the halt
curl -fsS http://srv-apps:8082/healthz | jq .
# Look for: "halted": true
Or in Grafana: gordon_risk_halt_latch == 1 in the Gordon Risk dashboard.
2. Identify which breaker tripped
docker compose exec postgres psql -U gordon -d gordon -c "
SET search_path = trading;
SELECT breaker, event, created_at, metadata
FROM risk_events
ORDER BY created_at DESC LIMIT 10;
"
| Breaker | What tripped it | Investigation |
|---|---|---|
| drawdown | Peak-to-trough equity exceeded threshold | Review trading.equity_points for the drawdown curve |
| connectivity | gordon-data source freshness exceeded budget | Check GET /sources/health on gordon-data:8081 |
| vpin | VPIN score above toxic-flow threshold | Check gordon_risk_vpin_value in Grafana |
| macro | FRED macro regime triggered | Check market_data.metrics for recent FRED values |
| correlation | Cross-asset correlation density spiked | Check recent correlation data in metrics |
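As a triage aid, the table above can be folded into a small lookup. This is a minimal sketch: the function name breaker_hint is hypothetical, the hints are copied verbatim from the table, and nothing is queried.

```shell
#!/bin/sh
# Hypothetical helper: print the first investigation step for a breaker name.
# The mapping mirrors the breaker table; it is not read from gordon-risk.
breaker_hint() {
    case "$1" in
        drawdown)     echo "Review trading.equity_points for the drawdown curve" ;;
        connectivity) echo "Check GET /sources/health on gordon-data:8081" ;;
        vpin)         echo "Check gordon_risk_vpin_value in Grafana" ;;
        macro)        echo "Check market_data.metrics for recent FRED values" ;;
        correlation)  echo "Check recent correlation data in metrics" ;;
        *)            echo "unknown breaker: $1" >&2; return 1 ;;
    esac
}
```

Pass it the breaker column from the most recent risk_events row, for example breaker_hint connectivity.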
3. Fix the root cause
Do not resume until the root cause is resolved. Resuming a halted system into the same condition that tripped the breaker will trigger another halt immediately.
- For drawdown: review position sizing; verify SL distances are correct.
- For connectivity: restore gordon-data ingest; verify source freshness is within budget.
- For vpin: wait for market conditions to normalise; review the VPIN threshold.
- For macro: review FRED data; adjust threshold if market regime change is not a danger signal.
- For correlation: review correlation density; adjust threshold if appropriate.
4. Clear the halt
Current routing (DP-12 in flight): POST /risk/resume goes to gordon-risk directly. Once DP-12 ships, the call will route through the manager BFF at POST /bff/risk/resume.
# Current (DP-12 not yet shipped)
curl -X POST http://srv-apps:8082/risk/resume \
-H "x-operator-token: <operator-token>" \
-H "Content-Type: application/json" \
-d '{"reason": "<description of root cause and resolution>"}'
Expected response:
{ "status": "resumed", "resumed_bots": ["<bot_id_1>", "<bot_id_2>"] }
A resumed_bots: [] result means no bots were paused by the risk engine (the halt was set by a breaker trip, not by a per-bot pause command). The system is still resumed: the halt latch is cleared.
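The reason field in the resume request is free text, so quotes or backslashes typed by an operator can break the JSON body. The sketch below builds the payload with basic escaping. The function name resume_payload and the OPERATOR_TOKEN variable are hypothetical, and rejecting an empty reason is an assumption here, not a documented server rule.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body for POST /risk/resume.
# Escapes backslashes and double quotes so the reason stays valid JSON.
# Assumes the reason is a single line without control characters.
resume_payload() {
    if [ -z "$1" ]; then
        echo "reason must not be empty" >&2
        return 1
    fi
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"reason": "%s"}\n' "$escaped"
}

# Usage (OPERATOR_TOKEN is an assumed environment variable):
#   curl -X POST http://srv-apps:8082/risk/resume \
#     -H "x-operator-token: $OPERATOR_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(resume_payload 'gordon-data ingest restored; freshness within budget')"
```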
5. Verify
curl -fsS http://srv-apps:8082/healthz | jq .halted
# Expected: false
Monitor for re-trip over the next 10–15 minutes. If the same breaker trips again, the root cause is not resolved.
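The monitoring window can be scripted rather than checked by hand. The following is a minimal sketch that polls the same /healthz endpoint and jq filter used in step 5. The function names, the exit codes, and the polling cadence are assumptions, not part of gordon-risk.

```shell
#!/bin/sh
# Hypothetical helper: poll the halt latch after a resume and stop on re-trip.

# Print the current latch value ("true" or "false").
# Override this function to exercise the loop without network access.
fetch_halted() {
    curl -fsS http://srv-apps:8082/healthz | jq -r .halted
}

# watch_for_retrip CHECKS INTERVAL_SECS
# Returns 0 if the latch stayed clear on every check, 1 on re-trip,
# and 2 if a check returned anything other than true or false.
watch_for_retrip() {
    checks=$1
    interval=$2
    i=0
    while [ "$i" -lt "$checks" ]; do
        state=$(fetch_halted)
        case "$state" in
            false) ;;
            true)
                echo "halt latch re-tripped on check $((i + 1))" >&2
                return 1 ;;
            *)
                echo "could not read halt state" >&2
                return 2 ;;
        esac
        i=$((i + 1))
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, watch_for_retrip 15 60 performs 15 checks one minute apart, which covers roughly the 10–15 minute window.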
Emergency flatten
Use when you need to close all open positions immediately, regardless of breaker state.
POST /risk/emergency-flatten closes all positions and halts the system. It requires an x-operator-token header. This is an operator-level destructive action.
# Via manager BFF (when DP-12 ships — preferred path):
curl -X POST http://srv-apps:8083/bff/risk/emergency-flatten \
-H "x-operator-token: <operator-token>" \
-H "Content-Type: application/json" \
-d '{"reason": "<why you are flattening>"}'
# Direct to gordon-risk (current — DP-12 drift):
curl -X POST http://srv-apps:8082/risk/emergency-flatten \
-H "x-operator-token: <operator-token>" \
-H "Content-Type: application/json" \
-d '{"reason": "<why you are flattening>"}'
Expected response: {"status": "flatten_initiated", "positions_targeted": N}.
Monitor trading.orders for flatten orders appearing with order_type='flatten'. Verify positions close in trading.positions.
After positions close, the halt latch is set. You must investigate and resolve before resuming.
Bot quarantine
A bot enters quarantine when it fails to start max_reconcile_failures times within reconcile_failure_window_secs. Quarantined bots do not receive new bot commands; the reconciler skips them until an operator clears the quarantine.
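To make the rule concrete, the sketch below counts failures inside a sliding window. It illustrates the stated rule only: the function name is hypothetical, and the real reconciler's boundary handling (inclusive or exclusive window, "reaches" or "exceeds" the limit) is not documented here and is assumed.

```shell
#!/bin/sh
# Hypothetical sketch of the quarantine rule as a sliding-window count.
# should_quarantine NOW WINDOW_SECS MAX_FAILURES TS...
# NOW and each TS are Unix timestamps in seconds.
# Returns 0 if the bot would be quarantined, 1 otherwise.
should_quarantine() {
    now=$1
    window=$2
    max=$3
    shift 3
    recent=0
    for ts in "$@"; do
        # A failure counts only if it is no older than the window (assumed inclusive).
        if [ $(( $now - $ts )) -le "$window" ]; then
            recent=$((recent + 1))
        fi
    done
    [ "$recent" -ge "$max" ]
}
```

With a 300-second window and a limit of 3, failures at ages 10, 100, and 200 seconds would trigger quarantine, while ages 10, 100, and 900 would not.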
Diagnose
docker compose exec postgres psql -U gordon -d gordon -c "
SET search_path = trading;
SELECT id, bot_name, quarantined_at, updated_at
FROM bot_configs
WHERE quarantined_at IS NOT NULL;
"
# Check recent bot events for the quarantined bot
docker compose exec postgres psql -U gordon -d gordon -c "
SET search_path = trading;
SELECT event_type, created_at, metadata
FROM bot_events
WHERE bot_id = '<bot_id>'
ORDER BY created_at DESC LIMIT 20;
"
Common quarantine causes:
| Event sequence | Cause |
|---|---|
| Repeated container_started + reconcile_error | Bot exits non-zero on startup; check bot logs |
| BOT_LEASE_ACQUIRE_TIMEOUT | Another bot holds the same strategy+symbol lease; check for zombie container |
| BOT_ROLE_PROBE_BYPASS_DETECTED | DB role probe failed; check gordon_bot password rotation |
| BOT_WARMUP_INCOMPLETE | gordon-data unreachable or stale; see bot-warmup-fallback |
Clear quarantine
After investigating and resolving the root cause:
curl -X POST "http://srv-apps:8083/bff/bots/<bot_id>/clear-quarantine?confirm=YES-$(date +%s)" \
-H "x-operator-token: <operator-token>"
Expected response: {"status": "quarantine_cleared"}.
The manager reconciler resumes watching the bot and will attempt to restart it.
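The clear-quarantine command embeds the current Unix timestamp in its confirm parameter. A small helper keeps the URL construction in one place when clearing several bots. This is a sketch: the function name is hypothetical, and the optional timestamp argument exists only so the output can be checked deterministically.

```shell
#!/bin/sh
# Hypothetical helper: build the clear-quarantine URL for one bot.
# clear_quarantine_url BOT_ID [UNIX_TS]
# UNIX_TS defaults to the current time, matching confirm=YES-$(date +%s).
clear_quarantine_url() {
    bot_id=$1
    ts=${2:-$(date +%s)}
    if [ -z "$bot_id" ]; then
        echo "bot id is required" >&2
        return 1
    fi
    printf 'http://srv-apps:8083/bff/bots/%s/clear-quarantine?confirm=YES-%s\n' "$bot_id" "$ts"
}

# Usage (OPERATOR_TOKEN is an assumed environment variable):
#   curl -X POST "$(clear_quarantine_url '<bot_id>')" \
#     -H "x-operator-token: $OPERATOR_TOKEN"
```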
Operator tokens
Operator tokens authenticate requests to destructive endpoints (/risk/emergency-flatten, /risk/resume, /bff/risk/*) using constant-time comparison. Tokens are injected via Ansible vault and are never committed to any repo.
Production tokens live in homelab/group_vars/srv-apps/vault.yml under gordon_operator_token.
To rotate operator tokens:
- Generate a new token: openssl rand -hex 32.
- Update the vault: ansible-vault edit homelab/group_vars/srv-apps/vault.yml.
- Re-deploy via Ansible to inject the new token into gordon-manager and gordon-risk.
- Verify the new token works: curl -X POST http://srv-apps:8083/bff/risk/resume -H "x-operator-token: <new-token>" should return a structured response, not 401.
Old tokens are invalid immediately after the service restarts with the new value.
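Before pasting a new token into the vault, a format check can catch truncated or mangled values. openssl rand -hex 32 prints 32 random bytes as 64 lowercase hex characters, so the sketch below checks for exactly that shape. The function name is hypothetical, and the check assumes every token is generated with that command.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a freshly generated operator token.
# Accepts exactly 64 lowercase hex characters, the output shape of
# openssl rand -hex 32. It says nothing about whether a service accepts it.
is_valid_token() {
    [ "${#1}" -eq 64 ] || return 1
    case "$1" in
        *[!0-9a-f]*) return 1 ;;
    esac
    return 0
}

# Usage:
#   token=$(openssl rand -hex 32)
#   is_valid_token "$token" || echo "unexpected token format" >&2
```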
Related
- Troubleshooting — halt-latch diagnosis and NATS lag
- Monitoring — alert rules for breaker trips and halt state
- Live trading — safety rules and pre-live checklist