Troubleshooting
When to use: diagnosing runtime failures on the v7 stack.
Service won't start
Symptom: a service exits immediately or fails /healthz.
1. Check role probe
Every v7 service runs a DB role probe on startup. If the probe fails, the service exits with a structured error:
docker compose logs <service> --tail 50 | grep -E "role_probe|FATAL|startup"
Common causes:
| Log signature | Cause | Fix |
|---|---|---|
| role "gordon_X" does not exist | Migration has not run or failed | Re-run gordon-migrate: docker compose run --rm gordon-migrate |
| permission denied for table | Role grants not applied | Re-run gordon-migrate to re-apply grants |
| password authentication failed | Env var has wrong password | Verify GORDON_DATABASE_URL in docker-compose env |
| missing required env var | Config gap | Compare service env against .env.example |
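The table can also be applied mechanically to a log stream. The following is a minimal sketch, assuming the log signatures appear verbatim in the service output; the function name suggest_fix is hypothetical and not part of the v7 tooling.

```shell
# Hypothetical helper: map one startup log line to the fix listed in the table.
# Prints the fix and returns 0 on a known signature; prints nothing and returns 1 otherwise.
suggest_fix() {
  case "$1" in
    *'role "'*'" does not exist'*)
      echo 'Re-run gordon-migrate: docker compose run --rm gordon-migrate' ;;
    *'permission denied for table'*)
      echo 'Re-run gordon-migrate to re-apply grants' ;;
    *'password authentication failed'*)
      echo 'Verify GORDON_DATABASE_URL in docker-compose env' ;;
    *'missing required env var'*)
      echo 'Compare service env against .env.example' ;;
    *)
      return 1 ;;
  esac
}

# Usage (prints one suggestion per matching log line):
#   docker compose logs <service> --tail 50 |
#     while IFS= read -r line; do suggest_fix "$line"; done
```

Lines that match no signature produce no output, so the loop prints only actionable suggestions.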
2. Check healthz and readyz
for port in 8081 8082 8083 8084 8085; do
  echo -n "port $port healthz: "
  curl -fsS --max-time 2 http://localhost:$port/healthz 2>&1 || echo FAIL
done
/healthz returns 503 if the service is alive but in a fatal state. /readyz returns 503 if warmup is incomplete or a role probe failed.
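To read both endpoints together, the status codes can be combined into one verdict per port. This is a sketch under the semantics stated above, assuming a healthy endpoint answers 200; probe_code and interpret_probe are hypothetical names.

```shell
# probe_code URL: print the HTTP status code, or 000 if no response was received.
probe_code() {
  curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1" || true
}

# interpret_probe HEALTHZ_CODE READYZ_CODE: describe the service state.
interpret_probe() {
  case "$1:$2" in
    200:200) echo 'ready' ;;
    200:503) echo 'alive, not ready (warmup incomplete or role probe failed)' ;;
    503:*)   echo 'alive but in a fatal state' ;;
    000:*)   echo 'not reachable (process exited or port closed)' ;;
    *)       echo "unexpected status: healthz=$1 readyz=$2" ;;
  esac
}

# Usage:
#   for port in 8081 8082 8083 8084 8085; do
#     h=$(probe_code "http://localhost:$port/healthz")
#     r=$(probe_code "http://localhost:$port/readyz")
#     echo "port $port: $(interpret_probe "$h" "$r")"
#   done
```

Keeping the interpretation separate from the curl call makes the mapping easy to adjust if the endpoints gain more status codes.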
3. Migration failed
If gordon-migrate exited non-zero, all services will fail to connect:
docker compose logs gordon-migrate --tail 50
If a migration failed mid-run, the DB may be in a partial state. Verify:
docker compose exec postgres psql -U gordon -d gordon \
-c "SELECT version FROM public._sqlx_migrations ORDER BY version DESC LIMIT 5;"
Compare against the expected latest migration in gordon-migrate/migrations/.
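The comparison can be scripted. The sketch below assumes migration files follow the sqlx naming convention of a numeric version prefix followed by an underscore (for example `<version>_<description>.sql`) and that the `_sqlx_migrations` table has a boolean `success` column; the function names are hypothetical.

```shell
# latest_file_version DIR: highest numeric version prefix among DIR/*.sql files.
latest_file_version() {
  ls "$1" | sed -n -E 's/^([0-9]+)_.*\.sql$/\1/p' | sort -n | tail -n 1
}

# latest_applied_version: highest version recorded as successfully applied.
# Assumes the success column exists, as in stock sqlx.
latest_applied_version() {
  docker compose exec -T postgres psql -U gordon -d gordon -At \
    -c "SELECT max(version) FROM public._sqlx_migrations WHERE success;"
}

# compare_versions APPLIED EXPECTED: report whether the DB is at the expected version.
compare_versions() {
  if [ "$1" = "$2" ]; then
    echo "up to date at $1"
  else
    echo "mismatch: applied=$1 expected=$2"
    return 1
  fi
}

# Usage:
#   compare_versions "$(latest_applied_version)" \
#     "$(latest_file_version gordon-migrate/migrations)"
```

A mismatch only shows that the database is behind or ahead of the files; the gordon-migrate logs remain the place to find out why.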
NATS consumer fell behind
Symptom: services process events but with increasing lag; intents queue up without being consumed.
# Check JetStream consumer lag via NATS CLI (if installed)
nats consumer info gordon-bus executor-default
Look for Num Pending: if it is growing steadily, the consumer is not keeping up.
Common causes:
- gordon-executor is down or crash-looping (check /healthz).
- The consumer's deliver subject is misconfigured: check service startup logs for consumer_create.
- Database write bottleneck: check Postgres CPU and connection count.
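Whether Num Pending is growing is easier to judge from two samples than from one. A minimal sketch, assuming the NATS CLI's --json output exposes the pending count as num_pending and that jq is installed; the 30-second interval and the function names are illustrative choices.

```shell
# pending_now: current pending message count for the executor consumer.
# Assumes `nats consumer info --json` reports it in the num_pending field.
pending_now() {
  nats consumer info gordon-bus executor-default --json | jq -r '.num_pending'
}

# lag_trend BEFORE AFTER: classify the change between two samples.
lag_trend() {
  if [ "$2" -gt "$1" ]; then
    echo growing
  elif [ "$2" -lt "$1" ]; then
    echo draining
  else
    echo steady
  fi
}

# Usage:
#   a=$(pending_now); sleep 30; b=$(pending_now)
#   echo "pending $a -> $b: $(lag_trend "$a" "$b")"
```

A single "growing" result can be a normal burst; repeat the sample a few times before concluding that the consumer is falling behind.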
Recovery:
# Restart the lagging consumer service
docker compose restart gordon-executor
# Verify consumer lag is draining
nats consumer info gordon-bus executor-default
Halt latch is on
Symptom: gordon-risk reports halted=true; no new orders are being submitted.
The halt latch lives in trading.risk_state.halted. It is set by a circuit breaker trip and requires explicit operator action to clear.
1. Identify which breaker tripped
docker compose exec postgres psql -U gordon -d gordon \
-c "SET search_path=trading; SELECT * FROM risk_state ORDER BY updated_at DESC LIMIT 1;"
# Check recent risk events
docker compose exec postgres psql -U gordon -d gordon \
-c "SET search_path=trading; SELECT breaker, event, created_at FROM risk_events ORDER BY created_at DESC LIMIT 10;"
Breakers: drawdown, connectivity, vpin, macro, correlation.
2. Investigate the root cause
| Breaker | Common cause | Investigation |
|---|---|---|
| drawdown | Peak-to-trough equity exceeded threshold | Review trading.equity_points for the drawdown curve |
| connectivity | gordon-data source freshness exceeded budget | Check gordon_data_source_freshness_seconds in Grafana; verify gordon-data is ingesting |
| vpin | VPIN score above toxic-flow threshold | Check gordon_risk_vpin_value in Grafana |
| macro | FRED macro regime triggered | Check recent macro data in market_data.metrics |
| correlation | Cross-asset correlation density spiked | Check recent correlation data |
3. Clear the halt after fixing the root cause
Current routing (DP-12 in-flight): POST /risk/resume goes directly to gordon-risk. When DP-12 ships, it will route through the manager BFF at POST /bff/risk/resume.
# Current (DP-12 not yet shipped)
curl -X POST http://localhost:8082/risk/resume \
-H "x-operator-token: <operator-token>" \
-H "Content-Type: application/json" \
-d '{"reason": "root cause investigated and resolved"}'
Expected response: {"status":"resumed","resumed_bots":[...]}.
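After the call, it is worth confirming both the response and the latch itself. The following sketch assumes trading.risk_state.halted is a boolean column (psql prints it as t or f) and that the response body has the shape shown above; resume_ok and latch_state are hypothetical helper names.

```shell
# resume_ok: read the response body on stdin; succeed only if status is "resumed".
resume_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"resumed"'
}

# latch_state: print the halted flag from the most recent risk_state row.
# Assumes a boolean column, so the expected output after a resume is "f".
latch_state() {
  docker compose exec -T postgres psql -U gordon -d gordon -At \
    -c "SELECT halted FROM trading.risk_state ORDER BY updated_at DESC LIMIT 1;"
}

# Usage:
#   curl -sS -X POST http://localhost:8082/risk/resume \
#     -H "x-operator-token: <operator-token>" \
#     -H "Content-Type: application/json" \
#     -d '{"reason": "root cause investigated and resolved"}' | resume_ok \
#     && [ "$(latch_state)" = "f" ] && echo "halt cleared"
```

If the response reports resumed but the latch still reads t, treat the resume as failed and re-check the breaker events before retrying.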
Postgres LISTEN/NOTIFY missed
Symptom: gordon-manager's WS fanout stops updating the console; channels like risk_halt_changed go stale.
The gordon-bus pg-NOTIFY outbox has a debounce window. On Postgres reconnect, the subscriber catches up from the last committed offset.
Diagnosis:
docker compose logs gordon-manager --tail 50 | grep -E "notify|listen|reconnect|ipc"
If the manager lost the LISTEN connection:
docker compose restart gordon-manager
The subscriber re-subscribes on startup and replays any events it missed since the last committed offset. No data loss: the outbox pattern guarantees at-least-once delivery.
BuildKit OOM during e2e
Symptom: make e2e fails with rpc error: code = Unavailable desc = error reading from server: EOF or io: read/write on closed pipe.
Sequential builds (story 04, 2026-04-20) mean VM OOM is not expected on a default 8 GB Docker Desktop VM. If this still happens:
Check for re-introduced parallelism
grep -n "compose.*--parallel" /path/to/gordon-workspace/Makefile
Expected: no matches. If --parallel N appears, revert it.
Disk low
make e2e-preflight
If [DISK_LOW] appears:
make clean-workspace
make e2e-preflight
make e2e
For severe disk pressure:
AGGRESSIVE=1 make clean-workspace
Crash recovery
After a daemon crash that left orphaned containers:
make e2e-recover
make e2e-preflight && make e2e
See e2e-testing for the full decision tree.
Service reports degraded boot
Symptom: service is alive (/healthz 200) but /readyz returns 503 with BOOT_DEGRADED code.
This means the service started but one or more non-fatal initialisation steps failed. The service continues operating at reduced capability.
curl -fsS http://localhost:<port>/readyz | jq .
docker compose logs <service> --tail 100 | grep -E "DEGRADED|degraded|WARN"
Common non-fatal degraded causes:
- Warmup data not fully available (bot) — see bot-warmup-fallback.
- Optional upstream unreachable at boot (FRED macro data).
- IPC subscriber start failed — manager loses freshness fanout but continues operating.
For each degraded cause, fix the upstream issue and restart the service. Degraded services should not run production traffic; investigate and recover promptly.
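After restarting a service, polling /readyz confirms that it left the degraded state. This is a sketch only: wait_ready is a hypothetical helper, and the default of 30 attempts at 2-second intervals is an arbitrary assumption rather than a documented warmup budget.

```shell
# wait_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it answers with a success status; curl -f treats 503 as failure.
wait_ready() {
  wr_url="$1"
  wr_attempts="${2:-30}"
  wr_delay="${3:-2}"
  wr_i=0
  while [ "$wr_i" -lt "$wr_attempts" ]; do
    if curl -fs --max-time 2 -o /dev/null "$wr_url"; then
      echo "ready after $wr_i retries"
      return 0
    fi
    wr_i=$((wr_i + 1))
    sleep "$wr_delay"
  done
  echo "still not ready after $wr_attempts attempts" >&2
  return 1
}

# Usage:
#   docker compose restart <service>
#   wait_ready "http://localhost:<port>/readyz" || docker compose logs <service> --tail 100
```

If the poll times out, the readyz body and the DEGRADED log lines shown above are the next things to inspect.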
Related
- Incident response — escalated actions: emergency flatten, quarantine clear
- Monitoring — Prometheus, Grafana, Loki queries
- E2E testing — full e2e failure decision tree
- Bot warmup fallback — strict/degraded boot semantics