
Troubleshooting

When to use: diagnosing runtime failures on the v7 stack.

Service won't start

Symptom: a service exits immediately or fails /healthz.

1. Check role probe

Every v7 service runs a DB role probe on startup. If the probe fails, the service exits with a structured error:

bash
docker compose logs <service> --tail 50 | grep -E "role_probe|FATAL|startup"

Common causes:

| Log signature | Cause | Fix |
| --- | --- | --- |
| role "gordon_X" does not exist | Migration has not run or failed | Re-run gordon-migrate: docker compose run --rm gordon-migrate |
| permission denied for table | Role grants not applied | Re-run gordon-migrate to re-apply grants |
| password authentication failed | Env var has wrong password | Verify GORDON_DATABASE_URL in docker-compose env |
| missing required env var | Config gap | Compare service env against .env.example |
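The log signatures above lend themselves to a small triage helper. The following is a minimal sketch, not part of the v7 tooling: `classify_startup_error` is a hypothetical function name, and the patterns assume the signatures appear verbatim in the log line.

```bash
# Map one startup log line to the likely fix from the signature list.
# classify_startup_error is a hypothetical helper (assumption), not shipped tooling.
classify_startup_error() {
  case "$1" in
    *'role "gordon_'*'does not exist'*) echo "re-run gordon-migrate" ;;
    *'permission denied for table'*)    echo "re-run gordon-migrate to re-apply grants" ;;
    *'password authentication failed'*) echo "verify GORDON_DATABASE_URL" ;;
    *'missing required env var'*)       echo "compare service env against .env.example" ;;
    *)                                  echo "unrecognised" ;;
  esac
}

# Usage (not run here; replace SERVICE with a real service name):
#   docker compose logs SERVICE --tail 50 |
#     while IFS= read -r line; do classify_startup_error "$line"; done |
#     sort | uniq -c
```

Lines that match none of the four signatures are reported as "unrecognised", so the summary also shows how much of the log the table does not cover.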

2. Check healthz and readyz

bash
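# Probe both endpoints on every service port; print FAIL on a non-2xx or timeout
for port in 8081 8082 8083 8084 8085; do
  for path in healthz readyz; do
    printf 'port %s %s: ' "$port" "$path"
    curl -fsS --max-time 2 "http://localhost:$port/$path" 2>&1 || echo FAIL
  done
done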

/healthz returns 503 if the service is alive but in a fatal state. /readyz returns 503 if warmup is incomplete or a role probe failed.
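The two status codes can be combined into a single state. This is a sketch under stated assumptions: `probe_state` and `status_code` are hypothetical helper names, and a healthy endpoint is assumed to answer 200.

```bash
# probe_state HEALTHZ_CODE READYZ_CODE -> fatal | not-ready | ready | unknown
# Hypothetical helper; assumes 200 is the healthy response for both endpoints.
probe_state() {
  if [ "$1" = "503" ]; then
    echo "fatal"
  elif [ "$1" = "200" ] && [ "$2" = "503" ]; then
    echo "not-ready"
  elif [ "$1" = "200" ] && [ "$2" = "200" ]; then
    echo "ready"
  else
    echo "unknown"
  fi
}

# status_code URL prints the HTTP status code; curl prints 000 if the request fails.
status_code() { curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1"; }

# Usage (not run here):
#   probe_state "$(status_code http://localhost:8081/healthz)" \
#               "$(status_code http://localhost:8081/readyz)"
```

An "unknown" result usually means the port is not listening at all, which points back to the role probe and startup logs rather than to warmup.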

3. Migration failed

If gordon-migrate exited non-zero, all services will fail to connect:

bash
docker compose logs gordon-migrate --tail 50

If a migration failed mid-run, the DB may be in a partial state. Verify:

bash
docker compose exec postgres psql -U gordon -d gordon \
  -c "SELECT version FROM public._sqlx_migrations ORDER BY version DESC LIMIT 5;"

Compare against the expected latest migration in gordon-migrate/migrations/.
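To find the expected latest version without reading the directory by eye, something like the following may help. It is a sketch that assumes the sqlx default file naming of `<version>_<description>.sql`; `latest_migration_version` is a hypothetical helper name.

```bash
# latest_migration_version DIR prints the highest numeric version prefix
# among files named <version>_<description>.sql (assumed sqlx naming).
latest_migration_version() {
  ls "$1" | sed -n 's/^\([0-9][0-9]*\)_.*\.sql$/\1/p' | sort -n | tail -n 1
}

# Usage (not run here):
#   latest_migration_version gordon-migrate/migrations
#   docker compose exec -T postgres psql -U gordon -d gordon -At \
#     -c "SELECT max(version) FROM public._sqlx_migrations;"
```

If the two numbers differ, the database is behind the migrations directory, and the gordon-migrate logs should show which migration stopped the run.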

NATS consumer fell behind

Symptom: services process events but with increasing lag; intents queue up without being consumed.

bash
# Check JetStream consumer lag via NATS CLI (if installed)
nats consumer info gordon-bus executor-default

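Look for Num Pending (the num_pending field in --json output; recent NATS CLI versions label it Unprocessed Messages). If it grows steadily across several checks, the consumer is not keeping up.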
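"Growing steadily" can be checked by sampling the pending count a few times. The sketch below assumes `jq` is installed and reads the `num_pending` field from the CLI's JSON output; `is_growing` and `sample_pending` are hypothetical helper names.

```bash
# is_growing N1 N2 ... -> exit 0 only if each sample is strictly larger than the previous.
is_growing() {
  [ "$#" -ge 2 ] || return 1
  prev=$1
  shift
  for n in "$@"; do
    [ "$n" -gt "$prev" ] || return 1
    prev=$n
  done
}

# sample_pending prints the current pending count (assumes jq is available).
sample_pending() {
  nats consumer info gordon-bus executor-default --json | jq '.num_pending'
}

# Usage (not run here): three samples, 30 seconds apart.
#   a=$(sample_pending); sleep 30
#   b=$(sample_pending); sleep 30
#   c=$(sample_pending)
#   is_growing "$a" "$b" "$c" && echo "consumer lag is growing"
```

A single high reading is not conclusive on its own; a burst of events can produce a large pending count that drains normally.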

Common causes:

  • gordon-executor is down or crash-looping (check /healthz).
  • The consumer's deliver subject is misconfigured — check service startup logs for consumer_create.
  • Database write bottleneck — check Postgres CPU and connection count.

Recovery:

bash
# Restart the lagging consumer service
docker compose restart gordon-executor

# Verify consumer lag is draining
nats consumer info gordon-bus executor-default

Halt latch is on

Symptom: gordon-risk reports halted=true; no new orders are being submitted.

The halt latch lives in trading.risk_state.halted. It is set by a circuit breaker trip and requires explicit operator action to clear.

1. Identify which breaker tripped

bash
docker compose exec postgres psql -U gordon -d gordon \
  -c "SET search_path=trading; SELECT * FROM risk_state ORDER BY updated_at DESC LIMIT 1;"

# Check recent risk events
docker compose exec postgres psql -U gordon -d gordon \
  -c "SET search_path=trading; SELECT breaker, event, created_at FROM risk_events ORDER BY created_at DESC LIMIT 10;"

Breakers: drawdown, connectivity, vpin, macro, correlation.

2. Investigate the root cause

| Breaker | Common cause | Investigation |
| --- | --- | --- |
| drawdown | Peak-to-trough equity exceeded threshold | Review trading.equity_points for the drawdown curve |
| connectivity | gordon-data source freshness exceeded budget | Check gordon_data_source_freshness_seconds in Grafana; verify gordon-data is ingesting |
| vpin | VPIN score above toxic-flow threshold | Check gordon_risk_vpin_value in Grafana |
| macro | FRED macro regime triggered | Check recent macro data in market_data.metrics |
| correlation | Cross-asset correlation density spiked | Check recent correlation data |

3. Clear the halt after fixing the root cause

Routing (DP-12 in flight): until DP-12 ships, POST /risk/resume goes directly to gordon-risk. After DP-12 ships, the request will route through the manager BFF at POST /bff/risk/resume.

bash
# Current (DP-12 not yet shipped)
curl -X POST http://localhost:8082/risk/resume \
  -H "x-operator-token: <operator-token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "root cause investigated and resolved"}'

Expected response: {"status":"resumed","resumed_bots":[...]}.
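A script that clears the halt should confirm the response rather than assume success. This is a minimal sketch: `resume_ok` is a hypothetical helper that only checks the `status` field shown in the expected response and uses grep so it does not depend on jq.

```bash
# resume_ok reads the response body on stdin; exit 0 if "status" is "resumed".
resume_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"resumed"'
}

# Usage (not run here; OPERATOR_TOKEN is an assumed environment variable):
#   body=$(curl -sS -X POST http://localhost:8082/risk/resume \
#     -H "x-operator-token: $OPERATOR_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d '{"reason": "root cause investigated and resolved"}')
#   printf '%s\n' "$body" | resume_ok || echo "resume did not succeed: $body"
```

After a successful resume, re-running the risk_state query from step 1 is a reasonable way to confirm that the halt latch is actually cleared.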

Postgres LISTEN/NOTIFY missed

Symptom: gordon-manager's WS fanout stops updating the console; channels like risk_halt_changed go stale.

The gordon-bus pg-NOTIFY outbox has a debounce window. On Postgres reconnect, the subscriber catches up from the last committed offset.

Diagnosis:

bash
docker compose logs gordon-manager --tail 50 | grep -E "notify|listen|reconnect|ipc"

If the manager lost the LISTEN connection:

bash
docker compose restart gordon-manager

The subscriber re-subscribes on startup and replays any events it missed since the last committed offset. No data is lost: the outbox pattern guarantees at-least-once delivery, so events may be replayed but are never dropped.
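At-least-once delivery means a replay can hand the same event to a consumer twice, so anything that post-processes replayed events should tolerate duplicates. The sketch below illustrates the idea on a text stream; the "event id in the first field" layout is an assumption made for illustration and is not the gordon-bus event schema.

```bash
# dedupe_by_id reads lines of the form "EVENT_ID PAYLOAD..." on stdin and
# prints only the first occurrence of each EVENT_ID (field layout is assumed).
dedupe_by_id() {
  awk '!seen[$1]++'
}

# Usage:
#   printf 'e1 halt\ne2 resume\ne1 halt\n' | dedupe_by_id
```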

BuildKit OOM during e2e

Symptom: make e2e fails with "rpc error: code = Unavailable desc = error reading from server: EOF" or "io: read/write on closed pipe".

Sequential builds (story 04, 2026-04-20) mean VM OOM is not expected on a default 8 GB Docker Desktop VM. If this still happens:

Check for re-introduced parallelism

bash
grep -n "compose.*--parallel" /path/to/gordon-workspace/Makefile

Expected: no matches. If --parallel N appears, revert it.
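The same check can run as a guard in CI or a pre-commit hook so that parallel builds do not creep back in. This is a sketch: `has_parallel_builds` is a hypothetical helper, and the Makefile path is passed in rather than assumed.

```bash
# has_parallel_builds FILE -> exit 0 if a compose invocation in FILE uses --parallel.
has_parallel_builds() {
  grep -Eq 'compose.*--parallel' "$1"
}

# Usage (not run here; substitute the real workspace path):
#   if has_parallel_builds /path/to/gordon-workspace/Makefile; then
#     echo "parallel compose build found; revert it" >&2
#     exit 1
#   fi
```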

Disk low

bash
make e2e-preflight

If [DISK_LOW] appears:

bash
make clean-workspace
make e2e-preflight
make e2e

For severe disk pressure:

bash
AGGRESSIVE=1 make clean-workspace
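The disk-low steps can be chained so cleanup runs only when the marker is present. The sketch assumes the [DISK_LOW] marker appears verbatim in the preflight output; whether it goes to stdout or stderr is not stated here, so both are captured. `disk_low` is a hypothetical helper name.

```bash
# disk_low reads preflight output on stdin; exit 0 if the [DISK_LOW] marker is present.
disk_low() {
  grep -Fq '[DISK_LOW]'
}

# Usage (not run here):
#   out=$(make e2e-preflight 2>&1)
#   if printf '%s\n' "$out" | disk_low; then
#     make clean-workspace && make e2e-preflight
#   fi
```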

Crash recovery

After a daemon crash that left orphaned containers:

bash
make e2e-recover
make e2e-preflight && make e2e

See e2e-testing for the full decision tree.

Service reports degraded boot

Symptom: service is alive (/healthz 200) but /readyz returns 503 with BOOT_DEGRADED code.

This means the service started but one or more non-fatal initialisation steps failed. The service continues operating at reduced capability.

bash
curl -fsS http://localhost:<port>/readyz | jq .
docker compose logs <service> --tail 100 | grep -E "DEGRADED|degraded|WARN"

Common non-fatal degraded causes:

  • Warmup data not fully available (bot) — see bot-warmup-fallback.
  • Optional upstream unreachable at boot (FRED macro data).
  • IPC subscriber start failed — manager loses freshness fanout but continues operating.

For each degraded cause, fix the upstream issue and restart the service. Degraded services should not run production traffic; investigate and recover promptly.
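After restarting a degraded service, it helps to wait for /readyz to pass before sending traffic. The following is a sketch of a generic retry loop; `wait_for` and `readyz_ok` are hypothetical helpers, and the attempt count and delay are arbitrary example values.

```bash
# wait_for ATTEMPTS DELAY CMD... runs CMD until it succeeds or attempts run out.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# readyz_ok PORT -> exit 0 if /readyz on that port answers with a 2xx status.
readyz_ok() {
  curl -fsS --max-time 2 -o /dev/null "http://localhost:$1/readyz"
}

# Usage (not run here; SERVICE and the port are placeholders):
#   docker compose restart SERVICE
#   wait_for 30 2 readyz_ok 8081 || echo "still not ready after 60s" >&2
```

If the loop times out, the /readyz body and the DEGRADED log lines from the commands above should show which initialisation step is still failing.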
