Troubleshooting

When to use: diagnosing runtime failures on the v7 stack.

Service won't start

Symptom: a service exits immediately or fails /healthz.

1. Check role probe

Every v7 service runs a DB role probe on startup. If the probe fails, the service exits with a structured error:

bash

docker compose logs <service> --tail 50 | grep -E "role_probe|FATAL|startup"

Common causes:

| Log signature | Cause | Fix |
| --- | --- | --- |
| `role "gordon_X" does not exist` | Migration has not run or failed | Re-run `gordon-migrate`: `docker compose run --rm gordon-migrate` |
| `permission denied for table` | Role grants not applied | Re-run `gordon-migrate` to re-apply grants |
| `password authentication failed` | Env var has wrong password | Verify `GORDON_DATABASE_URL` in docker-compose env |
| `missing required env var` | Config gap | Compare service env against `.env.example` |
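
When several services fail at once, it can help to reduce the logs to the causes listed above. The following is a minimal sketch: `classify_startup_error` is a hypothetical helper name, and the text it prints is only a paraphrase of the table, not output produced by the v7 stack.

```shell
# Hypothetical helper: map startup log lines (stdin) to the likely cause and fix
# from the table above. Lines that match none of the signatures are ignored.
classify_startup_error() {
  while IFS= read -r line; do
    case "$line" in
      *'role "gordon_'*'does not exist'*)
        echo "migration not run or failed: re-run gordon-migrate" ;;
      *'permission denied for table'*)
        echo "role grants not applied: re-run gordon-migrate" ;;
      *'password authentication failed'*)
        echo "wrong password: verify GORDON_DATABASE_URL" ;;
      *'missing required env var'*)
        echo "config gap: compare service env against .env.example" ;;
    esac
  done
}

# Usage (replace <service> with a real service name):
#   docker compose logs <service> --tail 50 | classify_startup_error | sort -u
```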

2. Check healthz and readyz

bash

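for port in 8081 8082 8083 8084 8085; do
  for endpoint in healthz readyz; do
    echo -n "port $port $endpoint: "
    # -f makes curl exit non-zero on HTTP errors such as 503
    curl -fsS --max-time 2 "http://localhost:$port/$endpoint" 2>&1 || echo FAIL
  done
done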

/healthz returns 503 if the service is alive but in a fatal state. /readyz returns 503 if warmup is incomplete or a role probe failed.
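
The two status codes can be read together. The sketch below encodes the rules stated above; `probe_state` and the labels it prints are hypothetical names chosen for illustration. It assumes the codes were captured with curl's `-w '%{http_code}'` option, which prints `000` when no HTTP response was received.

```shell
# Hypothetical helper: probe_state HEALTHZ_CODE READYZ_CODE
# Interprets the pair of HTTP status codes according to the rules above.
probe_state() {
  case "$1" in
    000) echo "down" ;;               # no HTTP response at all
    503) echo "alive-but-fatal" ;;    # /healthz 503: alive but in a fatal state
    200)
      case "$2" in
        200) echo "ready" ;;
        503) echo "alive-not-ready" ;;  # warmup incomplete or role probe failed
        *)   echo "unknown" ;;
      esac ;;
    *) echo "unknown" ;;
  esac
}

# Usage (port 8081 taken from the loop above):
#   h=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' http://localhost:8081/healthz)
#   r=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' http://localhost:8081/readyz)
#   probe_state "$h" "$r"
```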

3. Migration failed

If gordon-migrate exited non-zero, the roles and grants that services depend on may be missing, so every service will fail at startup. Check the migration logs first:

bash

docker compose logs gordon-migrate --tail 50

If a migration failed mid-run, the DB may be in a partial state. Verify:

bash

docker compose exec postgres psql -U gordon -d gordon \
  -c "SELECT version FROM public._sqlx_migrations ORDER BY version DESC LIMIT 5;"

Compare against the expected latest migration in gordon-migrate/migrations/.
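
To find the expected latest version without reading the directory by eye, something like the following may work. It is a sketch that assumes sqlx-style file names, where the version is the numeric prefix before the first underscore (for example a hypothetical `0010_grants.sql`); `latest_migration_version` is an illustrative name.

```shell
# Hypothetical helper: latest_migration_version DIR
# Prints the highest numeric version prefix among DIR/*.sql.
# Assumes names of the form <version>_<description>.sql.
latest_migration_version() {
  for f in "$1"/*.sql; do
    [ -e "$f" ] || continue      # directory has no .sql files
    name=${f##*/}                # strip the directory part
    echo "${name%%_*}"           # keep everything before the first underscore
  done | sort -n | tail -n 1
}

# Usage:
#   latest_migration_version gordon-migrate/migrations
```

Compare the printed value numerically with the top row of the query above, since the database is likely to store the version as an integer without leading zeros.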

NATS consumer fell behind

Symptom: services process events but with increasing lag; intents queue up without being consumed.

bash

# Check JetStream consumer lag via NATS CLI (if installed)
nats consumer info gordon-bus executor-default

Look at Num Pending: if it grows steadily across repeated runs, the consumer is not keeping up.

Common causes:

gordon-executor is down or crash-looping (check /healthz).
The consumer's deliver subject is misconfigured — check service startup logs for consumer_create.
Database write bottleneck — check Postgres CPU and connection count.

Recovery:

bash

# Restart the lagging consumer service
docker compose restart gordon-executor

# Verify consumer lag is draining
nats consumer info gordon-bus executor-default
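
Whether lag is growing or draining only shows up across several samples. The sketch below classifies a series of Num Pending values by comparing the first and last sample; `lag_trend` is a hypothetical helper. The sampling command in the comment assumes the NATS CLI's `--json` output and `jq` are available.

```shell
# Hypothetical helper: lag_trend N1 N2 [N3 ...]
# Takes Num Pending samples, oldest first, and prints growing, draining, or flat.
lag_trend() {
  if [ "$#" -lt 2 ]; then
    echo "need at least two samples" >&2
    return 2
  fi
  first=$1
  for last in "$@"; do :; done   # leaves the final argument in $last
  if [ "$last" -gt "$first" ]; then
    echo "growing"
  elif [ "$last" -lt "$first" ]; then
    echo "draining"
  else
    echo "flat"
  fi
}

# Collecting one sample (assumes --json and jq):
#   nats consumer info gordon-bus executor-default --json | jq '.num_pending'
```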

Halt latch is on

Symptom: gordon-risk reports halted=true; no new orders are being submitted.

The halt latch lives in trading.risk_state.halted. It is set by a circuit breaker trip and requires explicit operator action to clear.

1. Identify which breaker tripped

bash

docker compose exec postgres psql -U gordon -d gordon \
  -c "SET search_path=trading; SELECT * FROM risk_state ORDER BY updated_at DESC LIMIT 1;"

# Check recent risk events
docker compose exec postgres psql -U gordon -d gordon \
  -c "SET search_path=trading; SELECT breaker, event, created_at FROM risk_events ORDER BY created_at DESC LIMIT 10;"

Breakers: drawdown, connectivity, vpin, macro, correlation.

2. Investigate the root cause

| Breaker | Common cause | Investigation |
| --- | --- | --- |
| `drawdown` | Peak-to-trough equity exceeded threshold | Review `trading.equity_points` for the drawdown curve |
| `connectivity` | gordon-data source freshness exceeded budget | Check `gordon_data_source_freshness_seconds` in Grafana; verify gordon-data is ingesting |
| `vpin` | VPIN score above toxic-flow threshold | Check `gordon_risk_vpin_value` in Grafana |
| `macro` | FRED macro regime triggered | Check recent macro data in `market_data.metrics` |
| `correlation` | Cross-asset correlation density spiked | Check recent correlation data |

3. Clear the halt after fixing the root cause

Routing note (DP-12 in flight): POST /risk/resume currently goes directly to gordon-risk. Once DP-12 ships, the call will route through the manager BFF at POST /bff/risk/resume.

bash

# Current (DP-12 not yet shipped)
curl -X POST http://localhost:8082/risk/resume \
  -H "x-operator-token: <operator-token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "root cause investigated and resolved"}'

Expected response: {"status":"resumed","resumed_bots":[...]}.
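
In a script, the response can be checked rather than read by eye. This sketch tests only for the `status` field shown in the expected response; `resume_ok` is a hypothetical helper, and the pattern match is a rough stand-in for proper JSON parsing.

```shell
# Hypothetical helper: reads the response body on stdin and succeeds
# only if it contains "status":"resumed".
resume_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"resumed"'
}

# Usage: pipe the curl call above (with -s added) into the check:
#   curl -s -X POST http://localhost:8082/risk/resume ... | resume_ok \
#     && echo "halt cleared" || echo "halt NOT cleared"
```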

Postgres LISTEN/NOTIFY missed

Symptom: gordon-manager's WS fanout stops updating the console; channels like risk_halt_changed go stale.

The gordon-bus pg-NOTIFY outbox has a debounce window. On Postgres reconnect, the subscriber catches up from the last committed offset.

Diagnosis:

bash

docker compose logs gordon-manager --tail 50 | grep -E "notify|listen|reconnect|ipc"

If the manager lost the LISTEN connection:

bash

docker compose restart gordon-manager

The subscriber re-subscribes on startup and replays any events it missed since the last committed offset. No data is lost: the outbox pattern guarantees at-least-once delivery, so a replayed event may arrive more than once.

BuildKit OOM during e2e

Symptom: make e2e fails with rpc error: code = Unavailable desc = error reading from server: EOF or io: read/write on closed pipe.

Builds have run sequentially since story 04 (2026-04-20), so a VM OOM is not expected on a default 8 GB Docker Desktop VM. If it still happens, work through the checks below.

Check for re-introduced parallelism

bash

grep -n "compose.*--parallel" /path/to/gordon-workspace/Makefile

Expected: no matches. If --parallel N appears, revert it.
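
The same check can act as a guard that fails when parallelism comes back, for example before running e2e. A minimal sketch, reusing the grep pattern above; `check_no_parallel` is a hypothetical helper name.

```shell
# Hypothetical helper: check_no_parallel MAKEFILE
# Succeeds when no line matches the pattern; otherwise prints the
# offending lines (with line numbers) and fails.
check_no_parallel() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 2
  fi
  if grep -n -e 'compose.*--parallel' "$1"; then
    echo "parallel builds re-introduced in $1" >&2
    return 1
  fi
}

# Usage:
#   check_no_parallel /path/to/gordon-workspace/Makefile && make e2e
```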

Disk low

bash

make e2e-preflight

If [DISK_LOW] appears:

bash

make clean-workspace
make e2e-preflight
make e2e

For severe disk pressure:

bash

AGGRESSIVE=1 make clean-workspace

Crash recovery

After a daemon crash that left orphaned containers:

bash

make e2e-recover
make e2e-preflight && make e2e

See e2e-testing for the full decision tree.

Service reports degraded boot

Symptom: the service is alive (/healthz returns 200) but /readyz returns 503 with code BOOT_DEGRADED.

This means the service started but one or more non-fatal initialisation steps failed. The service continues operating at reduced capability.

bash

curl -fsS http://localhost:<port>/readyz | jq .
docker compose logs <service> --tail 100 | grep -E "DEGRADED|degraded|WARN"

Common non-fatal degraded causes:

Warmup data not fully available (bot) — see bot-warmup-fallback.
Optional upstream unreachable at boot (FRED macro data).
IPC subscriber start failed — manager loses freshness fanout but continues operating.

For each degraded cause, fix the upstream issue and restart the service. Degraded services should not run production traffic; investigate and recover promptly.
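
After the restart, /readyz may take a while to return 200, so it can be polled instead of checked once. The following is a generic retry sketch; `wait_until` is a hypothetical helper, and the retry count and delay in the usage line are arbitrary.

```shell
# Hypothetical helper: wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping DELAY_SECONDS between attempts.
# Succeeds as soon as COMMAND succeeds; fails if every attempt fails.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  while [ "$tries" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (replace <service> and <port>):
#   docker compose restart <service>
#   wait_until 30 2 curl -fsS --max-time 2 http://localhost:<port>/readyz \
#     && echo "ready" || echo "still degraded"
```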

Related

Incident response: escalated actions (emergency flatten, quarantine clear)
Monitoring: Prometheus, Grafana, Loki queries
E2E testing: full e2e failure decision tree
Bot warmup fallback: strict/degraded boot semantics
