Monitoring
When to use: routine health checks, incident investigation, confirming alerts coverage after adding a new metric.
Prometheus, Grafana, and Loki all run on srv-apps as part of the support-services stack. The runtime v7 services (gordon-data, gordon-risk, gordon-manager, gordon-bot, gordon-executor) expose /metrics on their respective ports.
Alerts-coverage gate (DP-04)
Every Prometheus metric family registered by a v7 service MUST be paired with at least one alert rule in srv-apps/services/prometheus/rules/ OR appear in scripts/alerts-coverage-exemptions.txt with an explicit category tag (RED / PROCESS / DEBUG / COVERED).
The gate is enforced by scripts/check-alerts-coverage.sh in .githooks/pre-push. Adding an exemption is a deliberate decision — the reviewer's question is "is the failure mode this metric represents page-worthy?"
Adding a new metric without a paired alert rule or exemption entry will block the push.
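The gate's logic amounts to a set difference: registered metric families, minus those mentioned in a rule file, minus those exempted. The sketch below is not the contents of scripts/check-alerts-coverage.sh, which this page does not show. It assumes, hypothetically, that family names arrive one per line on stdin, that a family counts as paired when its name appears anywhere in a rule file, and that each exemption line holds the metric name followed by whitespace and the category tag. The function name `uncovered_metrics` is also illustrative.

```shell
# Hypothetical sketch of an alerts-coverage check; the real gate may differ.
# Usage: uncovered_metrics RULES_DIR EXEMPTIONS_FILE < metric-families.txt
# Prints every metric family that has neither a rule mention nor an exemption.
uncovered_metrics() {
  rules_dir=$1
  exemptions=$2
  while IFS= read -r family; do
    [ -n "$family" ] || continue
    # Paired: the family name appears in at least one rule file.
    # (Assumption: a plain substring match is good enough for a sketch.)
    if grep -rqF -- "$family" "$rules_dir" 2>/dev/null; then
      continue
    fi
    # Exempted: first field is the family, second field is one of the
    # four category tags (assumed file layout).
    if awk -v m="$family" '
         $1 == m && $2 ~ /^(RED|PROCESS|DEBUG|COVERED)$/ { found = 1 }
         END { exit (found ? 0 : 1) }' "$exemptions" 2>/dev/null; then
      continue
    fi
    printf '%s\n' "$family"
  done
}
```

An empty result means every family is covered. A pre-push hook built on this idea would exit non-zero whenever the function prints anything.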
Service metric families
gordon-data (port 8081)
| Metric family | Type | What it tells you |
|---|---|---|
| gordon_data_klines_written_total | Counter | Kline rows written to market_data.* |
| gordon_data_ingest_ws_reconnects_total | Counter | WebSocket reconnect events per source |
| gordon_data_source_freshness_seconds | Gauge | Seconds since last write per source |
| gordon_data_backfill_jobs_active | Gauge | Active backfill jobs |
| gordon_data_backfill_rows_fetched_total | Counter | Rows fetched in backfill |
| gordon_data_source_quarantined | Gauge | 1 if a source is quarantined |
gordon-risk (port 8082)
| Metric family | Type | What it tells you |
|---|---|---|
| gordon_risk_breaker_state | Gauge | 1=tripped per breaker (drawdown/connectivity/vpin/macro/correlation) |
| gordon_risk_halt_latch | Gauge | 1 when halt-latch is active |
| gordon_risk_portfolio_drawdown_pct | Gauge | Peak-to-trough equity drawdown |
| gordon_risk_vpin_value | Gauge | Current VPIN score |
| gordon_risk_scheduler_ticks_total | Counter | Risk evaluation ticks |
| gordon_risk_escalation_steps_total | Counter | Escalation state machine steps |
gordon-manager (port 8083)
| Metric family | Type | What it tells you |
|---|---|---|
| gordon_manager_bots_active | Gauge | Running bot container count |
| gordon_manager_reconciler_drift_total | Counter | Reconciler drift events |
| gordon_manager_deploy_steps_total | Counter | Green/blue deploy steps |
| gordon_manager_backtest_duration_seconds | Histogram | Backtest wall-clock time |
| gordon_manager_ws_subscribers | Gauge | Active WS console subscribers |
gordon-executor (port 8085)
| Metric family | Type | What it tells you |
|---|---|---|
| gordon_executor_orders_submitted_total | Counter | Orders submitted to Binance |
| gordon_executor_orders_filled_total | Counter | Confirmed fills |
| gordon_executor_cap_rejections_total | Counter | Notional cap rejections |
| gordon_executor_reconcile_anomalies_total | Counter | Reconcile anomalies on startup |
| gordon_executor_quarantine_state | Gauge | 1 when executor is quarantined |
| gordon_executor_flatten_steps_total | Counter | Emergency flatten steps |
gordon-bot (port 8084)
| Metric family | Type | What it tells you |
|---|---|---|
| gordon_bot_candles_processed_total | Counter | Candles evaluated by the strategy loop |
| gordon_bot_intents_emitted_total | Counter | Trading intents emitted |
| gordon_bot_warmup_duration_seconds | Histogram | Warmup call latency |
| gordon_bot_lease_renewals_total | Counter | Advisory lock renewals |
| gordon_bot_strategy_errors_total | Counter | Strategy evaluation errors |
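To compare the tables above against a running service, the `# TYPE` comment lines of the Prometheus text exposition format give each family's name and type. The following is a minimal sketch: `list_metric_families` is an illustrative helper name and not a script in the repository, and the exact family names a service reports may differ slightly from the tables depending on the client library.

```shell
# Reads Prometheus text exposition on stdin and prints "family type" pairs
# for families whose name starts with the given prefix, sorted by name.
# list_metric_families is an illustrative helper, not a repo script.
list_metric_families() {
  prefix=$1
  awk -v p="$prefix" '
    $1 == "#" && $2 == "TYPE" && index($3, p) == 1 { print $3, $4 }
  ' | sort
}

# Example against gordon-data (host and port taken from this page):
#   curl -fsS --max-time 2 http://srv-apps:8081/metrics | list_metric_families gordon_data_
```

Filtering by prefix drops runtime and process metrics that the client library registers on its own, so the output lines up with one table at a time.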
Alert rules
Alert rules live in srv-apps/services/prometheus/rules/gordon-*.yml. One file per service. The alerts-coverage gate enforces that every metric family above is either paired with a rule or exempted.
Key alerts:
| Alert | Condition | Severity |
|---|---|---|
| GordonServiceDown | up == 0 for 2 minutes | critical |
| GordonHaltLatchActive | gordon_risk_halt_latch == 1 for 1 minute | critical |
| GordonBreakerTripped | any gordon_risk_breaker_state == 1 | critical |
| GordonHighDrawdown | gordon_risk_portfolio_drawdown_pct > 15 for 5 minutes | warning |
| GordonCriticalDrawdown | gordon_risk_portfolio_drawdown_pct > 20 for 1 minute | critical |
| GordonDataSourceStale | gordon_data_source_freshness_seconds > 300 | warning |
| GordonExecutorQuarantined | gordon_executor_quarantine_state == 1 | critical |
| GordonSourceQuarantined | gordon_data_source_quarantined == 1 | warning |
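Each row of the table maps onto one Prometheus alerting rule: the comparison becomes `expr`, the duration becomes `for`, and the severity becomes a label. The sketch below prints such an entry. The helper name `render_alert_rule`, the `severity` label key, and the indentation are assumptions for illustration; the actual files under srv-apps/services/prometheus/rules/ may be laid out differently.

```shell
# Prints one Prometheus alerting-rule entry for the given alert name,
# PromQL expression, hold duration, and severity.
# render_alert_rule is a hypothetical helper for illustration only.
render_alert_rule() {
  cat <<EOF
  - alert: $1
    expr: $2
    for: $3
    labels:
      severity: $4
EOF
}

# Example: the halt-latch row from the table above.
#   render_alert_rule GordonHaltLatchActive 'gordon_risk_halt_latch == 1' 1m critical
```

The printed entry belongs under a group's `rules:` key in a rule file, which is where the alerts-coverage gate would look for the metric name.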
Grafana
Grafana runs on srv-apps. Dashboards:
- Gordon Overview — service up/down, halt state, drawdown, open positions.
- Gordon Data — source freshness per source, klines write rate, backfill progress.
- Gordon Executor — order submit rate, fill rate, cap rejection rate, reconcile anomalies.
- Gordon Risk — breaker states timeline, VPIN value, escalation steps.
- Gordon Bot — candle processing rate, intent rate, warmup durations.
Loki
Loki runs on srv-apps. Every v7 service emits structured JSON logs with trace_id, service, event, and code fields. Useful LogQL:
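# All errors from gordon-executor (the 15-minute window comes from the query's time range, for example the Grafana time picker; LogQL has no "last 15m" stage)
{service="gordon-executor"} |= `"level":"error"`
# Halt-latch events
{service="gordon-risk"} |= `halt_latch`
# Specific error code
{service="gordon-bot"} |= `BOT_WARMUP_INCOMPLETE`
# Trace correlation across services (Loki rejects an empty {} selector, so match every gordon-* service)
{service=~"gordon-.+"} |= `"trace_id":"<uuid>"`
Healthz and readyz probes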
Every v7 service exposes:
- GET /healthz: liveness. Returns 200 if the process is alive. Returns 503 if the service failed startup or is in a fatal state.
- GET /readyz: readiness. Returns 200 only when the service has completed warmup, acquired leases, and is ready to serve traffic. Returns 503 if role probe failed, warmup incomplete, or source stale past budget.
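Taken together, the two probes yield three practical states. The sketch below classifies a service from the HTTP status codes of /healthz and /readyz. The state labels and the helper names `classify_probe` and `http_code` are illustrative and not part of any service API.

```shell
# Maps the HTTP status codes of /healthz and /readyz to a coarse state.
# The state labels (ready, alive-not-ready, down) are illustrative names.
classify_probe() {
  healthz_code=$1
  readyz_code=$2
  if [ "$healthz_code" != 200 ]; then
    echo down              # failed startup, fatal state, or unreachable
  elif [ "$readyz_code" = 200 ]; then
    echo ready             # warmup done, leases held, serving traffic
  else
    echo alive-not-ready   # process is up but should not receive traffic
  fi
}

# Prints only the HTTP status code of a URL; curl prints 000 when the
# request gets no response at all.
http_code() {
  curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1" || true
}

# Example for gordon-risk (host and port taken from this page):
#   h=$(http_code http://srv-apps:8082/healthz)
#   r=$(http_code http://srv-apps:8082/readyz)
#   classify_probe "$h" "$r"
```

A service that stays in alive-not-ready is usually still warming up or waiting on a lease or a stale source, which points at /readyz conditions and away from a crashed process.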
Quick multi-service health check:
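# Probe /healthz on each v7 service port and print one status line per port.
for port in 8081 8082 8083 8084 8085; do
  printf 'port %s: ' "$port"
  # -f turns HTTP errors such as 503 into a non-zero exit; the response body is discarded.
  if curl -fsS -o /dev/null --max-time 2 "http://srv-apps:${port}/healthz"; then echo ok; else echo FAIL; fi
done
Daily watch list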
- All five services report 200 on /healthz.
- gordon_risk_halt_latch is 0.
- gordon_risk_portfolio_drawdown_pct is below threshold.
- gordon_data_source_freshness_seconds is within cadence for every active source.
- No gordon_executor_quarantine_state == 1.
- Loki shows no unexpected level=error events in the last hour.
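Several watch-list items can be read straight from a service's /metrics endpoint without going through Grafana. The sketch below pulls sample values out of the Prometheus text exposition format and checks that a flag-style gauge is zero. The helper names `metric_values` and `all_zero` are hypothetical, and the parsing assumes that samples carry no trailing timestamp and that label values contain no spaces.

```shell
# Reads Prometheus text exposition on stdin and prints the value of every
# sample of the named metric, one per line (one line per label set).
# Assumes no trailing timestamps and no spaces inside label values.
metric_values() {
  awk -v m="$1" '
    /^#/ { next }                                  # skip HELP/TYPE comments
    $1 == m || index($1, m "{") == 1 { print $NF }
  '
}

# Succeeds only if the metric is present and every sample equals zero.
all_zero() {
  metric_values "$1" |
    awk '{ n++ } $1 != 0 { bad = 1 } END { exit ((n > 0 && !bad) ? 0 : 1) }'
}

# Example: the halt-latch item (host and port taken from this page).
#   if curl -fsS --max-time 2 http://srv-apps:8082/metrics | all_zero gordon_risk_halt_latch
#   then echo "halt latch clear"
#   else echo "halt latch ACTIVE or metric missing"
#   fi
```

Treating a missing metric as a failure is deliberate in this sketch: an absent gauge usually means the scrape itself failed, which is worth noticing during a daily check.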
Related
- Incident response — what to do when an alert fires
- Troubleshooting — common failure diagnostics
- Live trading — alert thresholds calibration