Monitoring

When to use: routine health checks, incident investigation, and confirming alerts coverage after adding a new metric.

Prometheus, Grafana, and Loki all run on srv-apps as part of the support-services stack. The runtime v7 services (gordon-data, gordon-risk, gordon-manager, gordon-bot, gordon-executor) expose /metrics on their respective ports.

Alerts-coverage gate (DP-04)

Every Prometheus metric family registered by a v7 service MUST be paired with at least one alert rule in srv-apps/services/prometheus/rules/ OR appear in scripts/alerts-coverage-exemptions.txt with an explicit category tag (RED / PROCESS / DEBUG / COVERED).

The gate is enforced by scripts/check-alerts-coverage.sh, which runs from the .githooks/pre-push hook. Adding an exemption is a deliberate decision: the reviewer's question is "is the failure mode this metric represents page-worthy?"

Adding a new metric without a paired alert rule or exemption entry will block the push.
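
The pairing rule can be illustrated with a minimal sketch. This is not the real scripts/check-alerts-coverage.sh. It assumes metric family names arrive one per line on stdin and that each exemptions line has the form `<metric_family> <CATEGORY>`. Both the function name and that file format are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real scripts/check-alerts-coverage.sh.
# Assumed (hypothetical) exemptions format: "<metric_family> <CATEGORY>" per line.

# uncovered_metrics RULES_DIR EXEMPTIONS_FILE
# Reads metric family names from stdin, one per line. Prints each family that
# is neither referenced by a rule file nor exempted. Returns 1 if any printed.
uncovered_metrics() {
  local rules_dir=$1 exemptions=$2 metric missing=0
  while IFS= read -r metric; do
    [ -z "$metric" ] && continue
    # Covered: the name appears as a whole word somewhere under RULES_DIR.
    if grep -rqw -- "$metric" "$rules_dir" 2>/dev/null; then
      continue
    fi
    # Exempted: the name is the first field, followed by a known category tag.
    if grep -Eq -- "^${metric}[[:space:]]+(RED|PROCESS|DEBUG|COVERED)([[:space:]]|\$)" \
         "$exemptions" 2>/dev/null; then
      continue
    fi
    echo "$metric"
    missing=1
  done
  return "$missing"
}
```

For example, `printf 'gordon_risk_vpin_value\n' | uncovered_metrics srv-apps/services/prometheus/rules scripts/alerts-coverage-exemptions.txt` prints the name and returns 1 when the family has neither a rule nor an exemption.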

Service metric families

gordon-data (port 8081)

Metric family	Type	What it tells you
`gordon_data_klines_written_total`	Counter	Kline rows written to market_data.*
`gordon_data_ingest_ws_reconnects_total`	Counter	WebSocket reconnect events per source
`gordon_data_source_freshness_seconds`	Gauge	Seconds since last write per source
`gordon_data_backfill_jobs_active`	Gauge	Active backfill jobs
`gordon_data_backfill_rows_fetched_total`	Counter	Rows fetched in backfill
`gordon_data_source_quarantined`	Gauge	1 if a source is quarantined

gordon-risk (port 8082)

Metric family	Type	What it tells you
`gordon_risk_breaker_state`	Gauge	1=tripped per breaker (drawdown/connectivity/vpin/macro/correlation)
`gordon_risk_halt_latch`	Gauge	1 when halt-latch is active
`gordon_risk_portfolio_drawdown_pct`	Gauge	Peak-to-trough equity drawdown
`gordon_risk_vpin_value`	Gauge	Current VPIN score
`gordon_risk_scheduler_ticks_total`	Counter	Risk evaluation ticks
`gordon_risk_escalation_steps_total`	Counter	Escalation state machine steps

gordon-manager (port 8083)

Metric family	Type	What it tells you
`gordon_manager_bots_active`	Gauge	Running bot container count
`gordon_manager_reconciler_drift_total`	Counter	Reconciler drift events
`gordon_manager_deploy_steps_total`	Counter	Blue/green deploy steps
`gordon_manager_backtest_duration_seconds`	Histogram	Backtest wall-clock time
`gordon_manager_ws_subscribers`	Gauge	Active WS console subscribers

gordon-executor (port 8085)

Metric family	Type	What it tells you
`gordon_executor_orders_submitted_total`	Counter	Orders submitted to Binance
`gordon_executor_orders_filled_total`	Counter	Confirmed fills
`gordon_executor_cap_rejections_total`	Counter	Notional cap rejections
`gordon_executor_reconcile_anomalies_total`	Counter	Reconcile anomalies on startup
`gordon_executor_quarantine_state`	Gauge	1 when executor is quarantined
`gordon_executor_flatten_steps_total`	Counter	Emergency flatten steps

gordon-bot (port 8084)

Metric family	Type	What it tells you
`gordon_bot_candles_processed_total`	Counter	Candles evaluated by the strategy loop
`gordon_bot_intents_emitted_total`	Counter	Trading intents emitted
`gordon_bot_warmup_duration_seconds`	Histogram	Warmup call latency
`gordon_bot_lease_renewals_total`	Counter	Advisory lock renewals
`gordon_bot_strategy_errors_total`	Counter	Strategy evaluation errors
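
To confirm that a family from the tables above is actually exported, the /metrics output can be filtered by name. The helper below is a sketch (the function name is hypothetical). It reads the Prometheus text exposition format on stdin, so it can be fed from curl or from a saved scrape.

```shell
#!/usr/bin/env bash
# family_samples FAMILY
# Prints the sample lines for FAMILY from exposition-format text on stdin.
# "# HELP" and "# TYPE" comment lines are skipped because they start with "#".
# Histogram families also match their _bucket, _sum, and _count series.
family_samples() {
  local family=$1
  # The trailing [{ ] stops a short name from matching a longer family.
  grep -E "^${family}(_bucket|_sum|_count)?[{ ]" || true
}

# Example (requires network access to srv-apps):
#   curl -fsS --max-time 2 http://srv-apps:8081/metrics \
#     | family_samples gordon_data_source_freshness_seconds
```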

Alert rules

Alert rules live in srv-apps/services/prometheus/rules/gordon-*.yml. One file per service. The alerts-coverage gate enforces that every metric family above is either paired with a rule or exempted.

Key alerts:

Alert	Condition	Severity
`GordonServiceDown`	`up == 0` for 2 minutes	critical
`GordonHaltLatchActive`	`gordon_risk_halt_latch == 1` for 1 minute	critical
`GordonBreakerTripped`	any `gordon_risk_breaker_state == 1`	critical
`GordonHighDrawdown`	`gordon_risk_portfolio_drawdown_pct > 15` for 5 minutes	warning
`GordonCriticalDrawdown`	`gordon_risk_portfolio_drawdown_pct > 20` for 1 minute	critical
`GordonDataSourceStale`	`gordon_data_source_freshness_seconds > 300`	warning
`GordonExecutorQuarantined`	`gordon_executor_quarantine_state == 1`	critical
`GordonSourceQuarantined`	`gordon_data_source_quarantined == 1`	warning
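
As an illustration of how one table row maps onto a rule file, the sketch below writes a standalone rule for `GordonHaltLatchActive`. The alert name, expression, duration, and severity come from the table. The group name, annotation text, and helper function name are assumptions, and the real files under srv-apps/services/prometheus/rules/ may be organised differently.

```shell
#!/usr/bin/env bash
# write_example_rule FILE
# Writes a one-rule Prometheus rule file to FILE (illustrative only).
write_example_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: gordon-risk            # hypothetical group name
    rules:
      - alert: GordonHaltLatchActive
        expr: gordon_risk_halt_latch == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Risk halt-latch is active"   # hypothetical wording
EOF
}
```

If promtool is installed, `promtool check rules <file>` validates the generated file before it is committed.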

Grafana

Grafana runs on srv-apps. Dashboards:

Gordon Overview — service up/down, halt state, drawdown, open positions.
Gordon Data — source freshness per source, klines write rate, backfill progress.
Gordon Executor — order submit rate, fill rate, cap rejection rate, reconcile anomalies.
Gordon Risk — breaker states timeline, VPIN value, escalation steps.
Gordon Bot — candle processing rate, intent rate, warmup durations.

Loki

Loki runs on srv-apps. Every v7 service emits structured JSON logs with trace_id, service, event, and code fields. Useful LogQL:

logql

# All errors from gordon-executor (set the query time range to the last 15 minutes; LogQL has no in-query time filter)
{service="gordon-executor"} |= `"level":"error"`

# Halt-latch events
{service="gordon-risk"} |= `halt_latch`

# Specific error code
{service="gordon-bot"} |= `BOT_WARMUP_INCOMPLETE`

# Trace correlation across services (a selector needs at least one non-empty label matcher)
{service=~"gordon-.+"} |= `"trace_id":"<uuid>"`
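
The same queries can be run from a shell with logcli, where the time window is a flag rather than part of the query. The sketch below only builds the query strings. The function names are hypothetical, and the `service` label and JSON field names are taken from the examples in this section.

```shell
#!/usr/bin/env bash
# errors_query SERVICE: error-level lines for a single service.
errors_query() {
  printf '{service="%s"} |= `"level":"error"`\n' "$1"
}

# trace_query TRACE_ID: one trace across every gordon-* service.
# LogQL needs at least one non-empty label matcher, hence the regex on service.
trace_query() {
  printf '{service=~"gordon-.+"} |= `"trace_id":"%s"`\n' "$1"
}

# Example (assumes logcli is installed and LOKI_ADDR points at Loki):
#   logcli query --since=15m "$(errors_query gordon-executor)"
```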

Healthz and readyz probes

Every v7 service exposes:

GET /healthz — liveness. Returns 200 if the process is alive. Returns 503 if the service failed startup or is in a fatal state.
GET /readyz — readiness. Returns 200 only when the service has completed warmup, acquired leases, and is ready to serve traffic. Returns 503 if role probe failed, warmup incomplete, or source stale past budget.

Quick multi-service health check:

bash

for port in 8081 8082 8083 8084 8085; do
  echo -n "port $port: "
  curl -fsS --max-time 2 http://srv-apps:$port/healthz && echo ok || echo FAIL
done
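
A readiness variant of the same loop is sketched below. It reports the HTTP status instead of only ok/FAIL, which helps separate "not ready" (503) from "unreachable". The function names are hypothetical, and the status meanings follow the probe descriptions above.

```shell
#!/usr/bin/env bash
# probe_status URL: prints the HTTP status code. curl prints 000 when the
# request fails before any response arrives.
probe_status() {
  curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1" || true
}

# describe_ready CODE: interprets a /readyz status code.
describe_ready() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not ready (role probe, warmup, leases, or stale source)" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# check_readyz: probes /readyz on all five service ports.
check_readyz() {
  local port
  for port in 8081 8082 8083 8084 8085; do
    echo "port $port: $(describe_ready "$(probe_status "http://srv-apps:$port/readyz")")"
  done
}
```

Run `check_readyz` from a host that can reach srv-apps.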

Daily watch list

All five services report 200 on /healthz.
gordon_risk_halt_latch is 0.
gordon_risk_portfolio_drawdown_pct is below 15 (the GordonHighDrawdown warning threshold).
gordon_data_source_freshness_seconds is within cadence for every active source.
No gordon_executor_quarantine_state == 1.
Loki shows no unexpected level=error events in the last hour.
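
Part of this list can be checked mechanically from /metrics output. The sketch below covers only the halt-latch, drawdown, and quarantine items. The function name is hypothetical, the drawdown limit of 15 is borrowed from the GordonHighDrawdown rule, and the parser assumes sample lines carry no trailing timestamp.

```shell
#!/usr/bin/env bash
# watchlist_violations
# Reads exposition-format text on stdin, prints "FAIL <line>" for each
# watch-list violation, and returns 1 if any were found.
watchlist_violations() {
  awk '
    /^#/ || NF < 2 { next }               # skip HELP/TYPE comments and blanks
    {
      name = $1; sub(/[{].*/, "", name)   # strip the label set from the name
      value = $NF + 0                     # assumes no trailing timestamp
    }
    (name == "gordon_risk_halt_latch" || name == "gordon_executor_quarantine_state") && value != 0 {
      print "FAIL " $0; bad = 1
    }
    name == "gordon_risk_portfolio_drawdown_pct" && value > 15 {
      print "FAIL " $0; bad = 1
    }
    END { exit bad + 0 }
  '
}

# Example (requires network access to srv-apps):
#   { curl -fsS http://srv-apps:8082/metrics; curl -fsS http://srv-apps:8085/metrics; } \
#     | watchlist_violations
```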

Related

Incident response — what to do when an alert fires
Troubleshooting — common failure diagnostics
Live trading — alert thresholds calibration
