
Monitoring

When to use: routine health checks, incident investigation, and confirming alert coverage after adding a new metric.

Prometheus, Grafana, and Loki all run on srv-apps as part of the support-services stack. The runtime v7 services (gordon-data, gordon-risk, gordon-manager, gordon-bot, gordon-executor) expose /metrics on their respective ports.
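
To see which metric families a service actually registers, the `# TYPE` lines of its /metrics output can be listed. This is a minimal sketch, assuming the services use the standard Prometheus text exposition format; `list_families` is a hypothetical helper, not part of the repository.

```bash
# Print "<family> <type>" for every "# TYPE" line read from stdin
# (Prometheus text exposition format), sorted by family name.
list_families() {
  awk '$1 == "#" && $2 == "TYPE" { print $3, $4 }' | sort
}

# Usage (host name and port as listed on this page):
#   curl -fsS --max-time 2 http://srv-apps:8081/metrics | list_families
```

The output can be compared against the tables under "Service metric families" when a metric is added or renamed.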

Alerts-coverage gate (DP-04)

Every Prometheus metric family registered by a v7 service MUST be paired with at least one alert rule in srv-apps/services/prometheus/rules/ OR appear in scripts/alerts-coverage-exemptions.txt with an explicit category tag (RED / PROCESS / DEBUG / COVERED).

The gate is enforced by scripts/check-alerts-coverage.sh, which runs from the .githooks/pre-push hook. Adding an exemption is a deliberate decision; the reviewer's question is "is the failure mode this metric represents page-worthy?"

Adding a new metric without a paired alert rule or exemption entry will block the push.
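
The rule the gate applies can be sketched as follows. This is an illustration only, not the contents of scripts/check-alerts-coverage.sh: the function name, the one-name-per-line input file, and the `<family> <CATEGORY>` exemption format are all assumptions.

```bash
# Sketch only: NOT the real scripts/check-alerts-coverage.sh.
# Prints every metric family that is neither referenced in a rule file
# nor exempted, and returns 1 if any were printed.
#
# Assumed inputs (hypothetical formats):
#   $1  file with one metric family name per line
#   $2  directory containing Prometheus rule files
#   $3  exemptions file, "<family> <CATEGORY>" per line
uncovered_families() {
  local families=$1 rules_dir=$2 exemptions=$3 status=0 family
  while IFS= read -r family; do
    [ -z "$family" ] && continue
    # Covered: the name appears as a whole word in some rule file.
    if grep -rqw -- "$family" "$rules_dir"; then
      continue
    fi
    # Exempt: first field matches and the tag is a known category.
    if awk -v f="$family" '
         $1 == f && $2 ~ /^(RED|PROCESS|DEBUG|COVERED)$/ { found = 1 }
         END { exit !found }' "$exemptions"; then
      continue
    fi
    echo "$family"
    status=1
  done < "$families"
  return "$status"
}
```

A whole-word match is a simplification: rules on histogram families usually reference the `_bucket`, `_sum`, or `_count` series, which this sketch would not count as coverage.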

Service metric families

gordon-data (port 8081)

| Metric family | Type | What it tells you |
|---|---|---|
| gordon_data_klines_written_total | Counter | Kline rows written to market_data.* |
| gordon_data_ingest_ws_reconnects_total | Counter | WebSocket reconnect events per source |
| gordon_data_source_freshness_seconds | Gauge | Seconds since last write per source |
| gordon_data_backfill_jobs_active | Gauge | Active backfill jobs |
| gordon_data_backfill_rows_fetched_total | Counter | Rows fetched in backfill |
| gordon_data_source_quarantined | Gauge | 1 if a source is quarantined |

gordon-risk (port 8082)

| Metric family | Type | What it tells you |
|---|---|---|
| gordon_risk_breaker_state | Gauge | 1 = tripped, per breaker (drawdown/connectivity/vpin/macro/correlation) |
| gordon_risk_halt_latch | Gauge | 1 when halt-latch is active |
| gordon_risk_portfolio_drawdown_pct | Gauge | Peak-to-trough equity drawdown |
| gordon_risk_vpin_value | Gauge | Current VPIN score |
| gordon_risk_scheduler_ticks_total | Counter | Risk evaluation ticks |
| gordon_risk_escalation_steps_total | Counter | Escalation state machine steps |

gordon-manager (port 8083)

| Metric family | Type | What it tells you |
|---|---|---|
| gordon_manager_bots_active | Gauge | Running bot container count |
| gordon_manager_reconciler_drift_total | Counter | Reconciler drift events |
| gordon_manager_deploy_steps_total | Counter | Green/blue deploy steps |
| gordon_manager_backtest_duration_seconds | Histogram | Backtest wall-clock time |
| gordon_manager_ws_subscribers | Gauge | Active WS console subscribers |

gordon-executor (port 8085)

| Metric family | Type | What it tells you |
|---|---|---|
| gordon_executor_orders_submitted_total | Counter | Orders submitted to Binance |
| gordon_executor_orders_filled_total | Counter | Confirmed fills |
| gordon_executor_cap_rejections_total | Counter | Notional cap rejections |
| gordon_executor_reconcile_anomalies_total | Counter | Reconcile anomalies on startup |
| gordon_executor_quarantine_state | Gauge | 1 when executor is quarantined |
| gordon_executor_flatten_steps_total | Counter | Emergency flatten steps |

gordon-bot (port 8084)

| Metric family | Type | What it tells you |
|---|---|---|
| gordon_bot_candles_processed_total | Counter | Candles evaluated by the strategy loop |
| gordon_bot_intents_emitted_total | Counter | Trading intents emitted |
| gordon_bot_warmup_duration_seconds | Histogram | Warmup call latency |
| gordon_bot_lease_renewals_total | Counter | Advisory lock renewals |
| gordon_bot_strategy_errors_total | Counter | Strategy evaluation errors |

Alert rules

Alert rules live in srv-apps/services/prometheus/rules/gordon-*.yml, one file per service. The alerts-coverage gate enforces that every metric family above is either paired with a rule or exempted.

Key alerts:

| Alert | Condition | Severity |
|---|---|---|
| GordonServiceDown | up == 0 for 2 minutes | critical |
| GordonHaltLatchActive | gordon_risk_halt_latch == 1 for 1 minute | critical |
| GordonBreakerTripped | any gordon_risk_breaker_state == 1 | critical |
| GordonHighDrawdown | gordon_risk_portfolio_drawdown_pct > 15 for 5 minutes | warning |
| GordonCriticalDrawdown | gordon_risk_portfolio_drawdown_pct > 20 for 1 minute | critical |
| GordonDataSourceStale | gordon_data_source_freshness_seconds > 300 | warning |
| GordonExecutorQuarantined | gordon_executor_quarantine_state == 1 | critical |
| GordonSourceQuarantined | gordon_data_source_quarantined == 1 | warning |
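
Each row above corresponds to one Prometheus alerting rule: an expression, an optional hold duration (`for`), and a severity. The sketch below prints what such a rule might look like as YAML. `render_rule` is a hypothetical helper, and the group name and the `severity` label key are assumptions; the actual files under srv-apps/services/prometheus/rules/ may be laid out differently.

```bash
# Hypothetical helper: print a one-rule Prometheus rule file as YAML.
# Args: alert name, PromQL expression, hold duration ("" for none), severity.
render_rule() {
  local alert=$1 expr=$2 hold=$3 severity=$4
  printf 'groups:\n'
  printf '  - name: gordon-example   # assumed group name\n'
  printf '    rules:\n'
  printf '      - alert: %s\n' "$alert"
  printf '        expr: %s\n' "$expr"
  if [ -n "$hold" ]; then
    printf '        for: %s\n' "$hold"
  fi
  printf '        labels:\n'
  printf '          severity: %s\n' "$severity"
}

# Example: the GordonHaltLatchActive row.
render_rule GordonHaltLatchActive 'gordon_risk_halt_latch == 1' 1m critical
```

If promtool is available, `promtool check rules <file>` can validate a rule file before it is pushed.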

Grafana

Grafana runs on srv-apps. Dashboards:

  • Gordon Overview — service up/down, halt state, drawdown, open positions.
  • Gordon Data — source freshness per source, klines write rate, backfill progress.
  • Gordon Executor — order submit rate, fill rate, cap rejection rate, reconcile anomalies.
  • Gordon Risk — breaker states timeline, VPIN value, escalation steps.
  • Gordon Bot — candle processing rate, intent rate, warmup durations.

Loki

Loki runs on srv-apps. Every v7 service emits structured JSON logs with trace_id, service, event, and code fields. Useful LogQL:

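```logql
# All errors from gordon-executor. LogQL has no time-range stage: set the
# 15-minute window in the Grafana time picker or in the client.
{service="gordon-executor"} |= `"level":"error"`

# Halt-latch events
{service="gordon-risk"} |= `halt_latch`

# Specific error code
{service="gordon-bot"} |= `BOT_WARMUP_INCOMPLETE`

# Trace correlation across services. The stream selector needs at least one
# non-empty matcher, so match all gordon-* services rather than using {}.
{service=~"gordon-.+"} |= `"trace_id":"<uuid>"`
```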
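
From a terminal, the same queries can be run through logcli, where the time window is a flag rather than part of the query. This is a sketch under assumptions: logcli is installed, `LOKI_ADDR` points at the Loki instance on srv-apps (port 3100 is Loki's default and is assumed here), and both helper names are hypothetical.

```bash
# Build the trace-correlation query for one trace id.
# The selector matches every service whose label starts with "gordon-".
trace_query() {
  printf '{service=~"gordon-.+"} |= `"trace_id":"%s"`\n' "$1"
}

# Fetch up to 200 matching log lines from the last hour.
trace_logs() {
  logcli query --since=1h --limit=200 "$(trace_query "$1")"
}

# Usage:
#   export LOKI_ADDR=http://srv-apps:3100   # assumed address
#   trace_logs 123e4567-e89b-12d3-a456-426614174000
```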

Healthz and readyz probes

Every v7 service exposes:

  • GET /healthz — liveness. Returns 200 if the process is alive. Returns 503 if the service failed startup or is in a fatal state.
  • GET /readyz — readiness. Returns 200 only when the service has completed warmup, acquired leases, and is ready to serve traffic. Returns 503 if the role probe failed, warmup is incomplete, or a source is stale past its budget.

Quick multi-service health check:

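```bash
# Print ok/FAIL per service port; -o /dev/null discards the response body.
for port in 8081 8082 8083 8084 8085; do
  printf 'port %s: ' "$port"
  curl -fsS -o /dev/null --max-time 2 "http://srv-apps:${port}/healthz" && echo ok || echo FAIL
done
```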
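
A service can be alive but not yet ready, so it can help to probe /readyz as well and to tell a 503 apart from a connection failure. The following is a sketch; the helper names are hypothetical, and the service-to-port mapping is the one listed under "Service metric families".

```bash
# Map the HTTP status code of a probe to a short verdict.
# curl reports 000 when it could not get a response at all.
probe_verdict() {
  case "$1" in
    200) echo ok ;;
    503) echo unavailable ;;
    000) echo unreachable ;;
    *)   echo "unexpected($1)" ;;
  esac
}

# Print one line per service for the given probe path (/healthz or /readyz).
probe_all() {
  local path=$1 entry name port code
  for entry in data:8081 risk:8082 manager:8083 bot:8084 executor:8085; do
    name=${entry%%:*}
    port=${entry##*:}
    # -w prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' \
      "http://srv-apps:${port}${path}") || true
    printf 'gordon-%s %s %s\n' "$name" "$path" "$(probe_verdict "$code")"
  done
}

# Usage: probe_all /readyz
```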

Daily watch list

  1. All five services report 200 on /healthz.
  2. gordon_risk_halt_latch is 0.
  3. gordon_risk_portfolio_drawdown_pct is below the 15 warning threshold (GordonHighDrawdown).
  4. gordon_data_source_freshness_seconds is within cadence for every active source.
  5. gordon_executor_quarantine_state is 0.
  6. Loki shows no unexpected level=error events in the last hour.
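
Items 2 to 5 can be checked in one pass against the Prometheus HTTP API by running expressions that return no series when everything is healthy. This is a sketch under assumptions: jq is installed, `PROM_URL` points at Prometheus (port 9090 is the Prometheus default and is assumed here), the helper names are hypothetical, and the drawdown and freshness thresholds are borrowed from the Key alerts table as stand-ins for the per-source cadence.

```bash
# PromQL expressions that return no series when the watch list is clean.
watchlist_queries() {
  cat <<'EOF'
gordon_risk_halt_latch != 0
gordon_risk_portfolio_drawdown_pct > 15
gordon_data_source_freshness_seconds > 300
gordon_executor_quarantine_state == 1
EOF
}

# Print "<matching series count><TAB><query>" for each expression.
# A count of 0 means the item is clean; "?" means the query failed.
watchlist_check() {
  local query count
  watchlist_queries | while IFS= read -r query; do
    count=$(curl -fsS --max-time 5 --get \
      --data-urlencode "query=${query}" "${PROM_URL}/api/v1/query" |
      jq '.data.result | length') || true
    printf '%s\t%s\n' "${count:-?}" "$query"
  done
}

# Usage:
#   PROM_URL=http://srv-apps:9090 watchlist_check   # assumed address
```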
