Skip to content

Schema restore

When to use: the trading.* schema on srv-apps is gone, corrupted, or unreachable. Pi SSD failure, accidental DROP TABLE, accidental DROP DATABASE, or a partial migration that left the schema in a bad state.

This runbook restores trading.* only. market_data.* is re-fetchable via make dev-seed / gordon-data backfill and is not covered here.

Time budget: approximately 15 minutes of operator time on healthy hardware.

Escalation: if any step's actual output differs from "Expected output," stop. Do not improvise. Take a screenshot, file a postmortem, and fall back to the previous-known-good snapshot.

Prerequisites

  • Shell access to srv-core (the restic backup destination) and srv-apps (the recovery host).
  • The restic password from Ansible vault or password manager.
  • Postgres running on the recovery host.
bash
ssh srv-core 'echo srv-core ok'
ssh srv-apps 'echo srv-apps ok'

Procedure

1. List available snapshots on srv-core

bash
ssh srv-core
sudo -i

export RESTIC_REPOSITORY="/mnt/backups/srv-apps"
export RESTIC_PASSWORD='<from vault>'

restic snapshots --tag srv-apps | head -20

Expected output: a table of snapshots, most-recent first, with srv-apps and scheduled tags. Hourly cadence — one snapshot per hour for the last 7 days.

ID        Time                 Host        Tags                  Paths
--------  -------------------  ----------  --------------------  -----------------------------------------
abc12345  2026-05-05 14:05:00  srv-apps    srv-apps,scheduled    /var/lib/postgres-dumps
def67890  2026-05-05 13:05:00  srv-apps    srv-apps,scheduled    /var/lib/postgres-dumps

Pick the most recent snapshot strictly older than the time the corruption happened. Note the ID.

2. Restore the dump from restic to a staging area

Still on srv-core as root:

bash
SNAPSHOT_ID=<paste-snapshot-id>
STAGING="/tmp/gordon-restore-$(date +%s)"
mkdir -p "$STAGING"

restic restore "$SNAPSHOT_ID" \
  --target "$STAGING" \
  --include /var/lib/postgres-dumps/gordon-trading.dump

ls -lh "$STAGING/var/lib/postgres-dumps/"

Expected output: a single file gordon-trading.dump sized 500 KiB to 5 MiB. If missing or under 1 KiB, stop — pick a different snapshot.

3. Copy the dump to srv-apps

bash
DUMP_PATH="$STAGING/var/lib/postgres-dumps/gordon-trading.dump"
scp "$DUMP_PATH" srv-apps:/tmp/gordon-trading.dump

Expected output: gordon-trading.dump 100% <size> with no errors.

4. Confirm Postgres is running on srv-apps

bash
ssh srv-apps
sudo -i

docker ps --filter name=postgres --format '{{.Names}}\t{{.Status}}'

Expected output:

postgres    Up X minutes (healthy)

If Postgres is down:

bash
cd /home/david/srv-apps
docker compose up -d postgres
docker compose logs --tail 50 postgres

If the gordon database does not exist (replacement-hardware scenario):

bash
docker exec -i postgres psql -U postgres -c \
  "SELECT 1 FROM pg_database WHERE datname='gordon';" | grep -q 1 || \
  docker exec -i postgres psql -U postgres -c "CREATE DATABASE gordon;"

If gordon-migrate has not been applied yet, run it now — the dump assumes DB roles exist:

bash
ansible-playbook playbooks/srv-apps.yml --tags docker-stack \
  -e '{"docker_stack_compose_files": ["docker-compose.gordon.yml"]}' \
  -e 'docker_stack_force_pull=true'

Wait for gordon-migrate to exit 0.

5. Run the restore script

Still on srv-apps with the dump at /tmp/gordon-trading.dump:

bash
git -C /opt/gordon-workspace pull || \
  git clone https://github.com/dlepaux/gordon-workspace /opt/gordon-workspace

POSTGRES_PASSWORD=$(grep '^POSTGRES_PASSWORD=' /home/david/srv-apps/.env | cut -d= -f2-)

docker run --rm \
  --network srv-apps_webnet \
  -e PGHOST=postgres \
  -e PGPORT=5432 \
  -e PGUSER=postgres \
  -e PGPASSWORD="$POSTGRES_PASSWORD" \
  -e PGDATABASE=gordon \
  -v /tmp/gordon-trading.dump:/dump.bin:ro \
  -v /opt/gordon-workspace/scripts/restore-trading-schema.sh:/restore.sh:ro \
  -v /tmp:/tmp \
  --entrypoint /bin/bash \
  postgres:16 \
  /restore.sh /dump.bin

You will be prompted to type yes. Type it and press enter.

Expected output (abridged):

[INFO ] Pre-flight checks
[ OK  ] Dump file present: /dump.bin (1474832 bytes)
[ OK  ] Postgres reachable at postgres:5432/gordon
[ OK  ] Dump contains trading.* objects
[WARN ] About to DROP SCHEMA trading CASCADE then pg_restore
Type 'yes' to proceed: yes
[ OK  ] trading schema dropped
[ OK  ] pg_restore completed
[ OK  ] Restore complete: trading.* restored from /dump.bin
[INFO ] Next: start the runtime services and confirm the manager can list bots.

If the script exits non-zero, stop. Read the trap message. Common recovery: psql -c 'DROP SCHEMA IF EXISTS trading CASCADE;' and re-run with a different snapshot.

6. Verify services see the restored schema

bash
curl -fsS http://localhost:8083/api/bots | jq '. | length'
curl -fsS http://localhost:8082/healthz
curl -fsS http://localhost:3000/api/equity?run_id=<latest-run-id> | jq '. | length'

Expected: HTTP 200 on every endpoint with JSON arrays matching the row counts from the restore output.

If any service returns 500 / "schema does not exist" / "relation not found":

  1. Check container logs: docker compose logs --tail 100 gordon-manager.

  2. Verify role grants — re-run gordon-migrate if grants are missing:

    bash
    docker compose run --rm gordon-migrate

7. Close out

bash
# Capture post-restore row counts for postmortem
docker exec postgres psql -U postgres -d gordon -c "
  SELECT relname, reltuples::bigint AS rows
  FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE n.nspname = 'trading' AND c.relkind = 'r'
  ORDER BY relname;
" | tee /tmp/gordon-trading-post-restore-$(date +%s).txt

# Verify the next hourly backup picks up the restored schema
sudo systemctl list-timers '*restic*'
sudo cat /var/log/restic/backup.log | tail -50

# Clean up staging on srv-core
ssh srv-core 'sudo rm -rf /tmp/gordon-restore-*'

# Clean up dump on srv-apps
sudo rm /tmp/gordon-trading.dump

File a postmortem under docs/postmortems/ using docs/conventions/postmortem-template.md. Include: pre-restore row counts, post-restore row counts, snapshot ID, time-to-restore, and anything in this runbook that did not match reality.

Drill cadence

This restore must be drilled once before the live cutover and once per quarter thereafter. The drill must use this runbook end-to-end against a disposable Postgres. Use RESTORE_SKIP_CONFIRM=1 to skip the interactive prompt during drills:

bash
RESTORE_SKIP_CONFIRM=1 /restore.sh /dump.bin

Drill log goes in plan/research/restore-drill-<date>.md. A backup that has never been restored is a placebo.

Gordon — keep compounding without blowing up