Schema restore
When to use: the trading.* schema on srv-apps is gone, corrupted, or unreachable. Pi SSD failure, accidental DROP TABLE, accidental DROP DATABASE, or a partial migration that left the schema in a bad state.
This runbook restores trading.* only. market_data.* is re-fetchable via make dev-seed / gordon-data backfill and is not covered here.
Time budget: approximately 15 minutes of operator time on healthy hardware.
Escalation: if any step's actual output differs from "Expected output," stop. Do not improvise. Take a screenshot, file a postmortem, and fall back to the previous-known-good snapshot.
Prerequisites
- Shell access to srv-core (the restic backup destination) and srv-apps (the recovery host).
- The restic password from Ansible vault or password manager.
- Postgres running on the recovery host.
ssh srv-core 'echo srv-core ok'
ssh srv-apps 'echo srv-apps ok'Procedure
1. List available snapshots on srv-core
ssh srv-core
sudo -i
export RESTIC_REPOSITORY="/mnt/backups/srv-apps"
export RESTIC_PASSWORD='<from vault>'
restic snapshots --tag srv-apps | head -20Expected output: a table of snapshots, most-recent first, with srv-apps and scheduled tags. Hourly cadence — one snapshot per hour for the last 7 days.
ID Time Host Tags Paths
-------- ------------------- ---------- -------------------- -----------------------------------------
abc12345 2026-05-05 14:05:00 srv-apps srv-apps,scheduled /var/lib/postgres-dumps
def67890 2026-05-05 13:05:00 srv-apps srv-apps,scheduled /var/lib/postgres-dumpsPick the most recent snapshot strictly older than the time the corruption happened. Note the ID.
2. Restore the dump from restic to a staging area
Still on srv-core as root:
SNAPSHOT_ID=<paste-snapshot-id>
STAGING="/tmp/gordon-restore-$(date +%s)"
mkdir -p "$STAGING"
restic restore "$SNAPSHOT_ID" \
--target "$STAGING" \
--include /var/lib/postgres-dumps/gordon-trading.dump
ls -lh "$STAGING/var/lib/postgres-dumps/"Expected output: a single file gordon-trading.dump sized 500 KiB to 5 MiB. If missing or under 1 KiB, stop — pick a different snapshot.
3. Copy the dump to srv-apps
DUMP_PATH="$STAGING/var/lib/postgres-dumps/gordon-trading.dump"
scp "$DUMP_PATH" srv-apps:/tmp/gordon-trading.dumpExpected output: gordon-trading.dump 100% <size> with no errors.
4. Confirm Postgres is running on srv-apps
ssh srv-apps
sudo -i
docker ps --filter name=postgres --format '{{.Names}}\t{{.Status}}'Expected output:
postgres Up X minutes (healthy)If Postgres is down:
cd /home/david/srv-apps
docker compose up -d postgres
docker compose logs --tail 50 postgresIf the gordon database does not exist (replacement-hardware scenario):
docker exec -i postgres psql -U postgres -c \
"SELECT 1 FROM pg_database WHERE datname='gordon';" | grep -q 1 || \
docker exec -i postgres psql -U postgres -c "CREATE DATABASE gordon;"If gordon-migrate has not been applied yet, run it now — the dump assumes DB roles exist:
ansible-playbook playbooks/srv-apps.yml --tags docker-stack \
-e '{"docker_stack_compose_files": ["docker-compose.gordon.yml"]}' \
-e 'docker_stack_force_pull=true'Wait for gordon-migrate to exit 0.
5. Run the restore script
Still on srv-apps with the dump at /tmp/gordon-trading.dump:
git -C /opt/gordon-workspace pull || \
git clone https://github.com/dlepaux/gordon-workspace /opt/gordon-workspace
POSTGRES_PASSWORD=$(grep '^POSTGRES_PASSWORD=' /home/david/srv-apps/.env | cut -d= -f2-)
docker run --rm \
--network srv-apps_webnet \
-e PGHOST=postgres \
-e PGPORT=5432 \
-e PGUSER=postgres \
-e PGPASSWORD="$POSTGRES_PASSWORD" \
-e PGDATABASE=gordon \
-v /tmp/gordon-trading.dump:/dump.bin:ro \
-v /opt/gordon-workspace/scripts/restore-trading-schema.sh:/restore.sh:ro \
-v /tmp:/tmp \
--entrypoint /bin/bash \
postgres:16 \
/restore.sh /dump.binYou will be prompted to type yes. Type it and press enter.
Expected output (abridged):
[INFO ] Pre-flight checks
[ OK ] Dump file present: /dump.bin (1474832 bytes)
[ OK ] Postgres reachable at postgres:5432/gordon
[ OK ] Dump contains trading.* objects
[WARN ] About to DROP SCHEMA trading CASCADE then pg_restore
Type 'yes' to proceed: yes
[ OK ] trading schema dropped
[ OK ] pg_restore completed
[ OK ] Restore complete: trading.* restored from /dump.bin
[INFO ] Next: start the runtime services and confirm the manager can list bots.If the script exits non-zero, stop. Read the trap message. Common recovery: psql -c 'DROP SCHEMA IF EXISTS trading CASCADE;' and re-run with a different snapshot.
6. Verify services see the restored schema
curl -fsS http://localhost:8083/api/bots | jq '. | length'
curl -fsS http://localhost:8082/healthz
curl -fsS http://localhost:3000/api/equity?run_id=<latest-run-id> | jq '. | length'Expected: HTTP 200 on every endpoint with JSON arrays matching the row counts from the restore output.
If any service returns 500 / "schema does not exist" / "relation not found":
Check container logs:
docker compose logs --tail 100 gordon-manager.Verify role grants — re-run gordon-migrate if grants are missing:
bashdocker compose run --rm gordon-migrate
7. Close out
# Capture post-restore row counts for postmortem
docker exec postgres psql -U postgres -d gordon -c "
SELECT relname, reltuples::bigint AS rows
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'trading' AND c.relkind = 'r'
ORDER BY relname;
" | tee /tmp/gordon-trading-post-restore-$(date +%s).txt
# Verify the next hourly backup picks up the restored schema
sudo systemctl list-timers '*restic*'
sudo cat /var/log/restic/backup.log | tail -50
# Clean up staging on srv-core
ssh srv-core 'sudo rm -rf /tmp/gordon-restore-*'
# Clean up dump on srv-apps
sudo rm /tmp/gordon-trading.dumpFile a postmortem under docs/postmortems/ using docs/conventions/postmortem-template.md. Include: pre-restore row counts, post-restore row counts, snapshot ID, time-to-restore, and anything in this runbook that did not match reality.
Drill cadence
This restore must be drilled once before the live cutover and once per quarter thereafter. The drill must use this runbook end-to-end against a disposable Postgres. Use RESTORE_SKIP_CONFIRM=1 to skip the interactive prompt during drills:
RESTORE_SKIP_CONFIRM=1 /restore.sh /dump.binDrill log goes in plan/research/restore-drill-<date>.md. A backup that has never been restored is a placebo.
Related
- Monitoring — confirming restic timer health
- Live trading — pre-live checklist references restore drill
- Incident response — escalated incident procedures