fix(worker): R1 — RPO-driven customer-backup cadence (close plans.yaml ↔ cadence drift) #103

Merged
mastermanas805 merged 1 commit into master from feat/backup-cadence-rpo-driven on Jun 10, 2026
@mastermanas805

R1 — customer-Postgres backup cadence is now driven by plans.yaml rpo_minutes

Problem

The customer-Postgres backup scheduler (internal/jobs/customer_backup_scheduler.go) picked hourly-vs-daily cadence from a hardcoded switch canonicalTier() plus a hardcoded SQL tier allow-list. The hourly tiers (pro/growth/team) happened to match their advertised rpo_minutes:60, but the coupling was implicit: editing rpo_minutes in ../api/plans.yaml would NOT have moved the cadence. The product could silently over-promise RPO — sell rpo_minutes:60, deliver an effective ~24h RPO — exactly the drift R1 calls out.

Fix

Make the cadence decision read the per-tier rpo_minutes from plans.yaml — the same field surfaced on /api/v1/capabilities — so the cadence that makes the RPO true is derived from the field that promises it. They can no longer drift.

| rpo_minutes | cadence | tiers |
| --- | --- | --- |
| [1, 60] | hourly (every tick) | pro / pro_yearly / growth / team / team_yearly = 60 |
| > 60 | once-daily at the team's deterministic slot | hobby / hobby_plus (+ _yearly) = 1440 |
| 0 (or backup_retention_days == 0) | never enqueued | anonymous / free |
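
The mapping in the table can be sketched as a small pure function. This is an illustration only, not the PR's cadenceForTier: the `Cadence` type, the `cadenceFor` signature, and the retention values used in `main` (other than Pro's 30 days) are assumptions made for the example.

```go
package main

import "fmt"

// Cadence is how often the scheduler enqueues a backup for a resource.
// The type and its constants are hypothetical names for this sketch.
type Cadence int

const (
	CadenceNever  Cadence = iota // never enqueued
	CadenceHourly                // due on every hourly tick
	CadenceDaily                 // due once a day at the team's slot
)

func (c Cadence) String() string {
	switch c {
	case CadenceHourly:
		return "hourly"
	case CadenceDaily:
		return "daily"
	default:
		return "never"
	}
}

// hourlyCutoffMinutes is the largest RPO that an hourly backup can honour.
const hourlyCutoffMinutes = 60

// cadenceFor maps a tier's promised RPO and its retention to a cadence.
// Zero (or negative) RPO or retention means the tier is never backed up.
func cadenceFor(rpoMinutes, retentionDays int) Cadence {
	switch {
	case rpoMinutes <= 0 || retentionDays <= 0:
		return CadenceNever
	case rpoMinutes <= hourlyCutoffMinutes:
		return CadenceHourly
	default:
		return CadenceDaily
	}
}

func main() {
	fmt.Println(cadenceFor(60, 30))  // hourly
	fmt.Println(cadenceFor(1440, 7)) // daily (7-day retention is made up)
	fmt.Println(cadenceFor(0, 0))    // never
	fmt.Println(cadenceFor(60, 0))   // never: zero retention vetoes the RPO
}
```

Keeping the decision a pure function of the two plan fields is what makes the boundary at 60 minutes cheap to test.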

What changed

  • BackupPlanRegistry gains RPOMinutes(tier) int; the common-plans adapter delegates to the pre-existing common/plans.Registry.RPOMinutes. The test fake implements it too.
  • NewCustomerBackupSchedulerWorker(db, plans) now takes the registry — wired in workers.go from the same backupPlans the runner already uses. A nil registry falls back to the legacy hardcoded mapping so a misconfigured boot never silently stops paid backups.
  • The candidate SELECT excludes only the never-backup tiers (anonymous/free) instead of an explicit paid-tier allow-list, so a new paid tier in plans.yaml is covered automatically — this removes the single-site-list failure mode (root rule 18) that originally dropped hobby_plus + every _yearly variant for hours.
  • Per-row retention <= 0 is a second, independent veto (defence-in-depth): a stray rpo_minutes on a zero-retention tier still never enqueues.
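
A minimal sketch of the registry seam and the nil fallback described above. Only the RPOMinutes method name comes from this PR; `planRegistry`, `BackupRetentionDays`, `mapRegistry`, and the shape of the legacy mapping (including stripping a `_yearly` suffix) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// planRegistry stands in for BackupPlanRegistry. BackupRetentionDays is an
// assumed method name; RPOMinutes is the method this PR adds.
type planRegistry interface {
	RPOMinutes(tier string) int
	BackupRetentionDays(tier string) int
}

// plan and mapRegistry are a hypothetical in-memory fake, like a test double.
type plan struct{ rpoMinutes, retentionDays int }

type mapRegistry map[string]plan

func (m mapRegistry) RPOMinutes(tier string) int          { return m[tier].rpoMinutes }
func (m mapRegistry) BackupRetentionDays(tier string) int { return m[tier].retentionDays }

// legacyIsHourly guesses at the legacy hardcoded mapping: a fixed tier set.
func legacyIsHourly(tier string) bool {
	switch strings.TrimSuffix(tier, "_yearly") {
	case "pro", "growth", "team":
		return true
	}
	return false
}

// isHourly reports whether a tier is due on every tick. A nil registry falls
// back to the legacy mapping so paid backups keep running after a bad boot.
func isHourly(reg planRegistry, tier string) bool {
	if reg == nil {
		return legacyIsHourly(tier)
	}
	if reg.BackupRetentionDays(tier) <= 0 {
		return false // independent veto: nothing is retained, so nothing is enqueued
	}
	rpo := reg.RPOMinutes(tier)
	return rpo >= 1 && rpo <= 60
}

func main() {
	reg := mapRegistry{"pro": {60, 30}, "free": {0, 0}}
	fmt.Println(isHourly(reg, "pro")) // true

	// Editing the promised RPO now moves the cadence with it.
	reg["pro"] = plan{1440, 30}
	fmt.Println(isHourly(reg, "pro")) // false

	fmt.Println(isHourly(nil, "pro")) // true (legacy fallback)
}
```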

Idempotency (respected, unchanged)

The atomic INSERT … SELECT … WHERE NOT EXISTS (… INTERVAL '50 minutes') is kept — dedupe lives in the DB, not River UniqueOpts. This survives River periodic + RunOnStart double-ticks on worker restart (the WeeklyDigest-fired-daily failure mode). A second enqueue within the hour collapses to RowsAffected=0.
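
The dedupe pattern can be sketched as follows. The table and column names (`customer_backups`, `resource_id`, `status`, `created_at`), the `execer` interface, and the in-memory `fakeDB` are hypothetical; only the INSERT … SELECT … WHERE NOT EXISTS shape and the 50-minute lookback come from the PR. The sketch uses Exec for brevity, where production code would more likely use ExecContext.

```go
package main

import (
	"database/sql"
	"fmt"
)

// enqueueSQL inserts a pending row only when no row for the same resource was
// created in the last 50 minutes. Schema names are assumptions.
const enqueueSQL = `
INSERT INTO customer_backups (resource_id, status, created_at)
SELECT $1::text, 'pending', now()
WHERE NOT EXISTS (
    SELECT 1 FROM customer_backups
    WHERE resource_id = $1::text
      AND created_at > now() - INTERVAL '50 minutes'
)`

// execer is the slice of *sql.DB this sketch needs; *sql.DB satisfies it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// enqueueOnce reports whether a new pending row was actually inserted.
func enqueueOnce(db execer, resourceID string) (bool, error) {
	res, err := db.Exec(enqueueSQL, resourceID)
	if err != nil {
		return false, fmt.Errorf("enqueue backup for %s: %w", resourceID, err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, fmt.Errorf("rows affected for %s: %w", resourceID, err)
	}
	return n == 1, nil
}

// rowsAffected is a tiny sql.Result used by the fake below.
type rowsAffected int64

func (r rowsAffected) LastInsertId() (int64, error) { return 0, nil }
func (r rowsAffected) RowsAffected() (int64, error) { return int64(r), nil }

// fakeDB mimics the NOT EXISTS guard in memory: the first enqueue for a
// resource inserts, later ones inside the window affect zero rows.
type fakeDB struct{ recent map[string]bool }

func (f *fakeDB) Exec(_ string, args ...any) (sql.Result, error) {
	id, _ := args[0].(string)
	if f.recent[id] {
		return rowsAffected(0), nil
	}
	f.recent[id] = true
	return rowsAffected(1), nil
}

func main() {
	db := &fakeDB{recent: map[string]bool{}}
	first, _ := enqueueOnce(db, "res-1")
	second, _ := enqueueOnce(db, "res-1") // simulated double-tick
	fmt.Println(first, second)            // true false
}
```

Because the check and the insert are one statement, two schedulers ticking at once cannot both see "no recent row" and both insert, which is the property a separate SELECT followed by an INSERT would not give.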

Tests (unit / in-memory only — no real backups triggered)

  • TestScheduler_InsertsForProTierEveryHour + TestScheduler_ProDueEveryHour_AcrossAllHours — a pro resource becomes due every hour (all 24).
  • TestScheduler_FreeTierNeverEnqueued + TestScheduler_FreeTierGate_SkipsEvenIfRowLeaksThrough — a free resource is never enqueued (SQL-filtered AND registry-gated if it leaks past the WHERE).
  • TestScheduler_NoDuplicateEnqueueWithinHour + TestScheduler_DedupeIsAtomicInsert + TestScheduler_DedupeLookbackIsUnderOneHour: no duplicate enqueue within an hour.
  • TestCadenceForTier_RPODriven (registry-iterating, not a hand-typed list) / _HourlyCutoffBoundary / _ZeroRetentionNeverBacksUp / _NilRegistryLegacyFallback.
  • TestCommonPlanRegistryAdapter_Delegates extended to assert RPOMinutes against the real embedded plans.yaml (pro/growth/team = 60, anonymous = 0, hobby > 60).

No integration path that triggers a real pg_dump was added — a real hourly backup against prod customer DBs is destructive-adjacent. The scheduler only INSERTs pending rows; all assertions are against a sqlmock'd DB. No testhelpers helper was added (no per-package-coverage trap).

Gate

make gate green locally — go build ./..., go vet ./..., go test ./... -short -count=1 all pass. internal/jobs package at 96.0% coverage; all new/changed funcs at 100% (cadenceForTier, legacyCadenceForTier, RPOMinutes, ctor, Work).


⚠️ R2 FLAG (NOT fixed here — needs a separate change before the durability copy is honest)

Mongo / Redis / object-storage have ZERO automated backup, despite the marketing promise of "backups + 1-click restore." The entire backup ladder in this repo (customer_backup_scheduler / customer_backup_runner / customer_restore_runner) is Postgres/vector only — it pg_dumps resource_type IN ('postgres','vector') and nothing else. There is no mongodump, no Redis RDB/AOF snapshot, no S3 bucket-to-bucket sync job.

Action: until mongodump + a bucket-sync (and their restore paths) ship, every durability/backup surface — the Pro "30-day backups + 1-click restore" headline, content/llms.txt, the dashboard backup copy, /api/v1/capabilities — must say "Postgres/vector only." A Mongo or object-storage customer who reads "backups included" today gets nothing. Recommend a follow-up ticket (likely api + worker + marketing in one contract PR per root rule 22).

🤖 Generated with Claude Code

…nutes

The customer-Postgres backup scheduler chose hourly-vs-daily cadence from a
hardcoded `switch canonicalTier()` plus a hardcoded SQL tier allow-list. The
hourly tiers (pro/growth/team) happened to match their advertised
`rpo_minutes:60`, but the coupling was implicit: changing rpo_minutes in
plans.yaml would NOT have moved the cadence, so the product could silently
over-promise RPO (effective ~24h vs sold 60m).

Make the cadence RPO-driven from plans.yaml (the same field surfaced on
/api/v1/capabilities), so the cadence that MAKES the RPO true is read from the
field that PROMISES it — they can no longer drift:

  rpo_minutes in [1,60]  → hourly  (pro/growth/team)
  rpo_minutes > 60       → once-daily team slot (hobby/hobby_plus = 1440)
  rpo_minutes == 0 OR backup_retention_days == 0 → never (anonymous/free)

- BackupPlanRegistry grows RPOMinutes(tier); the common-plans adapter and the
  test fake implement it. Registry.RPOMinutes already existed in common/plans.
- Scheduler takes the registry (wired from the same `backupPlans` the runner
  uses); nil registry falls back to the legacy hardcoded mapping so a
  misconfigured boot never silently stops paid backups.
- Candidate SELECT now excludes only the never-backup tiers (anonymous/free)
  instead of an explicit paid-tier allow-list, so a NEW paid tier in
  plans.yaml is covered automatically — removing the single-site-list failure
  mode (root rule 18) that once dropped hobby_plus + every _yearly variant.
- Per-row retention==0 guard is a second, independent veto (defence-in-depth)
  so a stray rpo_minutes on a zero-retention tier still never enqueues.
- Idempotency unchanged: the atomic INSERT … WHERE NOT EXISTS (50-min DB
  lookback) still prevents a duplicate enqueue within the hour even across
  River RunOnStart double-ticks (dedupe lives in the DB, not River UniqueOpts).

Tests prove: a pro resource becomes due every hour (all 24 hours), a free
resource is never enqueued (SQL-filtered AND registry-gated if it leaks), no
duplicate enqueue within an hour, the [1,60]→hourly / >60→daily boundary, the
zero-retention veto, the nil-registry legacy fallback, and the adapter's
RPOMinutes delegation against the real embedded plans.yaml. jobs package at
96.0% coverage; all new/changed functions at 100%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mastermanas805 merged commit 4d82efb into master on Jun 10, 2026
12 checks passed
mastermanas805 deleted the feat/backup-cadence-rpo-driven branch on June 10, 2026 04:19
mastermanas805 added a commit that referenced this pull request Jun 10, 2026
…104)

* feat(worker): R2 — extend customer-backup ladder to mongodb + redis

The backup ladder backed up postgres/vector ONLY, but the product sells
"backups + 1-click restore" for ALL paid resources — Mongo/Redis had
ZERO automated backup (worker #103 note + GAP-AUDIT-2026-06-10).

Mirrors the existing pg/vector path (cadence/retention/tier-gating/S3
sink/<1KB sanity via the shared pipeline) for the new types:

- backup_dump.go: mongoDumpRunner (`mongodump --archive`) + redisDumpRunner
  (`redis-cli --rdb -`) behind the same seam as pgDumpRunner. Both write a
  RAW archive into the runner's gzip pipeline (NOT mongodump --gzip — the
  pipeline owns compression, keeping one object layout + sha256 + restore
  story). Secret hygiene: mongo URI via a 0600 --config file, redis password
  via REDISCLI_AUTH env — credential never in argv (mirrors SEC-WORKER
  FINDING-2 PGPASSWORD on the pg path).
- customer_backup_runner.go: dispatch on resource_type; nil/unsupported is a
  fail-open 'config' failure, never a panic.
- customer_backup_scheduler.go: resource_type filter grows to
  ('postgres','vector','mongodb','redis'); cadence stays tier-driven so
  free/anonymous are still never enqueued.
- customer_restore_runner.go: mongorestore --archive --drop branch (the
  rewind analogue of pg_restore --clean --if-exists). Redis RESTORE is the
  tracked follow-up (RDB restore-in-place needs pod-level access the worker
  lacks) — redis backups are still taken, sha-verified, and downloadable.
- metrics: instant_customer_backup_by_type_total{resource_type,result} —
  per-type success/failure so "Mongo healthy, Redis silently failing" is
  visible (aggregate counters retained). Primed for all pairs in the test.
- Dockerfile: apk add mongodb-tools redis so the new tools exist at runtime
  (postgres:16-alpine ships pg_dump only).
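
The runner's dispatch on resource_type, with unsupported types surfacing as a config-class failure instead of a panic, might look roughly like this. The `dumpRunner` struct, `pickRunner`, `errConfig`, and `isConfigFailure` are hypothetical names; the type-to-tool pairing follows the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// errConfig marks failures caused by configuration, not by the dump itself.
var errConfig = errors.New("config failure")

// dumpRunner is a placeholder for the real runner seam; it only names a tool.
type dumpRunner struct{ tool string }

// defaultRunners pairs each backed-up resource_type with its dump tool.
func defaultRunners() map[string]*dumpRunner {
	pg := &dumpRunner{tool: "pg_dump"}
	return map[string]*dumpRunner{
		"postgres": pg,
		"vector":   pg, // vector resources go through the Postgres path
		"mongodb":  {tool: "mongodump"},
		"redis":    {tool: "redis-cli"},
	}
}

// pickRunner returns the runner for a resource type. A missing or nil entry
// is reported as a config failure so the job can be marked failed cleanly.
func pickRunner(runners map[string]*dumpRunner, resourceType string) (*dumpRunner, error) {
	r, ok := runners[resourceType]
	if !ok || r == nil {
		return nil, fmt.Errorf("%w: no dump runner for resource_type %q", errConfig, resourceType)
	}
	return r, nil
}

// isConfigFailure reports whether err is (or wraps) errConfig.
func isConfigFailure(err error) bool { return errors.Is(err, errConfig) }

func main() {
	runners := defaultRunners()

	r, _ := pickRunner(runners, "mongodb")
	fmt.Println(r.tool) // mongodump

	_, err := pickRunner(runners, "object-storage")
	fmt.Println(isConfigFailure(err)) // true

	_, err = pickRunner(nil, "postgres") // nil map: still an error, not a panic
	fmt.Println(isConfigFailure(err))    // true
}
```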

Tests: scheduler enqueues mongo/redis for pro + never for free; runner
dispatches to the right dumper + increments the per-type metric; unsupported
type fails clean; mongo restore round-trips through gunzip; redis restore
marks failed (guard before download); dump runners keep secrets out of argv;
splitRedisURL + writeMongoConfig units. Gate green (build+vet+test -short),
gofmt clean, golangci 0 issues.
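
As an illustration of keeping credentials out of argv, here is a sketch of how the two commands could be assembled. writeMongoConfig is named in this commit, but its body and signature here are guesses; `mongoDumpCmd`, `redisDumpCmd`, and `leaksInArgv` are hypothetical helpers, and the hosts and password are made up. Quoting the URI with `%q` is a simplification that should be fine for ordinary connection strings.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// writeMongoConfig writes the URI into a 0600 temp file that mongodump can
// read via --config, and returns a cleanup func that removes the file.
func writeMongoConfig(uri string) (path string, cleanup func(), err error) {
	f, err := os.CreateTemp("", "mongodump-*.yaml")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.Remove(f.Name()) }
	err = f.Chmod(0o600)
	if err == nil {
		_, err = fmt.Fprintf(f, "uri: %q\n", uri)
	}
	if err == nil {
		err = f.Sync()
	}
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		cleanup() // never leave a half-written credential file behind
		return "", nil, err
	}
	return f.Name(), cleanup, nil
}

// mongoDumpCmd streams a raw archive to stdout; the URI lives in the file.
func mongoDumpCmd(configPath string) *exec.Cmd {
	return exec.Command("mongodump", "--config", configPath, "--archive")
}

// redisDumpCmd streams an RDB to stdout; the password travels in the
// REDISCLI_AUTH environment variable rather than in a flag.
func redisDumpCmd(host, port, password string) *exec.Cmd {
	cmd := exec.Command("redis-cli", "-h", host, "-p", port, "--rdb", "-")
	cmd.Env = append(os.Environ(), "REDISCLI_AUTH="+password)
	return cmd
}

// leaksInArgv reports whether any argument contains the secret.
func leaksInArgv(args []string, secret string) bool {
	for _, a := range args {
		if strings.Contains(a, secret) {
			return true
		}
	}
	return false
}

func main() {
	path, cleanup, err := writeMongoConfig("mongodb://user:s3cret@db.example.internal:27017/app")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	defer cleanup()

	mongo := mongoDumpCmd(path)
	redis := redisDumpCmd("cache.example.internal", "6379", "s3cret")
	fmt.Println(leaksInArgv(mongo.Args, "s3cret"), leaksInArgv(redis.Args, "s3cret")) // false false
}
```

The commands are only constructed here, never started, so the check on `Args` works even on a machine without either tool installed.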

Operator follow-ups (separate infra repo, no auto-apply): NR alert + Prom
rule + dashboard tile + METRICS-CATALOG row for the new counter (rule 25);
object-storage bucket backup (DO Spaces versioning/replication); Redis
restore; durability COPY must say "Postgres/vector/mongo only" (Redis =
backup-only) until Redis restore ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(jobs): cover mongo/redis dump+restore error paths (100%-patch gate)

The R2 mongo/redis backup+restore code left its failure arms untested,
failing the diff-cover gate at 84%:

- backup_dump.go: mongodump fail-open --uri-in-argv branch, mongodump
  exec-error return, all four writeMongoConfig error arms
  (CreateTemp/chmod/write/sync), redis-cli --tls arm (rediss://),
  redis-cli fail-open `-u <uri>` branch, redis-cli exec-error return.
- customer_restore_runner.go: mongorestore fail-open --uri branch and
  mongorestore exec-error return.

The chmod/write/sync failures cannot be forced against a healthy temp
file, so writeMongoConfig's fs ops now route through injectable
package-var seams (mongoCfgCreateTemp/Chmod/WriteURI/Sync) — same
pattern as txtLookupFunc / deployNotifyResolver. Production behavior
unchanged; tests swap + restore.
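
The package-var seam pattern can be sketched like this. The seam names (`cfgCreateTemp`, `cfgChmod`, `cfgWrite`, `cfgSync`), `writeConfig`, `forceSyncFailure`, and `discard` are hypothetical stand-ins, not the mongoCfg* variables from the commit.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Seams: production code calls the filesystem through these package-level
// variables, so a test can swap any one of them for a failing stub.
var (
	cfgCreateTemp = os.CreateTemp
	cfgChmod      = func(f *os.File, mode os.FileMode) error { return f.Chmod(mode) }
	cfgWrite      = func(f *os.File, s string) error { _, err := f.WriteString(s); return err }
	cfgSync       = func(f *os.File) error { return f.Sync() }
)

// writeConfig writes contents to a 0600 temp file and returns its path.
func writeConfig(contents string) (string, error) {
	f, err := cfgCreateTemp("", "cfg-*.yaml")
	if err != nil {
		return "", fmt.Errorf("create temp: %w", err)
	}
	fail := func(step string, err error) (string, error) {
		f.Close()
		os.Remove(f.Name())
		return "", fmt.Errorf("%s: %w", step, err)
	}
	if err := cfgChmod(f, 0o600); err != nil {
		return fail("chmod", err)
	}
	if err := cfgWrite(f, contents); err != nil {
		return fail("write", err)
	}
	if err := cfgSync(f); err != nil {
		return fail("sync", err)
	}
	if err := f.Close(); err != nil {
		os.Remove(f.Name())
		return "", fmt.Errorf("close: %w", err)
	}
	return f.Name(), nil
}

var errInjected = errors.New("injected failure")

// forceSyncFailure swaps the sync seam for a stub that always fails and
// returns a func that restores the original, as a test would do.
func forceSyncFailure() (restore func()) {
	prev := cfgSync
	cfgSync = func(*os.File) error { return errInjected }
	return func() { cfgSync = prev }
}

// discard removes a file written by writeConfig.
func discard(path string) { os.Remove(path) }

func main() {
	restore := forceSyncFailure()
	_, err := writeConfig("uri: example\n")
	restore()
	fmt.Println(err) // sync: injected failure

	path, err := writeConfig("uri: example\n")
	fmt.Println(err == nil) // true
	discard(path)
}
```

The trade-off is mutable package state: tests that swap a seam must restore it and should not run in parallel with other tests that touch the same variable.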

Exec-error returns are exercised via fake failing CLI scripts on PATH
(installFakeFailingBinary), mirroring installFakeBinary. Local profile
now shows zero uncovered blocks in backup_dump.go and both flagged
restore-runner ranges covered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>