fix(worker): R1 — RPO-driven customer-backup cadence (close plans.yaml ↔ cadence drift) by mastermanas805 · Pull Request #103 · InstaNode-dev/worker

mastermanas805 · 2026-06-10T03:58:29Z

R1 — customer-Postgres backup cadence is now driven by `plans.yaml` `rpo_minutes`

Problem

The customer-Postgres backup scheduler (internal/jobs/customer_backup_scheduler.go) picked hourly-vs-daily cadence from a hardcoded switch canonicalTier() plus a hardcoded SQL tier allow-list. The hourly tiers (pro/growth/team) happened to match their advertised rpo_minutes:60, but the coupling was implicit: editing rpo_minutes in ../api/plans.yaml would NOT have moved the cadence. The product could silently over-promise RPO — sell rpo_minutes:60, deliver an effective ~24h RPO — exactly the drift R1 calls out.

Fix

Make the cadence decision read the per-tier rpo_minutes from plans.yaml — the same field surfaced on /api/v1/capabilities — so the cadence that makes the RPO true is derived from the field that promises it. They can no longer drift.

| `rpo_minutes` | cadence | tiers |
| --- | --- | --- |
| `[1, 60]` | hourly (every tick) | pro / pro_yearly / growth / team / team_yearly = 60 |
| `> 60` | once-daily at the team's deterministic slot | hobby / hobby_plus (+ _yearly) = 1440 |
| `0` (or `backup_retention_days == 0`) | never enqueued | anonymous / free |
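The mapping in the table can be sketched as a pure function. This is a minimal illustration, not the code in `customer_backup_scheduler.go`: the names `Cadence`, `cadenceFromRPO`, `dailySlot` and `dueThisTick` are hypothetical, and the hash-based daily slot is only one assumed way to get a "deterministic slot" per team. The PR does not say how the real slot is computed.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Cadence is a hypothetical enum for the three outcomes in the table.
type Cadence int

const (
	CadenceNever Cadence = iota
	CadenceHourly
	CadenceDaily
)

// hourlyCutoffMinutes is the [1, 60] upper bound from the table.
const hourlyCutoffMinutes = 60

// cadenceFromRPO maps a tier's rpo_minutes and backup_retention_days to a
// cadence. Treating negative values like zero is an assumption of this sketch.
func cadenceFromRPO(rpoMinutes, retentionDays int) Cadence {
	switch {
	case rpoMinutes <= 0 || retentionDays <= 0:
		return CadenceNever
	case rpoMinutes <= hourlyCutoffMinutes:
		return CadenceHourly
	default:
		return CadenceDaily
	}
}

// dailySlot picks a stable hour (0-23) per team by hashing the team ID.
// ASSUMPTION: the real scheduler's slot function is not shown in the PR.
func dailySlot(teamID string) int {
	h := fnv.New32a()
	h.Write([]byte(teamID))
	return int(h.Sum32() % 24)
}

// dueThisTick reports whether a resource should be enqueued on the tick
// that fires during hourUTC.
func dueThisTick(c Cadence, teamID string, hourUTC int) bool {
	switch c {
	case CadenceHourly:
		return true
	case CadenceDaily:
		return dailySlot(teamID) == hourUTC
	default:
		return false
	}
}

// countDueHours counts how many of the 24 hourly ticks would enqueue.
func countDueHours(c Cadence, teamID string) int {
	n := 0
	for hour := 0; hour < 24; hour++ {
		if dueThisTick(c, teamID, hour) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("pro due hours:", countDueHours(cadenceFromRPO(60, 30), "team-a"))
	fmt.Println("hobby due hours:", countDueHours(cadenceFromRPO(1440, 7), "team-a"))
	fmt.Println("free due hours:", countDueHours(cadenceFromRPO(0, 0), "team-a"))
}
```

The retention values passed in `main` are illustrative inputs, not values read from `plans.yaml`.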

What changed

- BackupPlanRegistry gains RPOMinutes(tier) int; the common-plans adapter delegates to the pre-existing common/plans.Registry.RPOMinutes. The test fake implements it too.
- NewCustomerBackupSchedulerWorker(db, plans) now takes the registry, wired in workers.go from the same backupPlans the runner already uses. A nil registry falls back to the legacy hardcoded mapping so a misconfigured boot never silently stops paid backups.
- The candidate SELECT excludes only the never-backup tiers (anonymous/free) instead of an explicit paid-tier allow-list, so a new paid tier in plans.yaml is covered automatically. This removes the single-site-list failure mode (root rule 18) that originally dropped hobby_plus + every _yearly variant for hours.
- Per-row retention <= 0 is a second, independent veto (defence-in-depth): a stray rpo_minutes on a zero-retention tier still never enqueues.
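The registry lookup, the nil-registry fallback and the retention veto described above could fit together roughly as follows. This is a sketch under assumptions: `BackupPlanRegistry`, `RPOMinutes`, `cadenceForTier` and `legacyCadenceForTier` are names that appear in this PR, but their signatures and bodies here are guesses. `BackupRetentionDays`, `fakeRegistry`, `sampleRegistry`, the `stray` tier and the retention numbers are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// BackupPlanRegistry mirrors the interface named in the PR. RPOMinutes is
// from the PR; BackupRetentionDays is an assumed method name.
type BackupPlanRegistry interface {
	RPOMinutes(tier string) int
	BackupRetentionDays(tier string) int
}

type tierPlan struct {
	rpoMinutes    int
	retentionDays int
}

// fakeRegistry is an in-memory stand-in for the plans.yaml-backed registry.
type fakeRegistry map[string]tierPlan

func (f fakeRegistry) RPOMinutes(tier string) int          { return f[tier].rpoMinutes }
func (f fakeRegistry) BackupRetentionDays(tier string) int { return f[tier].retentionDays }

// sampleRegistry returns illustrative values; "stray" models a tier with an
// rpo_minutes set but zero retention.
func sampleRegistry() BackupPlanRegistry {
	return fakeRegistry{
		"pro":   {rpoMinutes: 60, retentionDays: 30},
		"hobby": {rpoMinutes: 1440, retentionDays: 7},
		"free":  {rpoMinutes: 0, retentionDays: 0},
		"stray": {rpoMinutes: 60, retentionDays: 0},
	}
}

// legacyCadenceForTier is a guess at the old hardcoded mapping. Stripping
// "_yearly" stands in for canonicalTier(), whose real body is not shown.
func legacyCadenceForTier(tier string) string {
	switch strings.TrimSuffix(tier, "_yearly") {
	case "pro", "growth", "team":
		return "hourly"
	case "hobby", "hobby_plus":
		return "daily"
	default:
		return "never"
	}
}

// cadenceForTier reads the cadence from the registry, falling back to the
// legacy mapping when no registry was wired in at boot.
func cadenceForTier(reg BackupPlanRegistry, tier string) string {
	if reg == nil {
		return legacyCadenceForTier(tier)
	}
	rpo := reg.RPOMinutes(tier)
	// Retention is an independent veto: no retention, no backup.
	if rpo <= 0 || reg.BackupRetentionDays(tier) <= 0 {
		return "never"
	}
	if rpo <= 60 {
		return "hourly"
	}
	return "daily"
}

func main() {
	reg := sampleRegistry()
	for _, tier := range []string{"pro", "hobby", "free", "stray"} {
		fmt.Printf("%s -> %s\n", tier, cadenceForTier(reg, tier))
	}
	fmt.Printf("nil registry, pro_yearly -> %s\n", cadenceForTier(nil, "pro_yearly"))
}
```

In this sketch a tier the registry does not know reads as zero and is never enqueued; whether the real adapter behaves that way is not stated in the PR.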

Idempotency (respected, unchanged)

The atomic INSERT … SELECT … WHERE NOT EXISTS (… INTERVAL '50 minutes') is kept — dedupe lives in the DB, not River UniqueOpts. This survives River periodic + RunOnStart double-ticks on worker restart (the WeeklyDigest-fired-daily failure mode). A second enqueue within the hour collapses to RowsAffected=0.
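To see why a 50-minute lookback dedupes double-ticks without blocking the next hourly tick, here is an in-memory model of the check. It is only a model: the real dedupe is the SQL `INSERT … SELECT … WHERE NOT EXISTS` statement, and `pendingStore`, `enqueueIfAbsent` and `simulateTicks` are hypothetical names. The mutex stands in for the atomicity the single SQL statement provides.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// dedupeLookback matches the 50-minute window named in the PR.
const dedupeLookback = 50 * time.Minute

// pendingStore models the backups table: it remembers when each resource
// last had a pending row inserted.
type pendingStore struct {
	mu           sync.Mutex
	lastEnqueued map[string]time.Time
}

func newPendingStore() *pendingStore {
	return &pendingStore{lastEnqueued: make(map[string]time.Time)}
}

// enqueueIfAbsent inserts a row unless one exists inside the lookback
// window. The return value plays the role of RowsAffected (1 or 0).
func (s *pendingStore) enqueueIfAbsent(resourceID string, now time.Time) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if last, ok := s.lastEnqueued[resourceID]; ok && now.Sub(last) < dedupeLookback {
		return 0
	}
	s.lastEnqueued[resourceID] = now
	return 1
}

// simulateTicks runs ticks for one resource at the given minute offsets
// from an arbitrary start time and returns the rows affected by each.
func simulateTicks(offsetsMinutes ...int) []int {
	store := newPendingStore()
	start := time.Date(2026, 6, 10, 0, 0, 0, 0, time.UTC)
	rows := make([]int, 0, len(offsetsMinutes))
	for _, m := range offsetsMinutes {
		now := start.Add(time.Duration(m) * time.Minute)
		rows = append(rows, store.enqueueIfAbsent("res-1", now))
	}
	return rows
}

func main() {
	// A double-tick at minute 0 (periodic + RunOnStart), then hourly ticks
	// with a stray restart tick at minute 61.
	fmt.Println(simulateTicks(0, 0, 60, 61, 120))
}
```

A window shorter than the tick interval means a tick that fires slightly early still enqueues, while a repeat within the same hour collapses to zero rows. That reading is inferred from the test name `TestScheduler_DedupeLookbackIsUnderOneHour`, not stated outright in the PR.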

Tests (unit / in-memory only — no real backups triggered)

- TestScheduler_InsertsForProTierEveryHour + TestScheduler_ProDueEveryHour_AcrossAllHours: a pro resource becomes due every hour (all 24).
- TestScheduler_FreeTierNeverEnqueued + TestScheduler_FreeTierGate_SkipsEvenIfRowLeaksThrough: a free resource is never enqueued (SQL-filtered AND registry-gated if it leaks past the WHERE).
- TestScheduler_NoDuplicateEnqueueWithinHour + TestScheduler_DedupeIsAtomicInsert + TestScheduler_DedupeLookbackIsUnderOneHour: no duplicate enqueue within an hour.
- TestCadenceForTier_RPODriven (registry-iterating, not a hand-typed list) / _HourlyCutoffBoundary / _ZeroRetentionNeverBacksUp / _NilRegistryLegacyFallback.
- TestCommonPlanRegistryAdapter_Delegates extended to assert RPOMinutes against the real embedded plans.yaml (pro/growth/team = 60, anonymous = 0, hobby > 60).

No integration path that triggers a real pg_dump was added — a real hourly backup against prod customer DBs is destructive-adjacent. The scheduler only INSERTs pending rows; all assertions are against a sqlmock'd DB. No testhelpers helper was added (no per-package-coverage trap).

Gate

make gate green locally — go build ./..., go vet ./..., go test ./... -short -count=1 all pass. internal/jobs package at 96.0% coverage; all new/changed funcs at 100% (cadenceForTier, legacyCadenceForTier, RPOMinutes, ctor, Work).

⚠️ R2 FLAG (NOT fixed here — needs a separate change before the durability copy is honest)

Mongo / Redis / object-storage have ZERO automated backup, despite the marketing promise of "backups + 1-click restore." The entire backup ladder in this repo (customer_backup_scheduler / customer_backup_runner / customer_restore_runner) is Postgres/vector only — it pg_dumps resource_type IN ('postgres','vector') and nothing else. There is no mongodump, no Redis RDB/AOF snapshot, no S3 bucket-to-bucket sync job.

Action: until mongodump + a bucket-sync (and their restore paths) ship, every durability/backup surface — the Pro "30-day backups + 1-click restore" headline, content/llms.txt, the dashboard backup copy, /api/v1/capabilities — must say "Postgres/vector only." A Mongo or object-storage customer who reads "backups included" today gets nothing. Recommend a follow-up ticket (likely api + worker + marketing in one contract PR per root rule 22).

🤖 Generated with Claude Code

…nutes

The customer-Postgres backup scheduler chose hourly-vs-daily cadence from a hardcoded `switch canonicalTier()` plus a hardcoded SQL tier allow-list. The hourly tiers (pro/growth/team) happened to match their advertised `rpo_minutes:60`, but the coupling was implicit: changing rpo_minutes in plans.yaml would NOT have moved the cadence, so the product could silently over-promise RPO (effective ~24h vs sold 60m).

Make the cadence RPO-driven from plans.yaml (the same field surfaced on /api/v1/capabilities), so the cadence that MAKES the RPO true is read from the field that PROMISES it; they can no longer drift:

    rpo_minutes in [1,60] → hourly (pro/growth/team)
    rpo_minutes > 60 → once-daily team slot (hobby/hobby_plus = 1440)
    rpo_minutes == 0 OR backup_retention_days == 0 → never (anonymous/free)

- BackupPlanRegistry grows RPOMinutes(tier); the common-plans adapter and the test fake implement it. Registry.RPOMinutes already existed in common/plans.
- Scheduler takes the registry (wired from the same `backupPlans` the runner uses); nil registry falls back to the legacy hardcoded mapping so a misconfigured boot never silently stops paid backups.
- Candidate SELECT now excludes only the never-backup tiers (anonymous/free) instead of an explicit paid-tier allow-list, so a NEW paid tier in plans.yaml is covered automatically, removing the single-site-list failure mode (root rule 18) that once dropped hobby_plus + every _yearly variant.
- Per-row retention==0 guard is a second, independent veto (defence-in-depth) so a stray rpo_minutes on a zero-retention tier still never enqueues.
- Idempotency unchanged: the atomic INSERT … WHERE NOT EXISTS (50-min DB lookback) still prevents a duplicate enqueue within the hour even across River RunOnStart double-ticks (dedupe lives in the DB, not River UniqueOpts).

Tests prove: a pro resource becomes due every hour (all 24 hours), a free resource is never enqueued (SQL-filtered AND registry-gated if it leaks), no duplicate enqueue within an hour, the [1,60]→hourly / >60→daily boundary, the zero-retention veto, the nil-registry legacy fallback, and the adapter's RPOMinutes delegation against the real embedded plans.yaml. jobs package at 96.0% coverage; all new/changed functions at 100%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…104)

* feat(worker): R2 — extend customer-backup ladder to mongodb + redis

The backup ladder backed up postgres/vector ONLY, but the product sells "backups + 1-click restore" for ALL paid resources; Mongo/Redis had ZERO automated backup (worker #103 note + GAP-AUDIT-2026-06-10).

Mirrors the existing pg/vector path (cadence/retention/tier-gating/S3 sink/<1KB sanity via the shared pipeline) for the new types:

- backup_dump.go: mongoDumpRunner (`mongodump --archive`) + redisDumpRunner (`redis-cli --rdb -`) behind the same seam as pgDumpRunner. Both write a RAW archive into the runner's gzip pipeline (NOT mongodump --gzip: the pipeline owns compression, keeping one object layout + sha256 + restore story). Secret hygiene: mongo URI via a 0600 --config file, redis password via REDISCLI_AUTH env, so the credential is never in argv (mirrors SEC-WORKER FINDING-2 PGPASSWORD on the pg path).
- customer_backup_runner.go: dispatch on resource_type; nil/unsupported is a fail-open 'config' failure, never a panic.
- customer_backup_scheduler.go: resource_type filter grows to ('postgres','vector','mongodb','redis'); cadence stays tier-driven so free/anonymous are still never enqueued.
- customer_restore_runner.go: mongorestore --archive --drop branch (the rewind analogue of pg_restore --clean --if-exists). Redis RESTORE is the tracked follow-up (RDB restore-in-place needs pod-level access the worker lacks); redis backups are still taken, sha-verified, and downloadable.
- metrics: instant_customer_backup_by_type_total{resource_type,result}, a per-type success/failure counter so "Mongo healthy, Redis silently failing" is visible (aggregate counters retained). Primed for all pairs in the test.
- Dockerfile: apk add mongodb-tools redis so the new tools exist at runtime (postgres:16-alpine ships pg_dump only).

Tests: scheduler enqueues mongo/redis for pro + never for free; runner dispatches to the right dumper + increments the per-type metric; unsupported type fails clean; mongo restore round-trips through gunzip; redis restore marks failed (guard before download); dump runners keep secrets out of argv; splitRedisURL + writeMongoConfig units.

Gate green (build+vet+test -short), gofmt clean, golangci 0 issues.

Operator follow-ups (separate infra repo, no auto-apply): NR alert + Prom rule + dashboard tile + METRICS-CATALOG row for the new counter (rule 25); object-storage bucket backup (DO Spaces versioning/replication); Redis restore; durability COPY must say "Postgres/vector/mongo only" (Redis = backup-only) until Redis restore ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(jobs): cover mongo/redis dump+restore error paths (100%-patch gate)

The R2 mongo/redis backup+restore code left its failure arms untested, failing the diff-cover gate at 84%:

- backup_dump.go: mongodump fail-open --uri-in-argv branch, mongodump exec-error return, all four writeMongoConfig error arms (CreateTemp/chmod/write/sync), redis-cli --tls arm (rediss://), redis-cli fail-open `-u <uri>` branch, redis-cli exec-error return.
- customer_restore_runner.go: mongorestore fail-open --uri branch and mongorestore exec-error return.

The chmod/write/sync failures cannot be forced against a healthy temp file, so writeMongoConfig's fs ops now route through injectable package-var seams (mongoCfgCreateTemp/Chmod/WriteURI/Sync), the same pattern as txtLookupFunc / deployNotifyResolver. Production behavior unchanged; tests swap + restore. Exec-error returns are exercised via fake failing CLI scripts on PATH (installFakeFailingBinary), mirroring installFakeBinary.

Local profile now shows zero uncovered blocks in backup_dump.go and both flagged restore-runner ranges covered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
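The "credential never in argv" rule for the Redis path can be illustrated by building the `redis-cli` command without running it. This is a sketch, not the `redisDumpRunner` or `splitRedisURL` from the commit: `buildRedisDumpCmd`, `argvLeaks`, `argvHas` and `envHas` are hypothetical helpers, and the host name and password are made up. `REDISCLI_AUTH`, `-h`, `-p`, `--tls` and `--rdb` are existing `redis-cli` options; the `-` target for stdout is taken from the commit message.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"os/exec"
	"strings"
)

// buildRedisDumpCmd turns a redis:// or rediss:// URL into a redis-cli
// invocation whose argv carries host and port only. The password travels
// in the REDISCLI_AUTH environment variable, so it does not show up in
// process listings. The command is built here, never executed.
func buildRedisDumpCmd(rawURL string) (*exec.Cmd, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, fmt.Errorf("parse redis url: %w", err)
	}
	if u.Scheme != "redis" && u.Scheme != "rediss" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	port := u.Port()
	if port == "" {
		port = "6379" // Redis default port
	}
	args := []string{"-h", u.Hostname(), "-p", port}
	if u.Scheme == "rediss" {
		args = append(args, "--tls")
	}
	args = append(args, "--rdb", "-") // raw RDB stream to stdout, per the commit

	cmd := exec.Command("redis-cli", args...)
	cmd.Env = os.Environ()
	if u.User != nil {
		if pw, ok := u.User.Password(); ok {
			cmd.Env = append(cmd.Env, "REDISCLI_AUTH="+pw)
		}
	}
	return cmd, nil
}

// argvLeaks reports whether any argument contains the secret.
func argvLeaks(cmd *exec.Cmd, secret string) bool {
	for _, a := range cmd.Args {
		if strings.Contains(a, secret) {
			return true
		}
	}
	return false
}

// argvHas reports whether argv contains the exact flag.
func argvHas(cmd *exec.Cmd, flag string) bool {
	for _, a := range cmd.Args {
		if a == flag {
			return true
		}
	}
	return false
}

// envHas reports whether the command environment contains the exact entry.
func envHas(cmd *exec.Cmd, entry string) bool {
	for _, e := range cmd.Env {
		if e == entry {
			return true
		}
	}
	return false
}

func main() {
	cmd, err := buildRedisDumpCmd("rediss://:s3cret@cache.internal:6380")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("argv:", cmd.Args)
	fmt.Println("password in argv:", argvLeaks(cmd, "s3cret"))
}
```

ACL usernames and the fail-open `-u <uri>` branch mentioned in the commit are left out of this sketch.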

mastermanas805 merged commit 4d82efb into master Jun 10, 2026
12 checks passed

mastermanas805 deleted the feat/backup-cadence-rpo-driven branch June 10, 2026 04:19

mastermanas805 mentioned this pull request Jun 10, 2026

feat(worker): R2 — extend customer-backup ladder to mongodb + redis #104

Merged

mastermanas805 merged 1 commit into master from feat/backup-cadence-rpo-driven
