fix(worker): R1 — RPO-driven customer-backup cadence (close plans.yaml ↔ cadence drift)#103
Merged
…nutes

The customer-Postgres backup scheduler chose hourly-vs-daily cadence from a hardcoded `switch canonicalTier()` plus a hardcoded SQL tier allow-list. The hourly tiers (pro/growth/team) happened to match their advertised `rpo_minutes:60`, but the coupling was implicit: changing rpo_minutes in plans.yaml would NOT have moved the cadence, so the product could silently over-promise RPO (effective ~24h vs sold 60m).

Make the cadence RPO-driven from plans.yaml (the same field surfaced on /api/v1/capabilities), so the cadence that MAKES the RPO true is read from the field that PROMISES it; they can no longer drift:

    rpo_minutes in [1,60]                           → hourly (pro/growth/team)
    rpo_minutes > 60                                → once-daily team slot (hobby/hobby_plus = 1440)
    rpo_minutes == 0 OR backup_retention_days == 0  → never (anonymous/free)

- BackupPlanRegistry grows RPOMinutes(tier); the common-plans adapter and the test fake implement it. Registry.RPOMinutes already existed in common/plans.
- Scheduler takes the registry (wired from the same `backupPlans` the runner uses); nil registry falls back to the legacy hardcoded mapping so a misconfigured boot never silently stops paid backups.
- Candidate SELECT now excludes only the never-backup tiers (anonymous/free) instead of an explicit paid-tier allow-list, so a NEW paid tier in plans.yaml is covered automatically, removing the single-site-list failure mode (root rule 18) that once dropped hobby_plus + every _yearly variant.
- Per-row retention==0 guard is a second, independent veto (defence-in-depth) so a stray rpo_minutes on a zero-retention tier still never enqueues.
- Idempotency unchanged: the atomic INSERT … WHERE NOT EXISTS (50-min DB lookback) still prevents a duplicate enqueue within the hour even across River RunOnStart double-ticks (dedupe lives in the DB, not River UniqueOpts).

Tests prove: a pro resource becomes due every hour (all 24 hours), a free resource is never enqueued (SQL-filtered AND registry-gated if it leaks), no duplicate enqueue within an hour, the [1,60]→hourly / >60→daily boundary, the zero-retention veto, the nil-registry legacy fallback, and the adapter's RPOMinutes delegation against the real embedded plans.yaml.

jobs package at 96.0% coverage; all new/changed functions at 100%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
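The RPO-to-cadence rule described in this commit can be sketched in Go roughly as follows. This is a minimal illustration, not the scheduler's actual code: `cadenceFor`, `legacyCadence`, `fakeRegistry`, `demoRegistry`, and the `BackupRetentionDays` method name are hypothetical, the legacy tier mapping is an assumption, and the plan values are taken loosely from the description (pro = 60, hobby = 1440) with retention days chosen for illustration only.

```go
package main

import "fmt"

// Cadence says how often a resource is due for a backup.
type Cadence int

const (
	Never Cadence = iota
	Daily
	Hourly
)

func (c Cadence) String() string {
	switch c {
	case Never:
		return "never"
	case Daily:
		return "daily"
	case Hourly:
		return "hourly"
	default:
		return fmt.Sprintf("Cadence(%d)", int(c))
	}
}

// PlanRegistry is a hypothetical, trimmed-down view of the plan registry.
// RPOMinutes is named in the PR; BackupRetentionDays is an assumed name.
type PlanRegistry interface {
	RPOMinutes(tier string) int
	BackupRetentionDays(tier string) int
}

type plan struct{ rpoMinutes, retentionDays int }

// fakeRegistry is an in-memory stand-in for plans.yaml.
type fakeRegistry map[string]plan

func (r fakeRegistry) RPOMinutes(tier string) int          { return r[tier].rpoMinutes }
func (r fakeRegistry) BackupRetentionDays(tier string) int { return r[tier].retentionDays }

// legacyCadence models the old hardcoded mapping (assumed shape), used only
// when no registry is wired, so paid backups keep running.
func legacyCadence(tier string) Cadence {
	switch tier {
	case "pro", "growth", "team":
		return Hourly
	case "hobby", "hobby_plus":
		return Daily
	default:
		return Never
	}
}

// cadenceFor derives the cadence from the promised RPO:
// [1,60] -> hourly, >60 -> daily, 0 or zero retention -> never.
func cadenceFor(reg PlanRegistry, tier string) Cadence {
	if reg == nil {
		return legacyCadence(tier)
	}
	if reg.BackupRetentionDays(tier) <= 0 {
		return Never // independent veto, even if rpo_minutes is set
	}
	switch rpo := reg.RPOMinutes(tier); {
	case rpo <= 0:
		return Never
	case rpo <= 60:
		return Hourly
	default:
		return Daily
	}
}

// demoRegistry holds illustrative values, not the real plans.yaml.
func demoRegistry() fakeRegistry {
	return fakeRegistry{
		"pro":   {60, 30},
		"hobby": {1440, 7},
		"free":  {0, 0},
		"odd":   {60, 0}, // stray rpo_minutes on a zero-retention tier
	}
}

func main() {
	reg := demoRegistry()
	for _, tier := range []string{"pro", "hobby", "free", "odd"} {
		fmt.Printf("%-6s -> %s\n", tier, cadenceFor(reg, tier))
	}
	fmt.Printf("nil registry, pro -> %s\n", cadenceFor(nil, "pro"))
}
```

Checking retention before RPO keeps the zero-retention veto independent of whatever `rpo_minutes` happens to say, which is the defence-in-depth property the commit describes.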
mastermanas805 added a commit that referenced this pull request on Jun 10, 2026
…104)

* feat(worker): R2: extend customer-backup ladder to mongodb + redis

  The backup ladder backed up postgres/vector ONLY, but the product sells "backups + 1-click restore" for ALL paid resources; Mongo/Redis had ZERO automated backup (worker #103 note + GAP-AUDIT-2026-06-10). Mirrors the existing pg/vector path (cadence/retention/tier-gating/S3 sink/<1KB sanity via the shared pipeline) for the new types:

  - backup_dump.go: mongoDumpRunner (`mongodump --archive`) + redisDumpRunner (`redis-cli --rdb -`) behind the same seam as pgDumpRunner. Both write a RAW archive into the runner's gzip pipeline (NOT mongodump --gzip; the pipeline owns compression, keeping one object layout + sha256 + restore story). Secret hygiene: mongo URI via a 0600 --config file, redis password via REDISCLI_AUTH env, so the credential is never in argv (mirrors SEC-WORKER FINDING-2 PGPASSWORD on the pg path).
  - customer_backup_runner.go: dispatch on resource_type; nil/unsupported is a fail-open 'config' failure, never a panic.
  - customer_backup_scheduler.go: resource_type filter grows to ('postgres','vector','mongodb','redis'); cadence stays tier-driven so free/anonymous are still never enqueued.
  - customer_restore_runner.go: mongorestore --archive --drop branch (the rewind analogue of pg_restore --clean --if-exists). Redis RESTORE is the tracked follow-up (RDB restore-in-place needs pod-level access the worker lacks); redis backups are still taken, sha-verified, and downloadable.
  - metrics: instant_customer_backup_by_type_total{resource_type,result}: per-type success/failure so "Mongo healthy, Redis silently failing" is visible (aggregate counters retained). Primed for all pairs in the test.
  - Dockerfile: apk add mongodb-tools redis so the new tools exist at runtime (postgres:16-alpine ships pg_dump only).

  Tests: scheduler enqueues mongo/redis for pro + never for free; runner dispatches to the right dumper + increments the per-type metric; unsupported type fails clean; mongo restore round-trips through gunzip; redis restore marks failed (guard before download); dump runners keep secrets out of argv; splitRedisURL + writeMongoConfig units.

  Gate green (build+vet+test -short), gofmt clean, golangci 0 issues.

  Operator follow-ups (separate infra repo, no auto-apply): NR alert + Prom rule + dashboard tile + METRICS-CATALOG row for the new counter (rule 25); object-storage bucket backup (DO Spaces versioning/replication); Redis restore; durability COPY must say "Postgres/vector/mongo only" (Redis = backup-only) until Redis restore ships.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(jobs): cover mongo/redis dump+restore error paths (100%-patch gate)

  The R2 mongo/redis backup+restore code left its failure arms untested, failing the diff-cover gate at 84%:

  - backup_dump.go: mongodump fail-open --uri-in-argv branch, mongodump exec-error return, all four writeMongoConfig error arms (CreateTemp/chmod/write/sync), redis-cli --tls arm (rediss://), redis-cli fail-open `-u <uri>` branch, redis-cli exec-error return.
  - customer_restore_runner.go: mongorestore fail-open --uri branch and mongorestore exec-error return.

  The chmod/write/sync failures cannot be forced against a healthy temp file, so writeMongoConfig's fs ops now route through injectable package-var seams (mongoCfgCreateTemp/Chmod/WriteURI/Sync), the same pattern as txtLookupFunc / deployNotifyResolver. Production behavior unchanged; tests swap + restore.

  Exec-error returns are exercised via fake failing CLI scripts on PATH (installFakeFailingBinary), mirroring installFakeBinary.

  Local profile now shows zero uncovered blocks in backup_dump.go and both flagged restore-runner ranges covered.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
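The "credential never in argv" rule for the Redis dump path can be illustrated with a small Go sketch. It is an assumption-laden example rather than the worker's real `redisDumpRunner`: `buildRedisDumpCmd`, `argvContains`, and `envHas` are hypothetical helpers, the URL is made up, and ACL usernames are ignored for brevity. It relies only on `redis-cli` behaviour named in the commit (`--rdb -`, `--tls`, the `REDISCLI_AUTH` environment variable) and builds the command without running it.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
	"os/exec"
	"strings"
)

// buildRedisDumpCmd (hypothetical) turns a redis:// or rediss:// URL into a
// redis-cli command whose argv carries host/port only. The password travels
// in the REDISCLI_AUTH environment variable, so it does not show up in `ps`.
func buildRedisDumpCmd(rawURL string) (*exec.Cmd, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		// Deliberately not wrapping err: url errors can echo the raw URL,
		// which would leak the password into logs.
		return nil, errors.New("invalid redis URL")
	}
	if u.Scheme != "redis" && u.Scheme != "rediss" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	port := u.Port()
	if port == "" {
		port = "6379" // Redis default port
	}
	args := []string{"-h", u.Hostname(), "-p", port}
	if u.Scheme == "rediss" {
		args = append(args, "--tls")
	}
	args = append(args, "--rdb", "-") // stream the RDB snapshot to stdout

	cmd := exec.Command("redis-cli", args...)
	cmd.Env = os.Environ()
	if u.User != nil {
		if pw, ok := u.User.Password(); ok {
			cmd.Env = append(cmd.Env, "REDISCLI_AUTH="+pw)
		}
	}
	return cmd, nil
}

// argvContains reports whether any argument contains the given substring.
func argvContains(args []string, sub string) bool {
	for _, a := range args {
		if strings.Contains(a, sub) {
			return true
		}
	}
	return false
}

// envHas reports whether env holds the exact KEY=VALUE entry.
func envHas(env []string, kv string) bool {
	for _, e := range env {
		if e == kv {
			return true
		}
	}
	return false
}

func main() {
	cmd, err := buildRedisDumpCmd("rediss://:s3cret@cache.example.internal:6380/0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("argv:", cmd.Args)
	fmt.Println("password in argv:", argvContains(cmd.Args, "s3cret"))
	fmt.Println("password in env: ", envHas(cmd.Env, "REDISCLI_AUTH=s3cret"))
}
```

A test in this style can assert on `cmd.Args` and `cmd.Env` directly, which matches the commit's "dump runners keep secrets out of argv" check without needing a live Redis.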
R1: customer-Postgres backup cadence is now driven by `plans.yaml` `rpo_minutes`

Problem

The customer-Postgres backup scheduler (`internal/jobs/customer_backup_scheduler.go`) picked hourly-vs-daily cadence from a hardcoded `switch canonicalTier()` plus a hardcoded SQL tier allow-list. The hourly tiers (pro/growth/team) happened to match their advertised `rpo_minutes:60`, but the coupling was implicit: editing `rpo_minutes` in `../api/plans.yaml` would NOT have moved the cadence. The product could silently over-promise RPO (sell `rpo_minutes:60`, deliver an effective ~24h RPO), which is exactly the drift R1 calls out.

Fix
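Make the cadence decision read the per-tier `rpo_minutes` from `plans.yaml` (the same field surfaced on `/api/v1/capabilities`), so the cadence that makes the RPO true is derived from the field that promises it. They can no longer drift.

- `rpo_minutes` in `[1, 60]`: hourly (pro/growth/team)
- `rpo_minutes` `> 60`: once-daily team slot (hobby/hobby_plus = 1440)
- `rpo_minutes` `0` (or `backup_retention_days == 0`): never (anonymous/free)

What changed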
- `BackupPlanRegistry` gains `RPOMinutes(tier) int`; the common-plans adapter delegates to the pre-existing `common/plans.Registry.RPOMinutes`. The test fake implements it too.
- `NewCustomerBackupSchedulerWorker(db, plans)` now takes the registry, wired in `workers.go` from the same `backupPlans` the runner already uses. A nil registry falls back to the legacy hardcoded mapping so a misconfigured boot never silently stops paid backups.
- The candidate SELECT now excludes only the never-backup tiers (`anonymous`/`free`) instead of an explicit paid-tier allow-list, so a new paid tier in `plans.yaml` is covered automatically. This removes the single-site-list failure mode (root rule 18) that originally dropped `hobby_plus` + every `_yearly` variant for hours.
- A per-row `retention <= 0` guard is a second, independent veto (defence-in-depth): a stray `rpo_minutes` on a zero-retention tier still never enqueues.

Idempotency (respected, unchanged)
The atomic `INSERT … SELECT … WHERE NOT EXISTS (… INTERVAL '50 minutes')` is kept: dedupe lives in the DB, not River `UniqueOpts`. This survives River periodic + `RunOnStart` double-ticks on worker restart (the WeeklyDigest-fired-daily failure mode). A second enqueue within the hour collapses to `RowsAffected=0`.

Tests (unit / in-memory only, no real backups triggered)
- `TestScheduler_InsertsForProTierEveryHour` + `TestScheduler_ProDueEveryHour_AcrossAllHours`: a pro resource becomes due every hour (all 24).
- `TestScheduler_FreeTierNeverEnqueued` + `TestScheduler_FreeTierGate_SkipsEvenIfRowLeaksThrough`: a free resource is never enqueued (SQL-filtered AND registry-gated if it leaks past the WHERE).
- `TestScheduler_NoDuplicateEnqueueWithinHour` + `TestScheduler_DedupeIsAtomicInsert` + `TestScheduler_DedupeLookbackIsUnderOneHour`: no duplicate enqueue within an hour.
- `TestCadenceForTier_RPODriven` (registry-iterating, not a hand-typed list) / `_HourlyCutoffBoundary` / `_ZeroRetentionNeverBacksUp` / `_NilRegistryLegacyFallback`.
- `TestCommonPlanRegistryAdapter_Delegates` extended to assert `RPOMinutes` against the real embedded `plans.yaml` (pro/growth/team = 60, anonymous = 0, hobby > 60).

Gate
`make gate` green locally: `go build ./...`, `go vet ./...`, `go test ./... -short -count=1` all pass. `internal/jobs` package at 96.0% coverage; all new/changed funcs at 100% (`cadenceForTier`, `legacyCadenceForTier`, `RPOMinutes`, ctor, `Work`).

Mongo / Redis / object-storage have ZERO automated backup, despite the marketing promise of "backups + 1-click restore." The entire backup ladder in this repo (`customer_backup_scheduler` / `customer_backup_runner` / `customer_restore_runner`) is Postgres/vector only: it `pg_dump`s `resource_type IN ('postgres','vector')` and nothing else. There is no `mongodump`, no Redis RDB/AOF snapshot, no S3 bucket-to-bucket sync job.

Action: until `mongodump` + a bucket-sync (and their restore paths) ship, every durability/backup surface (the Pro "30-day backups + 1-click restore" headline, `content/llms.txt`, the dashboard backup copy, `/api/v1/capabilities`) must say "Postgres/vector only." A Mongo or object-storage customer who reads "backups included" today gets nothing. Recommend a follow-up ticket (likely api + worker + marketing in one contract PR per root rule 22).

🤖 Generated with Claude Code