feat(worker): R2 - extend customer-backup ladder to mongodb + redis #104
Merged
The backup ladder backed up postgres/vector ONLY, but the product sells "backups + 1-click restore" for ALL paid resources. Mongo/Redis had ZERO automated backup (worker #103 note + GAP-AUDIT-2026-06-10).

Mirrors the existing pg/vector path (cadence/retention/tier-gating/S3 sink/<1KB sanity via the shared pipeline) for the new types:

- backup_dump.go: mongoDumpRunner (`mongodump --archive`) + redisDumpRunner (`redis-cli --rdb -`) behind the same seam as pgDumpRunner. Both write a RAW archive into the runner's gzip pipeline (NOT `mongodump --gzip`: the pipeline owns compression, keeping one object layout + sha256 + restore story). Secret hygiene: mongo URI via a 0600 `--config` file, redis password via `REDISCLI_AUTH` env, so the credential is never in argv (mirrors SEC-WORKER FINDING-2 PGPASSWORD on the pg path).
- customer_backup_runner.go: dispatch on resource_type; nil/unsupported is a fail-open 'config' failure, never a panic.
- customer_backup_scheduler.go: resource_type filter grows to ('postgres','vector','mongodb','redis'); cadence stays tier-driven so free/anonymous are still never enqueued.
- customer_restore_runner.go: `mongorestore --archive --drop` branch (the rewind analogue of `pg_restore --clean --if-exists`). Redis RESTORE is the tracked follow-up (RDB restore-in-place needs pod-level access the worker lacks); redis backups are still taken, sha-verified, and downloadable.
- metrics: `instant_customer_backup_by_type_total{resource_type,result}`, per-type success/failure so "Mongo healthy, Redis silently failing" is visible (aggregate counters retained). Primed for all pairs in the test.
- Dockerfile: `apk add mongodb-tools redis` so the new tools exist at runtime (postgres:16-alpine ships pg_dump only).

Tests: scheduler enqueues mongo/redis for pro + never for free; runner dispatches to the right dumper + increments the per-type metric; unsupported type fails clean; mongo restore round-trips through gunzip; redis restore marks failed (guard before download); dump runners keep secrets out of argv; splitRedisURL + writeMongoConfig units.

Gate green (build+vet+test -short), gofmt clean, golangci 0 issues.

Operator follow-ups (separate infra repo, no auto-apply): NR alert + Prom rule + dashboard tile + METRICS-CATALOG row for the new counter (rule 25); object-storage bucket backup (DO Spaces versioning/replication); Redis restore; durability COPY must say "Postgres/vector/mongo only" (Redis = backup-only) until Redis restore ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The R2 mongo/redis backup+restore code left its failure arms untested, failing the diff-cover gate at 84%:

- backup_dump.go: mongodump fail-open --uri-in-argv branch, mongodump exec-error return, all four writeMongoConfig error arms (CreateTemp/chmod/write/sync), redis-cli --tls arm (rediss://), redis-cli fail-open `-u <uri>` branch, redis-cli exec-error return.
- customer_restore_runner.go: mongorestore fail-open --uri branch and mongorestore exec-error return.

The chmod/write/sync failures cannot be forced against a healthy temp file, so writeMongoConfig's fs ops now route through injectable package-var seams (mongoCfgCreateTemp/Chmod/WriteURI/Sync), the same pattern as txtLookupFunc / deployNotifyResolver. Production behavior unchanged; tests swap + restore. Exec-error returns are exercised via fake failing CLI scripts on PATH (installFakeFailingBinary), mirroring installFakeBinary.

Local profile now shows zero uncovered blocks in backup_dump.go and both flagged restore-runner ranges covered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What & why
The customer-backup ladder backed up postgres/vector ONLY
(`customer_backup_scheduler.go` enqueued `resource_type IN ('postgres','vector')`),
but the product sells "backups + 1-click restore" for ALL paid resources.
Mongo/Redis had ZERO automated backup, the durability gap flagged in the
worker #103 note + GAP-AUDIT-2026-06-10. This closes it by mirroring the
existing pg/vector path (same cadence / retention / tier-gating / S3 sink /
`<1KB` sanity guard, all via the shared pipeline) for the new types.

What's backed up now
| Resource | Backup | Restore |
| --- | --- | --- |
| mongodb | `mongodump --archive` → gzip → S3 | `mongorestore --archive --drop` |
| redis | `redis-cli --rdb -` → gzip → S3 | tracked follow-up (backup-only for now) |

Redis backup ships (taken, sha-verified, retained, downloadable). Redis
restore is the tracked follow-up: an RDB restore-in-place needs pod-level
access (replace `dump.rdb` + restart, or a per-key RESTORE pass) that the
worker can't drive from outside the pod. The restore runner marks a redis
restore row failed with that explicit reason before any S3 download.
Key design decisions
- Both dumpers write a RAW (uncompressed) archive into the runner's existing gzip writer: `mongodump --archive` (NOT `--gzip`, which would double-compress under the pipeline's gzip and break the gunzip→mongorestore symmetry), and `redis-cli --rdb -` streams the RDB to stdout. Object key, sha256-of-gzip, retention, keep-last-N, and restore-gunzip logic are unchanged.
- Secret hygiene: the customer credential never sits in argv during the backup window. Mongo URI is passed via a 0600 `mongodump --config` file; Redis password via the `REDISCLI_AUTH` env. Both fail open to argv on a parse/temp-file error (strictly better than no backup), same posture the pg path documents.
- A nil/unsupported resource_type is a `markFailed` reason=`config`, never a panic. The scheduler SQL only enqueues supported types; the runner guard is defence-in-depth for a manual API backup.
- `backupSupportedResourceType()` anchors both the scheduler SQL set and the runner dispatch (root rule 16/18).
Metrics (rule 25)
New `instant_customer_backup_by_type_total{resource_type, result}`: per-type
success/failure so "Mongo backing up but Redis silently failing" is visible
(the aggregate `*_succeeded_total` / `*_failed_total` counters are retained for
the existing dashboard + `customer-backup-failed.json` alert). Primed for all
(type,result) pairs in `metrics_test.go` so the lazy *Vec series render from
process start.
Dockerfile
`apk add --no-cache mongodb-tools redis`: `postgres:16-alpine` ships `pg_dump`
only; without these the new exec calls would fail at runtime (the runner
fail-opens, but no Mongo/Redis backup would actually succeed).
Tests (all green)
- Scheduler: `TestScheduler_EnqueuesMongoAndRedisForProTier`, `TestScheduler_FreeTierMongoNeverEnqueued`, `TestScheduler_FreeMongoGate_SkipsEvenIfRowLeaksThrough`, `TestScheduler_SQLFilterIncludesAllSupportedTypes`.
- Runner: `TestRunner_MongoHappyPath`, `TestRunner_RedisHappyPath`, `TestRunner_UnsupportedType_MarksFailed`, `TestDumpForResourceType_Dispatch`, `TestBackupSupportedResourceType`.
- Dump cmd: `TestRealMongoDumpRunner_ConfigFileKeepsURIOutOfArgv`, `TestRealRedisDumpRunner_PasswordViaEnv`, `TestRealMongoRestoreRunner_DropAndArchive`, `TestSplitRedisURL`, `TestWriteMongoConfig`.
- Restore: `TestRestoreForResourceType_Dispatch`, `TestRestoreRunner_MongoHappyPath`, `TestRestoreRunner_RedisUnsupported_MarksFailed`.
- Gate: `go build ./... && go vet ./... && go test ./... -short -count=1` green; real gofmt clean (touched files); golangci-lint 0 issues.
Operator follow-ups (separate `infra` repo, no auto-apply)
- `instant_customer_backup_by_type_total`: Prom rule in `prometheus-rules.yaml` (per-type success-ratio < threshold, P1, data durability), NR alert JSON, dashboard tile (stacked by `resource_type`), and a `METRICS-CATALOG.md` row.
- Object-storage bucket backup (DO Spaces versioning/replication); infra/operator, not worker code (out of scope per the brief).
- Durability copy must say Postgres/vector/mongo backup+restore and flag Redis as backup-only until Redis restore ships.
🤖 Generated with Claude Code