feat(worker): R2 — extend customer-backup ladder to mongodb + redis#104

Merged: mastermanas805 merged 2 commits into master from feat/customer-backup-mongo-redis on Jun 10, 2026
@mastermanas805

What & why

The customer-backup ladder backed up postgres/vector ONLY
(customer_backup_scheduler.go enqueued resource_type IN ('postgres','vector')),
but the product sells "backups + 1-click restore" for ALL paid resources.
Mongo/Redis had ZERO automated backup — the durability gap flagged in the
worker #103 note + GAP-AUDIT-2026-06-10. This closes it by mirroring the
existing pg/vector path (same cadence / retention / tier-gating / S3 sink /
<1KB sanity guard, all via the shared pipeline) for the new types.

What's backed up now

| Resource | Backup | Restore |
|---|---|---|
| postgres / vector | ✅ (unchanged) | ✅ (unchanged) |
| mongodb | mongodump --archive → gzip → S3 | mongorestore --archive --drop |
| redis | redis-cli --rdb - → gzip → S3 | ⏳ follow-up (backup-only) |

Redis backup ships (taken, sha-verified, retained, downloadable). Redis
restore is the tracked follow-up: an RDB restore-in-place needs pod-level
access (replace dump.rdb + restart, or a per-key RESTORE pass) that the
worker can't drive from outside the pod. The restore runner marks a redis
restore row failed with that explicit reason before any S3 download.
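
The ordering described above (reject the unsupported type first, download second) can be sketched as follows. This is a minimal illustration, not the runner's actual code: `restoreSupported`, `runRestore`, `errRestoreUnsupported`, and the `download` callback are hypothetical names, and the real runner records the reason on the restore row rather than returning an error.

```go
package main

import (
	"errors"
	"fmt"
)

// errRestoreUnsupported is a hypothetical sentinel; the real runner marks the
// restore row failed with an explicit reason instead.
var errRestoreUnsupported = errors.New("restore not supported for resource type")

// restoreSupported reports whether a restore path exists for the type.
// Redis is backup-only in this PR, so it is deliberately absent.
func restoreSupported(resourceType string) bool {
	switch resourceType {
	case "postgres", "vector", "mongodb":
		return true
	}
	return false
}

// runRestore checks the type first and only then calls download, so an
// unsupported restore never touches object storage.
func runRestore(resourceType string, download func() error) error {
	if !restoreSupported(resourceType) {
		return fmt.Errorf("%w: %s", errRestoreUnsupported, resourceType)
	}
	return download()
}

func main() {
	downloads := 0
	download := func() error { downloads++; return nil }

	fmt.Println(runRestore("redis", download), "downloads:", downloads)
	fmt.Println(runRestore("mongodb", download), "downloads:", downloads)
}
```

Putting the guard ahead of the download keeps a doomed restore from paying for an S3 read and makes the failure reason unambiguous.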

Key design decisions

  • One pipeline, one object layout. Every dump strategy writes a RAW
    (uncompressed) archive into the runner's existing gzip writer: mongodump
    --archive (NOT --gzip, which would double-compress under the pipeline's
    gzip and break the gunzip→mongorestore symmetry), and redis-cli --rdb -
    streams the RDB to stdout. Object key, sha256-of-gzip, retention,
    keep-last-N, and restore-gunzip logic are unchanged.
  • Secret hygiene (mirrors SEC-WORKER FINDING-2 PGPASSWORD on the pg path):
    the customer credential never sits in argv during the backup window. Mongo
    URI is passed via a 0600 mongodump --config file; Redis password via the
    REDISCLI_AUTH env. Both fail open to argv on a parse/temp-file error
    (strictly better than no backup), same posture the pg path documents.
  • Fail-open dispatch. A nil strategy or unsupported type → markFailed
    reason=config, never a panic. The scheduler SQL only enqueues supported
    types; the runner guard is defence-in-depth for a manual API backup.
  • Single source of truth for "what's backed up": backupSupportedResourceType()
    anchors both the scheduler SQL set and the runner dispatch (root rule 16/18).
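
As an illustration of the last two bullets, the sketch below derives both the scheduler's SQL set and the runner's dispatch guard from one list. It is written under assumptions: the signature of `backupSupportedResourceType`, the `supportedTypes` slice, `sqlInList`, `dumpFunc`, and `dumpFor` are all hypothetical shapes, since the PR description names the function but does not show its code.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedTypes is the single list; the SQL filter and the runner guard both
// derive from it, so the two cannot drift apart.
var supportedTypes = []string{"postgres", "vector", "mongodb", "redis"}

// backupSupportedResourceType reports whether a type has a dump strategy.
// The signature is an assumption for this sketch.
func backupSupportedResourceType(t string) bool {
	for _, s := range supportedTypes {
		if s == t {
			return true
		}
	}
	return false
}

// sqlInList renders the list as a quoted SQL IN set. The values are
// compile-time constants, never user input.
func sqlInList() string {
	quoted := make([]string, len(supportedTypes))
	for i, t := range supportedTypes {
		quoted[i] = "'" + t + "'"
	}
	return "(" + strings.Join(quoted, ",") + ")"
}

// dumpFunc stands in for a dump strategy.
type dumpFunc func() error

// dumpFor returns the strategy for a type, or the failure reason "config"
// when the type is unsupported or the strategy is nil. It never panics.
func dumpFor(t string, strategies map[string]dumpFunc) (dumpFunc, string) {
	if !backupSupportedResourceType(t) {
		return nil, "config"
	}
	fn := strategies[t]
	if fn == nil {
		return nil, "config"
	}
	return fn, ""
}

func main() {
	fmt.Println("resource_type IN " + sqlInList())
	_, reason := dumpFor("mysql", nil)
	fmt.Println("mysql:", reason)
}
```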

Metrics (rule 25)

New instant_customer_backup_by_type_total{resource_type, result} — per-type
success/failure so "Mongo backing up but Redis silently failing" is visible
(the aggregate *_succeeded_total / *_failed_total counters are retained for
the existing dashboard + customer-backup-failed.json alert). Primed for all
(type,result) pairs in metrics_test.go so the lazy *Vec series render from
process start.

Dockerfile

apk add --no-cache mongodb-tools redis. postgres:16-alpine ships pg_dump
only; without these packages the new exec calls would fail at runtime (the
runner fails open, but no Mongo/Redis backup would actually succeed).

Tests (all green)

Scheduler: TestScheduler_EnqueuesMongoAndRedisForProTier,
TestScheduler_FreeTierMongoNeverEnqueued,
TestScheduler_FreeMongoGate_SkipsEvenIfRowLeaksThrough,
TestScheduler_SQLFilterIncludesAllSupportedTypes.
Runner: TestRunner_MongoHappyPath, TestRunner_RedisHappyPath,
TestRunner_UnsupportedType_MarksFailed, TestDumpForResourceType_Dispatch,
TestBackupSupportedResourceType.
Dump cmd: TestRealMongoDumpRunner_ConfigFileKeepsURIOutOfArgv,
TestRealRedisDumpRunner_PasswordViaEnv, TestRealMongoRestoreRunner_DropAndArchive,
TestSplitRedisURL, TestWriteMongoConfig.
Restore: TestRestoreForResourceType_Dispatch, TestRestoreRunner_MongoHappyPath,
TestRestoreRunner_RedisUnsupported_MarksFailed.

Gate: go build ./... && go vet ./... && go test ./... -short -count=1 green;
real gofmt clean (touched files); golangci-lint 0 issues.

Operator follow-ups (separate infra repo — no auto-apply)

  1. Rule 25 monitoring for instant_customer_backup_by_type_total: Prom rule
    in prometheus-rules.yaml (per-type success-ratio < threshold, P1 — data
    durability), NR alert JSON, dashboard tile (stacked by resource_type), and
    a METRICS-CATALOG.md row.
  2. Object-storage (DO Spaces) bucket backup — bucket versioning / cross-region
    replication; infra/operator, not worker code (out of scope per the brief).
  3. Redis RESTORE — the tracked follow-up (see above).
  4. Durability COPY must read "Postgres / vector / mongo" for full
    backup+restore and flag Redis as backup-only until Redis restore ships.

🤖 Generated with Claude Code

mastermanas805 and others added 2 commits June 11, 2026 00:03
The backup ladder backed up postgres/vector ONLY, but the product sells
"backups + 1-click restore" for ALL paid resources — Mongo/Redis had
ZERO automated backup (worker #103 note + GAP-AUDIT-2026-06-10).

Mirrors the existing pg/vector path (cadence/retention/tier-gating/S3
sink/<1KB sanity via the shared pipeline) for the new types:

- backup_dump.go: mongoDumpRunner (`mongodump --archive`) + redisDumpRunner
  (`redis-cli --rdb -`) behind the same seam as pgDumpRunner. Both write a
  RAW archive into the runner's gzip pipeline (NOT mongodump --gzip — the
  pipeline owns compression, keeping one object layout + sha256 + restore
  story). Secret hygiene: mongo URI via a 0600 --config file, redis password
  via REDISCLI_AUTH env — credential never in argv (mirrors SEC-WORKER
  FINDING-2 PGPASSWORD on the pg path).
- customer_backup_runner.go: dispatch on resource_type; nil/unsupported is a
  fail-open 'config' failure, never a panic.
- customer_backup_scheduler.go: resource_type filter grows to
  ('postgres','vector','mongodb','redis'); cadence stays tier-driven so
  free/anonymous are still never enqueued.
- customer_restore_runner.go: mongorestore --archive --drop branch (the
  rewind analogue of pg_restore --clean --if-exists). Redis RESTORE is the
  tracked follow-up (RDB restore-in-place needs pod-level access the worker
  lacks) — redis backups are still taken, sha-verified, and downloadable.
- metrics: instant_customer_backup_by_type_total{resource_type,result} —
  per-type success/failure so "Mongo healthy, Redis silently failing" is
  visible (aggregate counters retained). Primed for all pairs in the test.
- Dockerfile: apk add mongodb-tools redis so the new tools exist at runtime
  (postgres:16-alpine ships pg_dump only).
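
The "pipeline owns compression" point in the backup_dump.go item can be sketched as below: the dump tool emits raw bytes, and one shared function gzips them and hashes the compressed stream. This is an illustration only. `compressAndHash`, `roundTrip`, and `minDumpBytes` are hypothetical names, and applying the <1KB guard to the raw dump size is an assumption; the description does not say which size the real guard measures.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"strings"
)

// minDumpBytes is the assumed sanity threshold, applied here to the raw size.
const minDumpBytes = 1024

// compressAndHash streams raw through gzip into sink and returns the hex
// sha256 of the compressed bytes plus the raw byte count.
func compressAndHash(raw io.Reader, sink io.Writer) (string, int64, error) {
	h := sha256.New()
	// Everything gzip emits goes to both the sink and the hash.
	zw := gzip.NewWriter(io.MultiWriter(sink, h))
	n, err := io.Copy(zw, raw)
	if err != nil {
		return "", n, err
	}
	if err := zw.Close(); err != nil { // flushes the gzip trailer
		return "", n, err
	}
	if n < minDumpBytes {
		return "", n, errors.New("dump smaller than sanity threshold")
	}
	return hex.EncodeToString(h.Sum(nil)), n, nil
}

// roundTrip compresses payload, then gunzips the stored object and checks
// both the payload and the digest, the symmetry the restore side relies on.
func roundTrip(payload string) (bool, error) {
	var obj bytes.Buffer
	sum, _, err := compressAndHash(strings.NewReader(payload), &obj)
	if err != nil {
		return false, err
	}
	want := sha256.Sum256(obj.Bytes())
	zr, err := gzip.NewReader(bytes.NewReader(obj.Bytes()))
	if err != nil {
		return false, err
	}
	back, err := io.ReadAll(zr)
	if err != nil {
		return false, err
	}
	return string(back) == payload && sum == hex.EncodeToString(want[:]), nil
}

func main() {
	ok, err := roundTrip(strings.Repeat("raw-archive-bytes ", 200))
	fmt.Println("round trip ok:", ok, "err:", err)
}
```

Had mongodump been run with --gzip, the stored object would be gzip inside gzip, and a single gunzip on restore would hand mongorestore compressed bytes instead of an archive.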

Tests: scheduler enqueues mongo/redis for pro + never for free; runner
dispatches to the right dumper + increments the per-type metric; unsupported
type fails clean; mongo restore round-trips through gunzip; redis restore
marks failed (guard before download); dump runners keep secrets out of argv;
splitRedisURL + writeMongoConfig units. Gate green (build+vet+test -short),
gofmt clean, golangci 0 issues.
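
For readers unfamiliar with the argv concern these tests cover, here is a rough sketch of the Redis side. It assumes a `splitRedisURL` that returns host, port, password, and a TLS flag; the real function's signature is not shown in this PR, and `redisDumpCommand` and `containsSecret` are hypothetical helpers. The redis-cli flags (`-h`, `-p`, `--tls`, `-u`, `--rdb`) and the `REDISCLI_AUTH` variable are the ones named in this description.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitRedisURL breaks a redis:// or rediss:// URL into connection parts.
func splitRedisURL(raw string) (host, port, password string, tls bool, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", false, err
	}
	if u.Scheme != "redis" && u.Scheme != "rediss" {
		return "", "", "", false, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	host = u.Hostname()
	if host == "" {
		return "", "", "", false, fmt.Errorf("missing host")
	}
	port = u.Port()
	if port == "" {
		port = "6379" // Redis default port
	}
	password, _ = u.User.Password() // safe when the URL has no userinfo
	return host, port, password, u.Scheme == "rediss", nil
}

// redisDumpCommand builds redis-cli args and extra env entries. The password
// travels in REDISCLI_AUTH; on a parse error it fails open to -u <uri>.
func redisDumpCommand(raw string) (args, env []string) {
	host, port, password, useTLS, err := splitRedisURL(raw)
	if err != nil {
		// Fail open: the credential lands in argv, but a backup is still taken.
		return []string{"-u", raw, "--rdb", "-"}, nil
	}
	args = []string{"-h", host, "-p", port}
	if useTLS {
		args = append(args, "--tls")
	}
	args = append(args, "--rdb", "-")
	if password != "" {
		env = []string{"REDISCLI_AUTH=" + password}
	}
	return args, env
}

// containsSecret reports whether any argument contains the secret.
func containsSecret(args []string, secret string) bool {
	for _, a := range args {
		if strings.Contains(a, secret) {
			return true
		}
	}
	return false
}

func main() {
	args, env := redisDumpCommand("rediss://:s3cret@cache.internal:6380/0")
	fmt.Println(args, len(env), containsSecret(args, "s3cret"))
}
```

In a real runner the returned env entries would be appended to `os.Environ()` and assigned to `exec.Cmd.Env`, so the password is visible to the child process but not in a process listing.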

Operator follow-ups (separate infra repo, no auto-apply): NR alert + Prom
rule + dashboard tile + METRICS-CATALOG row for the new counter (rule 25);
object-storage bucket backup (DO Spaces versioning/replication); Redis
restore; durability COPY must say "Postgres/vector/mongo only" (Redis =
backup-only) until Redis restore ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The R2 mongo/redis backup+restore code left its failure arms untested,
failing the diff-cover gate at 84%:

- backup_dump.go: mongodump fail-open --uri-in-argv branch, mongodump
  exec-error return, all four writeMongoConfig error arms
  (CreateTemp/chmod/write/sync), redis-cli --tls arm (rediss://),
  redis-cli fail-open `-u <uri>` branch, redis-cli exec-error return.
- customer_restore_runner.go: mongorestore fail-open --uri branch and
  mongorestore exec-error return.

The chmod/write/sync failures cannot be forced against a healthy temp
file, so writeMongoConfig's fs ops now route through injectable
package-var seams (mongoCfgCreateTemp/Chmod/WriteURI/Sync) — same
pattern as txtLookupFunc / deployNotifyResolver. Production behavior
unchanged; tests swap + restore.
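
A minimal sketch of the package-var seam pattern follows, assuming a simplified `writeMongoConfig` that writes a one-line `uri:` YAML file. The seam names loosely follow the ones listed above, but their signatures, the file contents, and the `withFailingSync` helper are assumptions; the real code also has a WriteURI seam that this sketch omits.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Package-level seams: production values are the real filesystem calls;
// tests swap them to force error arms that a healthy temp file never hits.
var (
	mongoCfgCreateTemp = os.CreateTemp
	mongoCfgChmod      = func(f *os.File, mode os.FileMode) error { return f.Chmod(mode) }
	mongoCfgSync       = func(f *os.File) error { return f.Sync() }
)

// writeMongoConfig writes the URI to a 0600 temp file for mongodump --config
// and returns its path. On any error the partial file is removed.
func writeMongoConfig(uri string) (path string, err error) {
	f, err := mongoCfgCreateTemp("", "mongodump-*.yaml")
	if err != nil {
		return "", err
	}
	defer func() {
		f.Close()
		if err != nil {
			os.Remove(f.Name()) // never leave a credential file behind
		}
	}()
	if err = mongoCfgChmod(f, 0o600); err != nil {
		return "", err
	}
	if _, err = fmt.Fprintf(f, "uri: %q\n", uri); err != nil {
		return "", err
	}
	if err = mongoCfgSync(f); err != nil {
		return "", err
	}
	return f.Name(), nil
}

// withFailingSync swaps the sync seam, runs writeMongoConfig, and restores
// the original, the swap-and-restore shape a test would use.
func withFailingSync(uri string) (string, error) {
	orig := mongoCfgSync
	defer func() { mongoCfgSync = orig }()
	mongoCfgSync = func(*os.File) error { return errors.New("injected sync failure") }
	return writeMongoConfig(uri)
}

func main() {
	_, err := withFailingSync("mongodb://user:pass@db.internal/app")
	fmt.Println("forced failure:", err)

	path, err := writeMongoConfig("mongodb://user:pass@db.internal/app")
	if err == nil {
		defer os.Remove(path)
	}
	fmt.Println("wrote config:", path != "", "err:", err)
}
```

The trade-off is mutable package state: tests that swap a seam must restore it and should not run in parallel with others that touch the same variable.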

Exec-error returns are exercised via fake failing CLI scripts on PATH
(installFakeFailingBinary), mirroring installFakeBinary. Local profile
now shows zero uncovered blocks in backup_dump.go and both flagged
restore-runner ranges covered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 10, 2026 18:55
@mastermanas805 mastermanas805 merged commit 80be49b into master Jun 10, 2026
12 checks passed