Skip to content

fix(customer-backup): recovery UPDATE NULL started_at + mongo/redis auth misclassification (P1)#106

Merged
mastermanas805 merged 1 commit into
masterfrom
fix/customer-backup-stuck-row-and-dump
Jun 11, 2026
Merged

fix(customer-backup): recovery UPDATE NULL started_at + mongo/redis auth misclassification (P1)#106
mastermanas805 merged 1 commit into
masterfrom
fix/customer-backup-stuck-row-and-dump

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

P1: customer backups failing systemically — two code defects fixed

A customer got a "Backup failed — We couldn't read your database" email for their mongodb resource (backup 3feeaab1-280e-469d-af3b-c88940891d22). Investigation surfaced two distinct code bugs plus an underlying infra/data incident (operator action — see bottom).

BUG 1 — recoverStuckRows binds NULL to a NOT-NULL column (recovery wedged + log flood)

recoverStuckRows ran UPDATE resource_backups SET status='pending', started_at=NULL ..., but resource_backups.started_at is TIMESTAMPTZ NOT NULL (api migration 031_backups.sql). The explicit NULL bind overrides the DEFAULT and violates the constraint on every 30-60s tick. Live log (in-cluster worker, commit 6692610):

jobs.customer_backup_runner.stuck_row_recovery_failed
error: pq: null value in column "started_at" of relation "resource_backups" violates not-null constraint

Effect: stuck-row recovery never worked (orphaned running rows are unreachable forever) and the worker log flooded.

Does this wedge all customer backups? No. Work() calls recoverStuckRows(ctx) and ignores its result, then proceeds to the pending sweep — so the crash logs and continues; it does not abort the tick. The instant_customer_backup_succeeded_total=0 reported on one pod was a per-pod metric artifact: a second running worker pod shows instant_customer_backup_succeeded_total 1 and by_type{postgres,ok}=1. Customer backups are succeeding. BUG 1 is real (recovery is 100% broken + log flood) but is not the systemic cause.

Fix: drop the started_at = NULL clause. Leaving the stale value is correct — the re-claim in processBackup runs SET started_at = now(), and the recovery WHERE floor (started_at < now() - timeout) only re-matches rows that have been running again past the timeout.

BUG 2 — auth failures from mongo/redis misclassified as transient dump

backupFailReason only recognized Postgres auth strings, so the real mongo/redis errors were bucketed dump ("briefly unreachable, we'll retry") instead of auth (SLA-relevant, pages ops). Live internal_detail:

mongodump: ... auth error: sasl conversation error: unable to authenticate using mechanism "SCRAM-SHA-256": ... socket was unexpectedly closed: EOF
redis-cli --rdb: AUTH failed: WRONGPASS invalid username-password pair or user is disabled.

The customer's failure email said "couldn't read your database (briefly unreachable), we'll retry" — wrong: a credential failure never self-heals on retry. Fix: classifier now matches all three backends' auth dialects (mongo auth error/unable to authenticate/sasl; redis wrongpass/noauth/invalid username-password).

Live evidence (resource_backups query)

id=3feeaab1-...891d22  resource=587afeb2-...c8528c  status=failed  kind=scheduled
error_summary="We couldn't read your database for this backup (it may have been briefly unreachable)..."
started_at 03:51:23.019  finished_at 03:51:23.777  (~0.76s — mongodump WAS reached, failed fast at auth)

24h aggregate: postgres 44 config / 22 auth / 5 ok / 16 retained-ok; redis 26 dump(=auth); mongodb 9 dump(=auth).


Rule-17 coverage block

Symptom:        recoverStuckRows NULL started_at crash + mongo/redis "couldn't read" mislabel
Enumeration:    rg started_at internal/jobs ; rg backupFailReason internal/jobs ; live worker logs + resource_backups query
Sites found:    BUG1 = 1 (recoverStuckRows UPDATE); BUG2 = 1 (backupFailReason)
Sites touched:  BUG1 = 1/1 ; BUG2 = 1/1 (only classifier; sanitizedBackupFailure copy already correct per reason)
Coverage test:  TestRecoverStuckRows_NeverBindsNullStartedAt (inspects literal SQL, reds on NULL bind);
                TestBackupFailReason/{prod_mongo_SCRAM_auth,prod_redis_WRONGPASS,redis_NOAUTH,...}
Live verified:  in-cluster worker logs + platform-DB resource_backups SELECT (above); both regression tests
                proven to RED on reverted code, GREEN on fix.

Tests

  • TestRecoverStuckRows_NeverBindsNullStartedAt — sqlmock QueryMatcherFunc captures the recovery SQL, fails if it sets started_at = NULL (sqlmock doesn't enforce NOT NULL, which is why the prior regex test missed this). Verified: FAIL on buggy code, PASS on fix.
  • TestBackupFailReason — added the exact prod mongo SCRAM + redis WRONGPASS/NOAUTH stderr (→auth) plus transient mongo/redis cases (→dump). Verified: mongo/redis auth cases FAIL on old classifier, PASS on fix.
  • go build ./... + go vet ./... clean. go test ./internal/jobs/ -short green (97.7% pkg). Changed funcs recoverStuckRows + backupFailReason at 100%.

⚠️ Operator action required (infra/data — NOT fixed by this code PR)

The customer's mongo backup failure is a durability incident, not a backup-code bug. The shared mongodb pod in instant-data was OOMKilled (exitCode:137, restarted 2026-06-10 23:49) and lost the customer's database: as root, db_poold8ad5fc1 no longer exists and getUsers() is empty (only db_pool_6b266096-... survived). Same class for the 26 redis WRONGPASS failures (shared redis pod restart dropped ACL users). The backups correctly cannot read databases/users that no longer exist.

Operator must: (1) treat the shared mongo + redis data loss as a P0 durability incident (matches memory project_infra_durability_exposure_2026_06_10.md — Mongo/Redis unbacked); (2) restore/re-provision the affected pooled mongo/redis tenants; (3) put the shared mongo/redis on durable storage + backups. After this PR, such credential failures will correctly classify auth and page ops instead of silently retrying.

🤖 Generated with Claude Code

…tarted_at + classify mongo/redis auth failures

BUG 1 (P1 — recovery wedged + log flood): recoverStuckRows ran
`UPDATE resource_backups SET status='pending', started_at=NULL ...`
but resource_backups.started_at is `TIMESTAMPTZ NOT NULL` (api mig
031_backups.sql). The explicit NULL bind violated the constraint on
EVERY 30-60s tick, so stuck-row recovery NEVER worked and the worker
log flooded with `stuck_row_recovery_failed: pq: null value in column
"started_at" ... violates not-null constraint`. Drop the started_at
clause: leaving the stale value is correct because the re-claim in
processBackup runs `SET started_at=now()`, and the WHERE floor only
re-matches rows that have been 'running' again past the timeout. The
old regex-matcher test stayed green because sqlmock doesn't enforce
NOT NULL — the new test inspects the literal SQL and reds on NULL.

BUG 2 (misclassification surfaced by the 2026-06-11 mongo backup P1):
backupFailReason only knew Postgres auth strings, so mongodump
("auth error: ... SCRAM-SHA-256") and redis-cli ("WRONGPASS",
"NOAUTH") credential failures were bucketed as transient "dump" —
telling the customer "briefly unreachable, we'll retry" for a
non-self-healing credential failure and NOT paging ops. Extend the
classifier to match all three backends' auth dialects.

Note: the customer's actual mongo backup failure is an underlying
DATA/INFRA incident (the shared mongo pod OOMKilled and lost
db_poold8ad5fc1 + its user) — operator action required; this commit
fixes the two CODE defects (broken recovery + mislabel) the incident
exposed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 11, 2026 04:54
@mastermanas805 mastermanas805 merged commit e7d23c6 into master Jun 11, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant