fix(customer-backup): recovery UPDATE NULL started_at + mongo/redis auth misclassification (P1)#106
Merged
Conversation
…tarted_at + classify mongo/redis auth failures
BUG 1 (P1 — recovery wedged + log flood): recoverStuckRows ran
`UPDATE resource_backups SET status='pending', started_at=NULL ...`
but resource_backups.started_at is `TIMESTAMPTZ NOT NULL` (api mig
031_backups.sql). The explicit NULL bind violated the constraint on
EVERY 30-60s tick, so stuck-row recovery NEVER worked and the worker
log flooded with `stuck_row_recovery_failed: pq: null value in column
"started_at" ... violates not-null constraint`. Drop the started_at
clause: leaving the stale value is correct because the re-claim in
processBackup runs `SET started_at=now()`, and the WHERE floor only
re-matches rows that have been 'running' again past the timeout. The
old regex-matcher test stayed green because sqlmock doesn't enforce
NOT NULL — the new test inspects the literal SQL and reds on NULL.
BUG 2 (misclassification surfaced by the 2026-06-11 mongo backup P1):
backupFailReason only knew Postgres auth strings, so mongodump
("auth error: ... SCRAM-SHA-256") and redis-cli ("WRONGPASS",
"NOAUTH") credential failures were bucketed as transient "dump" —
telling the customer "briefly unreachable, we'll retry" for a
non-self-healing credential failure and NOT paging ops. Extend the
classifier to match all three backends' auth dialects.
Note: the customer's actual mongo backup failure is an underlying
DATA/INFRA incident (the shared mongo pod OOMKilled and lost
db_poold8ad5fc1 + its user) — operator action required; this commit
fixes the two CODE defects (broken recovery + mislabel) the incident
exposed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
P1: customer backups failing systemically — two code defects fixed
A customer got a "Backup failed — We couldn't read your database" email for their mongodb resource (backup
3feeaab1-280e-469d-af3b-c88940891d22). Investigation surfaced two distinct code bugs plus an underlying infra/data incident (operator action — see bottom).BUG 1 —
recoverStuckRowsbinds NULL to a NOT-NULL column (recovery wedged + log flood)recoverStuckRowsranUPDATE resource_backups SET status='pending', started_at=NULL ..., butresource_backups.started_atisTIMESTAMPTZ NOT NULL(api migration031_backups.sql). The explicit NULL bind overrides the DEFAULT and violates the constraint on every 30-60s tick. Live log (in-cluster worker, commit6692610):Effect: stuck-row recovery never worked (orphaned
runningrows are unreachable forever) and the worker log flooded.Does this wedge all customer backups? No.
Work()callsrecoverStuckRows(ctx)and ignores its result, then proceeds to the pending sweep — so the crash logs and continues; it does not abort the tick. Theinstant_customer_backup_succeeded_total=0reported on one pod was a per-pod metric artifact: a second running worker pod showsinstant_customer_backup_succeeded_total 1andby_type{postgres,ok}=1. Customer backups are succeeding. BUG 1 is real (recovery is 100% broken + log flood) but is not the systemic cause.Fix: drop the
started_at = NULLclause. Leaving the stale value is correct — the re-claim inprocessBackuprunsSET started_at = now(), and the recoveryWHEREfloor (started_at < now() - timeout) only re-matches rows that have beenrunningagain past the timeout.BUG 2 — auth failures from mongo/redis misclassified as transient
dumpbackupFailReasononly recognized Postgres auth strings, so the real mongo/redis errors were bucketeddump("briefly unreachable, we'll retry") instead ofauth(SLA-relevant, pages ops). Live internal_detail:The customer's failure email said "couldn't read your database (briefly unreachable), we'll retry" — wrong: a credential failure never self-heals on retry. Fix: classifier now matches all three backends' auth dialects (mongo
auth error/unable to authenticate/sasl; rediswrongpass/noauth/invalid username-password).Live evidence (
resource_backupsquery)24h aggregate: postgres 44 config / 22 auth / 5 ok / 16 retained-ok; redis 26
dump(=auth); mongodb 9dump(=auth).Rule-17 coverage block
Tests
TestRecoverStuckRows_NeverBindsNullStartedAt— sqlmockQueryMatcherFunccaptures the recovery SQL, fails if it setsstarted_at = NULL(sqlmock doesn't enforce NOT NULL, which is why the prior regex test missed this). Verified: FAIL on buggy code, PASS on fix.TestBackupFailReason— added the exact prod mongo SCRAM + redis WRONGPASS/NOAUTH stderr (→auth) plus transient mongo/redis cases (→dump). Verified: mongo/redis auth cases FAIL on old classifier, PASS on fix.go build ./...+go vet ./...clean.go test ./internal/jobs/ -shortgreen (97.7% pkg). Changed funcsrecoverStuckRows+backupFailReasonat 100%.The customer's mongo backup failure is a durability incident, not a backup-code bug. The shared
mongodbpod ininstant-datawas OOMKilled (exitCode:137, restarted2026-06-10 23:49) and lost the customer's database: as root,db_poold8ad5fc1no longer exists andgetUsers()is empty (onlydb_pool_6b266096-...survived). Same class for the 26 redisWRONGPASSfailures (shared redis pod restart dropped ACL users). The backups correctly cannot read databases/users that no longer exist.Operator must: (1) treat the shared mongo + redis data loss as a P0 durability incident (matches memory
project_infra_durability_exposure_2026_06_10.md— Mongo/Redis unbacked); (2) restore/re-provision the affected pooled mongo/redis tenants; (3) put the shared mongo/redis on durable storage + backups. After this PR, such credential failures will correctly classifyauthand page ops instead of silently retrying.🤖 Generated with Claude Code