fix(stack): atomic env-var merge — no lost updates under concurrency (bug-bash #10) by mastermanas805 · Pull Request #235 · InstaNode-dev/api

mastermanas805 · 2026-06-03T19:46:31Z

Summary

PATCH /stacks/:slug/env did a non-atomic read-modify-write: GetStackEnvVars (load) → merge the delta in Go → UpdateStackEnvVars (blind full overwrite). Two concurrent PATCHes — PATCH{A=1} and PATCH{B=2} — both read {} and one overwrote the other, silently dropping a key.

This PR replaces that sequence with a single atomic models.MergeStackEnvVars: one transaction that takes a SELECT ... FOR UPDATE row lock, merges inside the tx, checks the 64KiB cap before the write, and commits. Concurrent PATCHes now serialize — the second blocks on the lock and reads the first's committed result instead of a stale snapshot.

Handler UpdateEnv now calls MergeStackEnvVars once. Error mapping preserved: ErrStackEnvVarsTooLarge→413 env_too_large, ErrStackNotFound→404 not_found, else 503 persist_failed. Audit keys_set = len(body.Env) - deletes math preserved (a delete is counted only when the key was present, matching the old loop's effective behavior on the merged set).
GetStackEnvVars + UpdateStackEnvVars are kept (other call sites: promote, custom-domain, log streaming, redeploy).

Coverage block

Symptom:        PATCH /stacks/:slug/env loses a key when two PATCHes race
                (non-atomic load→merge→overwrite; lost update).
Enumeration:    rg -F 'models.UpdateStackEnvVars' / 'GetStackEnvVars'  → single mutate site (UpdateEnv);
                other GetStackEnvVars readers (promote x2, redeploy) are read-only, not in the race.
Sites found:    1  (handler UpdateEnv mutate path)
Sites touched:  1  (rewired to one atomic MergeStackEnvVars call)
Coverage test:  models.TestMergeStackEnvVars_ConcurrentPatchesNoLostUpdate — 16 concurrent
                disjoint-key merges vs a real seeded row; asserts all 16 survive (fails on the
                old non-atomic path, passes under -race). Plus models.TestMergeStackEnvVars_Branches
                (sqlmock, 100% of merge arms: begin/select/unmarshal/too-large-rollback/update/
                commit/happy + present-only delete counting).
Live verified:  DB-gated tests run in CI (TEST_DATABASE_URL). Locally proven against
                postgres://localhost:5432/instant_dev_test with -race: merge tests green;
                full UpdateEnv + stack mid-handler-503 fault suite green.

Test-harness note (the fault-injection gotcha)

The atomic merge collapses the old two DB-error surfaces (fetch_failed for the read, persist_failed for the write) into one persist_failed surface. I chose option (b) — repurpose the two stack-env fault tests rather than rewire the shared faultConn — because it is lower blast radius (no change to the harness that ~15 other fault tests share).

Notably, the merge's tx-internal SELECT ... FOR UPDATE and UPDATE do still fault through the shared faultConn: lib/pq's driver.Tx implements neither QueryerContext nor ExecerContext, so database/sql routes sql.Tx.QueryContext/ExecContext back through the conn-level faultConn methods, which honor the failAfter budget. So:

TestStackFinal2_UpdateEnv_MergeSelectFailed (failAfter=2) — faults the tx SELECT → persist_failed.
TestStackFinal2_UpdateEnv_PersistFailed (failAfter=3) — faults the tx UPDATE → persist_failed.
TestStackFinal2_UpdateEnv_MergeRowVanished_404 (new) — a purpose-built local forUpdateVanish driver returns zero rows only for the FOR UPDATE query, so GetStackBySlug finds the row but the merge sees none → ErrStackNotFound → 404. Covers the handler's NotFound arm (a TOCTOU that can't be timed against a live row).
TestStackUpdateEnv_MidHandler503 (sweep) and all other stack mid-handler-503 fault tests stay green.

I added an accurate comment to faultConn.BeginTx documenting why in-tx queries remain fault-injectable.

🤖 Generated with Claude Code

…(bug-bash #10) PATCH /stacks/:slug/env did a non-atomic read-modify-write (GetStackEnvVars → merge-in-Go → UpdateStackEnvVars). Two concurrent PATCHes with disjoint keys both read the same snapshot and the second blind-overwrote the first, silently dropping a key. Add models.MergeStackEnvVars: one transaction that takes a SELECT ... FOR UPDATE row lock, merges the PATCH delta inside the tx, checks the 64KiB cap before the write, and commits. Concurrent PATCHes now serialize — the second reads the first's committed result. The handler calls it once; error mapping is preserved (TooLarge→413, NotFound→404, else 503). GetStackEnvVars/UpdateStackEnvVars are kept (other callers). Tests: - models.TestMergeStackEnvVars_ConcurrentPatchesNoLostUpdate: 16 concurrent disjoint-key merges against a real seeded row; asserts all 16 keys survive (the lost-update regression; passes under -race). - models.TestMergeStackEnvVars_Branches: sqlmock coverage of every arm (begin/select/unmarshal/too-large-rollback/update/commit/happy + present-only delete counting). 100% func coverage. - The two stack-env fault tests are repurposed: the atomic merge collapses the old fetch_failed/persist_failed split into ONE persist_failed surface (the merge's tx-internal SELECT ... FOR UPDATE and UPDATE still fault through the shared faultConn, since lib/pq's driver.Tx has no QueryerContext and falls back to the conn). A new forUpdateVanish driver covers the handler's merge-NotFound→404 arm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…eout, honest keys_set (panel #7) (#238) Design-review follow-up to #235. Three hardenings to the atomic env merge: 1. In-tx teardown guard (closes a TOCTOU). The "stack is mid-teardown" check was a pre-read in the handler (stack.Status == deleting before MergeStackEnvVars) — racy: the teardown worker can flip status between GetStackBySlug and the merge, committing an env change onto a deleting stack. The SELECT now fetches status under the FOR UPDATE lock and returns models.ErrStackDeleting (→ 409 stack_deleting); the handler pre-read is removed so the in-tx check is authoritative. Same posture as the resource-DELETE / SetTTL hardenings this session. 2. SET LOCAL lock_timeout = '3s' as the first tx statement, so a PATCH that races a long-held row lock fails fast to a retryable 503 instead of hanging the request goroutine indefinitely. 3. keys_set audit count now counts only actual upserts (v != ""). The old `len(body.Env) - deletes` over-counted a no-op delete (empty value for an absent key: not a delete, not a set), making the rule-12 audit lie. Tests: TestMergeStackEnvVars_Branches rewritten for the new SQL (SET LOCAL exec + status,env_vars SELECT) and adds the lock_timeout-error and status='deleting'→ErrStackDeleting cases — MergeStackEnvVars stays 100%. The existing real-DB deleting handler tests now exercise the in-tx path deterministically. The two stack-env fault tests' failAfter bumped +1 for the added in-tx SET LOCAL exec (which the fault harness counts). Verified vs real Postgres+Redis; all changed lines covered. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mastermanas805 enabled auto-merge (squash) June 3, 2026 19:46

mastermanas805 added 4 commits June 4, 2026 01:27

Merge branch 'master' into fix/stack-atomic-env-merge-bugbash-10

d0eb82b

Merge branch 'master' into fix/stack-atomic-env-merge-bugbash-10

79b92ce

Merge branch 'master' into fix/stack-atomic-env-merge-bugbash-10

acf5686

Merge branch 'master' into fix/stack-atomic-env-merge-bugbash-10

d176106

mastermanas805 merged commit 43bef6b into master Jun 3, 2026
18 checks passed

mastermanas805 mentioned this pull request Jun 4, 2026

fix(stack): harden MergeStackEnvVars — in-tx deleting guard, lock_timeout, honest keys_set (panel #7) #238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(stack): atomic env-var merge — no lost updates under concurrency (bug-bash #10)#235

fix(stack): atomic env-var merge — no lost updates under concurrency (bug-bash #10)#235
mastermanas805 merged 5 commits into
masterfrom
fix/stack-atomic-env-merge-bugbash-10

mastermanas805 commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mastermanas805 commented Jun 3, 2026

Summary

Coverage block

Test-harness note (the fault-injection gotcha)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant