Skip to content

fix(stack): atomic env-var merge — no lost updates under concurrency (bug-bash #10)#235

Merged
mastermanas805 merged 5 commits into
masterfrom
fix/stack-atomic-env-merge-bugbash-10
Jun 3, 2026
Merged

fix(stack): atomic env-var merge — no lost updates under concurrency (bug-bash #10)#235
mastermanas805 merged 5 commits into
masterfrom
fix/stack-atomic-env-merge-bugbash-10

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

Summary

PATCH /stacks/:slug/env did a non-atomic read-modify-write: GetStackEnvVars (load) → merge the delta in Go → UpdateStackEnvVars (blind full overwrite). Two concurrent PATCHes — PATCH{A=1} and PATCH{B=2} — both read {} and one overwrote the other, silently dropping a key.

This PR replaces that sequence with a single atomic models.MergeStackEnvVars: one transaction that takes a SELECT ... FOR UPDATE row lock, merges inside the tx, checks the 64KiB cap before the write, and commits. Concurrent PATCHes now serialize — the second blocks on the lock and reads the first's committed result instead of a stale snapshot.

  • Handler UpdateEnv now calls MergeStackEnvVars once. Error mapping preserved: ErrStackEnvVarsTooLarge→413 env_too_large, ErrStackNotFound→404 not_found, else 503 persist_failed. Audit keys_set = len(body.Env) - deletes math preserved (a delete is counted only when the key was present, matching the old loop's effective behavior on the merged set).
  • GetStackEnvVars + UpdateStackEnvVars are kept (other call sites: promote, custom-domain, log streaming, redeploy).

Coverage block

Symptom:        PATCH /stacks/:slug/env loses a key when two PATCHes race
                (non-atomic load→merge→overwrite; lost update).
Enumeration:    rg -F 'models.UpdateStackEnvVars' / 'GetStackEnvVars'  → single mutate site (UpdateEnv);
                other GetStackEnvVars readers (promote x2, redeploy) are read-only, not in the race.
Sites found:    1  (handler UpdateEnv mutate path)
Sites touched:  1  (rewired to one atomic MergeStackEnvVars call)
Coverage test:  models.TestMergeStackEnvVars_ConcurrentPatchesNoLostUpdate — 16 concurrent
                disjoint-key merges vs a real seeded row; asserts all 16 survive (fails on the
                old non-atomic path, passes under -race). Plus models.TestMergeStackEnvVars_Branches
                (sqlmock, 100% of merge arms: begin/select/unmarshal/too-large-rollback/update/
                commit/happy + present-only delete counting).
Live verified:  DB-gated tests run in CI (TEST_DATABASE_URL). Locally proven against
                postgres://localhost:5432/instant_dev_test with -race: merge tests green;
                full UpdateEnv + stack mid-handler-503 fault suite green.

Test-harness note (the fault-injection gotcha)

The atomic merge collapses the old two DB-error surfaces (fetch_failed for the read, persist_failed for the write) into one persist_failed surface. I chose option (b) — repurpose the two stack-env fault tests rather than rewire the shared faultConn — because it is lower blast radius (no change to the harness that ~15 other fault tests share).

Notably, the merge's tx-internal SELECT ... FOR UPDATE and UPDATE do still fault through the shared faultConn: lib/pq's driver.Tx implements neither QueryerContext nor ExecerContext, so database/sql routes sql.Tx.QueryContext/ExecContext back through the conn-level faultConn methods, which honor the failAfter budget. So:

  • TestStackFinal2_UpdateEnv_MergeSelectFailed (failAfter=2) — faults the tx SELECT → persist_failed.
  • TestStackFinal2_UpdateEnv_PersistFailed (failAfter=3) — faults the tx UPDATE → persist_failed.
  • TestStackFinal2_UpdateEnv_MergeRowVanished_404 (new) — a purpose-built local forUpdateVanish driver returns zero rows only for the FOR UPDATE query, so GetStackBySlug finds the row but the merge sees none → ErrStackNotFound → 404. Covers the handler's NotFound arm (a TOCTOU that can't be timed against a live row).
  • TestStackUpdateEnv_MidHandler503 (sweep) and all other stack mid-handler-503 fault tests stay green.

I added an accurate comment to faultConn.BeginTx documenting why in-tx queries remain fault-injectable.

🤖 Generated with Claude Code

…(bug-bash #10)

PATCH /stacks/:slug/env did a non-atomic read-modify-write
(GetStackEnvVars → merge-in-Go → UpdateStackEnvVars). Two concurrent
PATCHes with disjoint keys both read the same snapshot and the second
blind-overwrote the first, silently dropping a key.

Add models.MergeStackEnvVars: one transaction that takes a SELECT ...
FOR UPDATE row lock, merges the PATCH delta inside the tx, checks the
64KiB cap before the write, and commits. Concurrent PATCHes now
serialize — the second reads the first's committed result. The handler
calls it once; error mapping is preserved (TooLarge→413, NotFound→404,
else 503). GetStackEnvVars/UpdateStackEnvVars are kept (other callers).

Tests:
- models.TestMergeStackEnvVars_ConcurrentPatchesNoLostUpdate: 16
  concurrent disjoint-key merges against a real seeded row; asserts all
  16 keys survive (the lost-update regression; passes under -race).
- models.TestMergeStackEnvVars_Branches: sqlmock coverage of every arm
  (begin/select/unmarshal/too-large-rollback/update/commit/happy +
  present-only delete counting). 100% func coverage.
- The two stack-env fault tests are repurposed: the atomic merge
  collapses the old fetch_failed/persist_failed split into ONE
  persist_failed surface (the merge's tx-internal SELECT ... FOR UPDATE
  and UPDATE still fault through the shared faultConn, since lib/pq's
  driver.Tx has no QueryerContext and falls back to the conn). A new
  forUpdateVanish driver covers the handler's merge-NotFound→404 arm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 3, 2026 19:46
@mastermanas805 mastermanas805 merged commit 43bef6b into master Jun 3, 2026
18 checks passed
mastermanas805 added a commit that referenced this pull request Jun 4, 2026
…eout, honest keys_set (panel #7) (#238)

Design-review follow-up to #235. Three hardenings to the atomic env merge:

1. In-tx teardown guard (closes a TOCTOU). The "stack is mid-teardown"
   check was a pre-read in the handler (stack.Status == deleting before
   MergeStackEnvVars) — racy: the teardown worker can flip status between
   GetStackBySlug and the merge, committing an env change onto a deleting
   stack. The SELECT now fetches status under the FOR UPDATE lock and
   returns models.ErrStackDeleting (→ 409 stack_deleting); the handler
   pre-read is removed so the in-tx check is authoritative. Same posture as
   the resource-DELETE / SetTTL hardenings this session.

2. SET LOCAL lock_timeout = '3s' as the first tx statement, so a PATCH that
   races a long-held row lock fails fast to a retryable 503 instead of
   hanging the request goroutine indefinitely.

3. keys_set audit count now counts only actual upserts (v != ""). The old
   `len(body.Env) - deletes` over-counted a no-op delete (empty value for an
   absent key: not a delete, not a set), making the rule-12 audit lie.

Tests: TestMergeStackEnvVars_Branches rewritten for the new SQL (SET LOCAL
exec + status,env_vars SELECT) and adds the lock_timeout-error and
status='deleting'→ErrStackDeleting cases — MergeStackEnvVars stays 100%.
The existing real-DB deleting handler tests now exercise the in-tx path
deterministically. The two stack-env fault tests' failAfter bumped +1 for
the added in-tx SET LOCAL exec (which the fault harness counts). Verified
vs real Postgres+Redis; all changed lines covered.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant