fix(stack): atomic env-var merge — no lost updates under concurrency (bug-bash #10)#235
Merged
Merged
Conversation
…(bug-bash #10) PATCH /stacks/:slug/env did a non-atomic read-modify-write (GetStackEnvVars → merge-in-Go → UpdateStackEnvVars). Two concurrent PATCHes with disjoint keys both read the same snapshot and the second blind-overwrote the first, silently dropping a key. Add models.MergeStackEnvVars: one transaction that takes a SELECT ... FOR UPDATE row lock, merges the PATCH delta inside the tx, checks the 64KiB cap before the write, and commits. Concurrent PATCHes now serialize — the second reads the first's committed result. The handler calls it once; error mapping is preserved (TooLarge→413, NotFound→404, else 503). GetStackEnvVars/UpdateStackEnvVars are kept (other callers). Tests: - models.TestMergeStackEnvVars_ConcurrentPatchesNoLostUpdate: 16 concurrent disjoint-key merges against a real seeded row; asserts all 16 keys survive (the lost-update regression; passes under -race). - models.TestMergeStackEnvVars_Branches: sqlmock coverage of every arm (begin/select/unmarshal/too-large-rollback/update/commit/happy + present-only delete counting). 100% func coverage. - The two stack-env fault tests are repurposed: the atomic merge collapses the old fetch_failed/persist_failed split into ONE persist_failed surface (the merge's tx-internal SELECT ... FOR UPDATE and UPDATE still fault through the shared faultConn, since lib/pq's driver.Tx has no QueryerContext and falls back to the conn). A new forUpdateVanish driver covers the handler's merge-NotFound→404 arm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mastermanas805
added a commit
that referenced
this pull request
Jun 4, 2026
…eout, honest keys_set (panel #7) (#238) Design-review follow-up to #235. Three hardenings to the atomic env merge: 1. In-tx teardown guard (closes a TOCTOU). The "stack is mid-teardown" check was a pre-read in the handler (stack.Status == deleting before MergeStackEnvVars) — racy: the teardown worker can flip status between GetStackBySlug and the merge, committing an env change onto a deleting stack. The SELECT now fetches status under the FOR UPDATE lock and returns models.ErrStackDeleting (→ 409 stack_deleting); the handler pre-read is removed so the in-tx check is authoritative. Same posture as the resource-DELETE / SetTTL hardenings this session. 2. SET LOCAL lock_timeout = '3s' as the first tx statement, so a PATCH that races a long-held row lock fails fast to a retryable 503 instead of hanging the request goroutine indefinitely. 3. keys_set audit count now counts only actual upserts (v != ""). The old `len(body.Env) - deletes` over-counted a no-op delete (empty value for an absent key: not a delete, not a set), making the rule-12 audit lie. Tests: TestMergeStackEnvVars_Branches rewritten for the new SQL (SET LOCAL exec + status,env_vars SELECT) and adds the lock_timeout-error and status='deleting'→ErrStackDeleting cases — MergeStackEnvVars stays 100%. The existing real-DB deleting handler tests now exercise the in-tx path deterministically. The two stack-env fault tests' failAfter bumped +1 for the added in-tx SET LOCAL exec (which the fault harness counts). Verified vs real Postgres+Redis; all changed lines covered. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PATCH /stacks/:slug/envdid a non-atomic read-modify-write:GetStackEnvVars(load) → merge the delta in Go →UpdateStackEnvVars(blind full overwrite). Two concurrent PATCHes —PATCH{A=1}andPATCH{B=2}— both read{}and one overwrote the other, silently dropping a key.This PR replaces that sequence with a single atomic
models.MergeStackEnvVars: one transaction that takes aSELECT ... FOR UPDATErow lock, merges inside the tx, checks the 64KiB cap before the write, and commits. Concurrent PATCHes now serialize — the second blocks on the lock and reads the first's committed result instead of a stale snapshot.UpdateEnvnow callsMergeStackEnvVarsonce. Error mapping preserved:ErrStackEnvVarsTooLarge→413env_too_large,ErrStackNotFound→404not_found, else 503persist_failed. Auditkeys_set = len(body.Env) - deletesmath preserved (a delete is counted only when the key was present, matching the old loop's effective behavior on the merged set).GetStackEnvVars+UpdateStackEnvVarsare kept (other call sites: promote, custom-domain, log streaming, redeploy).Coverage block
Test-harness note (the fault-injection gotcha)
The atomic merge collapses the old two DB-error surfaces (
fetch_failedfor the read,persist_failedfor the write) into onepersist_failedsurface. I chose option (b) — repurpose the two stack-env fault tests rather than rewire the sharedfaultConn— because it is lower blast radius (no change to the harness that ~15 other fault tests share).Notably, the merge's tx-internal
SELECT ... FOR UPDATEandUPDATEdo still fault through the sharedfaultConn: lib/pq'sdriver.Tximplements neitherQueryerContextnorExecerContext, sodatabase/sqlroutessql.Tx.QueryContext/ExecContextback through the conn-levelfaultConnmethods, which honor thefailAfterbudget. So:TestStackFinal2_UpdateEnv_MergeSelectFailed(failAfter=2) — faults the tx SELECT →persist_failed.TestStackFinal2_UpdateEnv_PersistFailed(failAfter=3) — faults the tx UPDATE →persist_failed.TestStackFinal2_UpdateEnv_MergeRowVanished_404(new) — a purpose-built localforUpdateVanishdriver returns zero rows only for theFOR UPDATEquery, soGetStackBySlugfinds the row but the merge sees none →ErrStackNotFound→ 404. Covers the handler's NotFound arm (a TOCTOU that can't be timed against a live row).TestStackUpdateEnv_MidHandler503(sweep) and all other stack mid-handler-503 fault tests stay green.I added an accurate comment to
faultConn.BeginTxdocumenting why in-tx queries remain fault-injectable.🤖 Generated with Claude Code