Skip to content

fix(deploy): guarded CAS on redeploy 'building' flip — close TOCTOU (#14)#241

Merged
mastermanas805 merged 1 commit into
masterfrom
fix/redeploy-building-cas-2026-06-04
Jun 4, 2026
Merged

fix(deploy): guarded CAS on redeploy 'building' flip — close TOCTOU (#14)#241
mastermanas805 merged 1 commit into
masterfrom
fix/redeploy-building-cas-2026-06-04

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

What

Closes sweep finding #14 (P3, TOCTOU): redeploy unconditionally flips an expired/deleted deploy back to building.

Both redeploy entry points — POST /deploy/:id/redeploy and POST /deploy/new (redeploy=true) — read the deployment row, assert it's non-terminal (IsDeploymentTerminal / FindActiveDeploymentByTeamEnvName), then flip it to building. Between the read and the flip, the DeploymentExpirer / teardown reconciler can reap the row to expired/deleted. The old unconditional UpdateDeploymentStatus would resurrect that reaped, over-TTL / over-cap workload back to building.

Fix

New models.MarkDeploymentBuilding performs the flip as a guarded CAS:

UPDATE deployments SET status='building', error_message=NULL, updated_at=now()
WHERE id=$1 AND status IN ('building','deploying','healthy','failed')

returning rows-affected — mirroring the existing CAS guards on MarkDeploymentTornDown and SetDeploymentTTL. Both handlers now branch on the result:

  • 0 rows409 deployment_not_redeployable (reaped concurrently — do not re-arm)
  • driver error → log + continue (non-determinate; runRedeployAsync reconciles status later)
  • 1 row → proceed to async build

New metric label: instant_deploy_redeploy_total{outcome="not_redeployable"}.

Coverage block

Symptom:        redeploy resurrects an expired/deleted deploy back to 'building' (TOCTOU)
Enumeration:    grep "building" status flips in deploy.go + UpdateDeploymentStatus call sites
Sites found:    2  (in-place /deploy/new redeploy=true; POST /deploy/:id/redeploy)
Sites touched:  2
Coverage test:  TestMarkDeploymentBuilding_Branches (1-row / 0-row / driver),
                TestDeployNew_Redeploy_CASMiss_Returns409,
                TestDeployRedeploy_ByID_CASMiss_Returns409,
                TestDeployRedeploy_ByID_CASSuccess_Returns202,
                TestDeployRedeploy_ByID_CASDriverError_StillAccepts,
                TestDeployNew_Redeploy_UpdateStatusError_StillAccepts (updated)
Live verified:  awaiting post-merge auto-deploy (rule 14 build-SHA gate in CI)

100% patch coverage (every new branch in deploy.go + MarkDeploymentBuilding exercised). go vet clean; gofmt clean.

🤖 Generated with Claude Code

)

Both redeploy entry points (POST /deploy/:id/redeploy and POST /deploy/new
redeploy=true) read the deployment row, assert it is non-terminal, then flip
it to 'building'. Between the read and the flip the DeploymentExpirer /
teardown reconciler can reap the row to 'expired'/'deleted'. The old
unconditional UpdateDeploymentStatus would resurrect that reaped, over-TTL /
over-cap workload back to 'building'.

New models.MarkDeploymentBuilding does the flip as a guarded compare-and-swap
(WHERE status IN ('building','deploying','healthy','failed')) and returns
rows-affected — mirroring the CAS guards already on MarkDeploymentTornDown and
SetDeploymentTTL. Both handlers now: 0 rows -> 409 deployment_not_redeployable
(reaped concurrently, do not re-arm); driver error -> log + continue
(non-determinate; runRedeployAsync reconciles); 1 row -> proceed.

New metric label DeployRedeployInPlaceTotal{outcome="not_redeployable"}.

Coverage:
  Symptom:        redeploy resurrects an expired/deleted deploy to 'building'
  Enumeration:    rg -F 'UpdateDeploymentStatus(c.Context(), h.db' + "building" flips in deploy.go
  Sites found:    2  (in-place /deploy/new redeploy=true; POST /deploy/:id/redeploy)
  Sites touched:  2
  Coverage test:  TestMarkDeploymentBuilding_Branches (1-row/0-row/driver),
                  TestDeployNew_Redeploy_CASMiss_Returns409,
                  TestDeployRedeploy_ByID_CASMiss_Returns409,
                  TestDeployRedeploy_ByID_CASSuccess_Returns202,
                  TestDeployRedeploy_ByID_CASDriverError_StillAccepts,
                  TestDeployNew_Redeploy_UpdateStatusError_StillAccepts (updated)
  Live verified:  awaiting post-merge auto-deploy (rule 14 build-SHA gate in CI)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 4, 2026 15:20
@mastermanas805 mastermanas805 merged commit d0f05fa into master Jun 4, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant