
fix(orphan-sweep): key stack-namespace liveness by slug not UUID id (P0 — every stack reaped)#105

Merged
mastermanas805 merged 1 commit into master from fix/stack-orphan-sweep-slug-key on Jun 10, 2026


@mastermanas805


P0 — every multi-service stack's k8s runtime silently deleted within ~15 min

PASS 5 of the orphan-sweep reconciler (sweepOrphanedStackNamespaces) deletes any instant-stack-<X> namespace whose <X> is not in the live-stack set returned by the old fetchLiveStackIDs. The namespace is instant-stack-{slug} (api migration 004: slug TEXT UNIQUE NOT NULL -- short ID used in namespace, URLs; slug is a SEPARATE column from the table's UUID id PK). But fetchLiveStackIDs ran SELECT id::text FROM stacks → the UUID set. The namespace token is the slug, so liveStackIDs[slug] was ALWAYS false → every live stack namespace was judged orphaned and DeleteNamespace'd on the next 15-min sweep. PASS 5 has no age grace (unlike PASS 3), so every stack — anonymous AND paid /stacks/new — was torn down within minutes of creation. The stacks DB row survived, so the status API kept reporting "healthy".
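The key mismatch can be reproduced in isolation. The following is a minimal sketch, not the reconciler's actual code: `stackRow`, `liveSet`, and `orphanedNamespaces` are hypothetical names, the UUID is made up, and the empty-token skip is an assumption about what the empty-slug guard does. Only the `instant-stack-` prefix and the id-versus-slug distinction come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// stackRow mirrors the two columns that matter here: the UUID primary key
// and the separate slug column that names the namespace. (Hypothetical type.)
type stackRow struct {
	ID   string
	Slug string
}

const stackNamespacePrefix = "instant-stack-"

// liveSet builds the membership set from whichever column key selects.
func liveSet(rows []stackRow, key func(stackRow) string) map[string]bool {
	set := make(map[string]bool, len(rows))
	for _, r := range rows {
		set[key(r)] = true
	}
	return set
}

// orphanedNamespaces returns the stack namespaces whose token (the part after
// the prefix) is missing from the live set. Namespaces without the prefix, or
// with an empty token, are skipped (assumed guard behavior).
func orphanedNamespaces(namespaces []string, live map[string]bool) []string {
	var orphans []string
	for _, ns := range namespaces {
		if !strings.HasPrefix(ns, stackNamespacePrefix) {
			continue
		}
		token := strings.TrimPrefix(ns, stackNamespacePrefix)
		if token == "" {
			continue
		}
		if !live[token] {
			orphans = append(orphans, ns)
		}
	}
	return orphans
}

func main() {
	// One live stack whose id (UUID) differs from its slug, as in production.
	rows := []stackRow{{ID: "3f0c2b1e-8d4a-4c6f-9b7e-2a1d5e6f7a8b", Slug: "stk-abc123"}}
	namespaces := []string{"instant-stack-stk-abc123", "instant-stack-stk-gone"}

	byID := liveSet(rows, func(r stackRow) string { return r.ID })
	bySlug := liveSet(rows, func(r stackRow) string { return r.Slug })

	fmt.Println("id-keyed orphans:  ", orphanedNamespaces(namespaces, byID))
	fmt.Println("slug-keyed orphans:", orphanedNamespaces(namespaces, bySlug))
}
```

With the id-keyed set, both namespaces are flagged, including the live one; with the slug-keyed set, only `instant-stack-stk-gone` is.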

The fix

fetchLiveStackIDs → fetchLiveStackSlugs, query changed to SELECT slug FROM stacks keyset-paginated on slug (index idx_stacks_slug, migration 004). The liveness set is now keyed by the token the namespace actually carries. Every safety property is intact: fail-open on list error, skip-the-pass on DB error (never delete off an empty/errored set), per-namespace failure isolation, metrics/audit emissions. Behavior change is ONLY: the membership set is now correctly keyed by slug so live stacks are recognized and NOT deleted.
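For readers unfamiliar with keyset pagination, here is a rough sketch of the loop shape, under stated assumptions: `pageFunc`, `fetchLiveSlugs`, `inMemoryPages`, and `failingPages` are hypothetical stand-ins, and `pageQuery` is a guess at the statement's shape rather than the SQL in the repo. The database is replaced by an in-memory page source so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// pageQuery is the assumed shape of a keyset-paginated slug query; the exact
// SQL used by fetchLiveStackSlugs is not shown in this PR text.
const pageQuery = `SELECT slug FROM stacks WHERE slug > $1 ORDER BY slug LIMIT $2`

// pageFunc stands in for executing pageQuery against the database.
type pageFunc func(after string, limit int) ([]string, error)

var errDBUnavailable = errors.New("db unavailable")

// fetchLiveSlugs walks the slug index page by page, resuming after the last
// slug seen. On any error it returns a nil set so the caller can skip the
// pass instead of deleting against a partial set.
func fetchLiveSlugs(fetch pageFunc, limit int) (map[string]bool, error) {
	if limit <= 0 {
		return nil, errors.New("limit must be positive")
	}
	live := make(map[string]bool)
	after := ""
	for {
		page, err := fetch(after, limit)
		if err != nil {
			return nil, fmt.Errorf("fetch live slugs after %q: %w", after, err)
		}
		for _, slug := range page {
			live[slug] = true
		}
		if len(page) < limit {
			return live, nil // a short page means the index is exhausted
		}
		after = page[len(page)-1]
	}
}

// inMemoryPages emulates the query over a sorted copy of slugs.
func inMemoryPages(slugs []string) pageFunc {
	sorted := append([]string(nil), slugs...)
	sort.Strings(sorted)
	return func(after string, limit int) ([]string, error) {
		start := sort.SearchStrings(sorted, after)
		for start < len(sorted) && sorted[start] <= after {
			start++ // strict "slug > after"
		}
		end := start + limit
		if end > len(sorted) {
			end = len(sorted)
		}
		return sorted[start:end], nil
	}
}

// failingPages emulates a database error on every page.
func failingPages(err error) pageFunc {
	return func(string, int) ([]string, error) { return nil, err }
}

func main() {
	live, err := fetchLiveSlugs(inMemoryPages([]string{"stk-c", "stk-a", "stk-b"}), 2)
	fmt.Println(len(live), err)

	if _, err := fetchLiveSlugs(failingPages(errDBUnavailable), 2); err != nil {
		fmt.Println("skipping PASS 5 this cycle:", err)
	}
}
```

Returning a nil set together with the error, rather than whatever was collected so far, is one way to keep the "never delete off an empty/errored set" property: the caller has nothing to compare namespaces against and must skip the pass.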

Coverage block (rule 17)

Symptom:        every instant-stack-{slug} namespace deleted ~minutes after creation
                (liveStackIDs[slug] always false: slug != UUID id)
Enumeration:    rg -n "fetchLiveStackIDs|liveStackIDs|ExpireStacksNamespacePrefix|instant-stack|ListStackNamespaces|id::text FROM stacks"
Sites found:    1 production bug site (fetchLiveStackIDs query + PASS 5 loop key),
                + 5 test call/regex sites that referenced the old name/query
Sites touched:  M == N. Production: orphan_sweep_reconciler.go (query → slug,
                rename fn+locals, fixed "Stack IDs are UUIDs" comment + SAFETY/
                FAIL-OPEN/batch-limit comments). metrics.go PASS-5 comment.
                Tests: orphan_sweep_reconciler_test.go (regression test rewritten
                to id != slug + new empty-slug guard test); deploy_lifecycle_
                coverage_test.go (4 fetchLiveStackSlugs call sites + query regexes).
                ExpireStacksNamespacePrefix == "instant-stack-" verified (unchanged).
                ExpireStacksWorker / expire_stacks.go selects id,slug,namespace and
                uses the DB namespace directly — NOT slug-keyed — out of scope.
Coverage test:  TestOrphanSweep_Pass5_ReclaimsOrphanedStackNamespace — seeds a row
                with id (UUID) != slug ("stk-abc123"), namespace instant-stack-stk-
                abc123. REDS on old id-keyed query (failure output shows BOTH the
                live and orphan namespaces deleted), GREENS on the slug-keyed fix.
                The old version of this test had ENCODED the bug by seeding id==slug.
                Plus TestOrphanSweep_Pass5_EmptySlugNamespaceSkipped (empty-slug guard).
Live verified:  prod observation — a real namespace was instant-stack-stk-4abb338c
                (slug "stk-4abb338c", NOT a UUID); the id-keyed set never contained
                "stk-4abb338c", so the live stack was reaped.

Local gate

  • go build ./... clean, go vet ./... clean
  • go test ./... -short -count=1 — all 16 packages green (root pkg incl., no graceful-shutdown flake)
  • internal/jobs coverage 97.7%; fetchLiveStackSlugs 100%; every changed executable production line verified covered by the full suite (manual per-line check vs coverprofile — diff-cover not installed locally; CI is authoritative)
  • Red-on-old / green-on-new proven by temporarily reverting the query to id::text and running the regression test (output showed both namespaces deleted), then restoring the fix.

Follow-ups (NOT fixed here — separate)

  1. Status-lies-healthy. api/internal/handlers/stack.go Get() reports "healthy" from the DB row without reconciling live k8s, so a torn-down stack still reads healthy. Needs a live-k8s reconcile (or a worker-driven status column) so the DB and cluster cannot diverge.
  2. PASS 5 age grace. PASS 5 has no grace window (unlike PASS 3's orphanNoDBRowGrace). Consider an age grace so an in-flight stack provision whose stacks INSERT is momentarily behind the namespace create cannot be reaped on a race.
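As an illustration of follow-up 2 only, an age grace could look roughly like the sketch below. Nothing here is in the codebase: `namespaceInfo`, `eligibleForReap`, `stackNamespaceGrace`, and the 10-minute value are all hypothetical, and how the namespace creation time would be obtained is left open.

```go
package main

import (
	"fmt"
	"time"
)

// stackNamespaceGrace is a made-up value; the PR does not propose a number.
const stackNamespaceGrace = 10 * time.Minute

// exampleSweepTime is a fixed "now" so the example is deterministic.
var exampleSweepTime = time.Date(2026, time.June, 10, 20, 0, 0, 0, time.UTC)

// namespaceInfo carries the two facts the check needs. (Hypothetical type.)
type namespaceInfo struct {
	Name      string
	CreatedAt time.Time
}

// eligibleForReap reports whether a namespace may be deleted: it must have no
// live row AND be at least grace old, so a provision whose INSERT lags the
// namespace create is not reaped on a race.
func eligibleForReap(ns namespaceInfo, live bool, now time.Time, grace time.Duration) bool {
	if live {
		return false
	}
	return now.Sub(ns.CreatedAt) >= grace
}

func main() {
	fresh := namespaceInfo{Name: "instant-stack-stk-new", CreatedAt: exampleSweepTime.Add(-2 * time.Minute)}
	stale := namespaceInfo{Name: "instant-stack-stk-old", CreatedAt: exampleSweepTime.Add(-time.Hour)}

	for _, ns := range []namespaceInfo{fresh, stale} {
		fmt.Println(ns.Name, eligibleForReap(ns, false, exampleSweepTime, stackNamespaceGrace))
	}
}
```

The trade-off is that a genuinely orphaned namespace lingers for up to one grace window plus one sweep interval before it is reclaimed.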

🤖 Generated with Claude Code

…P0 — every stack reaped)

PASS 5 of the orphan-sweep reconciler deletes any instant-stack-<X>
namespace whose <X> is not in the live-stack set. The namespace is
"instant-stack-{slug}" (api migration 004: slug TEXT UNIQUE NOT NULL,
a SEPARATE column from the table's UUID id PK; live prod evidence
namespace instant-stack-stk-4abb338c). But fetchLiveStackIDs queried
`SELECT id::text FROM stacks`, returning the UUID id set. The namespace
token is the slug, so liveStackIDs[slug] was ALWAYS false → every live
stack namespace was judged orphaned and DeleteNamespace'd on the next
15-min sweep. There is no age grace in PASS 5, so every multi-service
stack (anonymous AND paid /stacks/new) was torn down within ~minutes of
creation. The DB row survived, so the status API kept reporting healthy.

Fix: fetchLiveStackIDs → fetchLiveStackSlugs, querying `SELECT slug FROM
stacks` keyset-paginated on slug (idx_stacks_slug). The membership set is
now keyed by the token the namespace actually carries. All safety
properties intact: fail-open on list error, skip-pass on DB error,
per-namespace failure isolation, metrics/audit emissions.

Regression test: seed id != slug (the reality the old tests elided by
using id==slug); a live row's slug-named namespace is now recognized and
KEPT, while a slug with no row IS deleted. Reds on the old id-keyed query
(both namespaces deleted), greens on the fix. Corrected the existing
TestOrphanSweep_Pass5_ReclaimsOrphanedStackNamespace which had encoded
the bug by seeding id==slug. Added empty-slug guard coverage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 10, 2026 20:09
@mastermanas805 mastermanas805 merged commit 6692610 into master Jun 10, 2026
12 checks passed