fix(orphan-sweep): key stack-namespace liveness by slug not UUID id (P0 — every stack reaped)#105
Merged
…P0 — every stack reaped)
PASS 5 of the orphan-sweep reconciler deletes any instant-stack-<X>
namespace whose <X> is not in the live-stack set. The namespace is
"instant-stack-{slug}" (api migration 004: slug TEXT UNIQUE NOT NULL,
a SEPARATE column from the table's UUID id PK; live prod evidence
namespace instant-stack-stk-4abb338c). But fetchLiveStackIDs queried
`SELECT id::text FROM stacks`, returning the UUID id set. The namespace
token is the slug, so liveStackIDs[slug] was ALWAYS false → every live
stack namespace was judged orphaned and DeleteNamespace'd on the next
15-min sweep. There is no age grace in PASS 5, so every multi-service
stack (anonymous AND paid /stacks/new) was torn down within ~15 minutes
of creation. The DB row survived, so the status API kept reporting
healthy even though the namespace was gone.
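
To make the mismatch concrete, here is a minimal sketch of the membership check. It is not the reconciler's actual code: `isOrphan` and `stackNamespacePrefix` are hypothetical names, and the UUID is made up. Only the namespace name and the slug come from the prod evidence above.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix taken from the "instant-stack-{slug}" naming in the commit message.
const stackNamespacePrefix = "instant-stack-"

// isOrphan reports whether a stack namespace's token is missing from the
// live set. Non-stack namespaces and empty tokens are never orphans.
// (Hypothetical helper, for illustration only.)
func isOrphan(namespace string, live map[string]bool) bool {
	if !strings.HasPrefix(namespace, stackNamespacePrefix) {
		return false
	}
	token := strings.TrimPrefix(namespace, stackNamespacePrefix)
	if token == "" {
		return false
	}
	return !live[token]
}

func main() {
	ns := "instant-stack-stk-4abb338c"

	// Set as the old query built it: UUID ids (this UUID is made up).
	liveByID := map[string]bool{"3f2b8c1e-5d4a-4e6f-9a7b-1c2d3e4f5a6b": true}
	// Set as the fixed query builds it: slugs.
	liveBySlug := map[string]bool{"stk-4abb338c": true}

	fmt.Println(isOrphan(ns, liveByID))   // true: live stack misjudged as orphaned
	fmt.Println(isOrphan(ns, liveBySlug)) // false: live stack recognized
}
```

The lookup itself never changes between the two cases; only the column used to build the set does, which is why the bug was invisible at the call site.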
Fix: fetchLiveStackIDs → fetchLiveStackSlugs, querying `SELECT slug FROM
stacks` keyset-paginated on slug (idx_stacks_slug). The membership set is
now keyed by the token the namespace actually carries. All safety
properties intact: fail-open on list error, skip-pass on DB error,
per-namespace failure isolation, metrics/audit emissions.
Regression test: seed id != slug (the reality the old tests elided by
using id==slug); a live row's slug-named namespace is now recognized and
KEPT, while a slug with no row IS deleted. Reds on the old id-keyed query
(both namespaces deleted), greens on the fix. Corrected the existing
TestOrphanSweep_Pass5_ReclaimsOrphanedStackNamespace which had encoded
the bug by seeding id==slug. Added empty-slug guard coverage.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
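
The regression scenario from the commit message can be sketched as follows. This is an illustration with in-memory data, not the repository's test: `stackRow`, `liveSet`, and `orphans` are hypothetical names, the UUID is made up, and the empty-token handling is one plausible reading of the empty-slug guard.

```go
package main

import (
	"fmt"
	"strings"
)

const stackNamespacePrefix = "instant-stack-"

// stackRow models the two columns that matter here (hypothetical type).
type stackRow struct {
	ID   string // UUID primary key
	Slug string // short ID embedded in the namespace name
}

// liveSet builds a membership set from rows using the column picked by key.
func liveSet(rows []stackRow, key func(stackRow) string) map[string]bool {
	set := make(map[string]bool, len(rows))
	for _, r := range rows {
		if k := key(r); k != "" { // empty keys are never treated as live
			set[k] = true
		}
	}
	return set
}

// orphans returns the stack namespaces whose token is absent from live.
// Namespaces without the prefix, or with an empty token, are skipped.
func orphans(namespaces []string, live map[string]bool) []string {
	var out []string
	for _, ns := range namespaces {
		token := strings.TrimPrefix(ns, stackNamespacePrefix)
		if token == ns || token == "" {
			continue
		}
		if !live[token] {
			out = append(out, ns)
		}
	}
	return out
}

func main() {
	// Seed id != slug, which is what production rows look like.
	rows := []stackRow{{ID: "3f2b8c1e-5d4a-4e6f-9a7b-1c2d3e4f5a6b", Slug: "stk-live"}}
	namespaces := []string{
		"instant-stack-stk-live", // has a row: must be kept
		"instant-stack-stk-gone", // no row: must be deleted
		"instant-stack-",         // empty token: must be skipped
		"kube-system",            // not a stack namespace
	}

	byID := liveSet(rows, func(r stackRow) string { return r.ID })
	bySlug := liveSet(rows, func(r stackRow) string { return r.Slug })

	fmt.Println(orphans(namespaces, byID))   // [instant-stack-stk-live instant-stack-stk-gone]
	fmt.Println(orphans(namespaces, bySlug)) // [instant-stack-stk-gone]
}
```

Seeding id == slug would make both calls return the same list, which is how the earlier test could pass while encoding the bug.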
P0 — every multi-service stack's k8s runtime silently deleted within ~15 min
PASS 5 of the orphan-sweep reconciler (`sweepOrphanedStackNamespaces`) deletes any `instant-stack-<X>` namespace whose `<X>` is not in the live-stack set returned by the old `fetchLiveStackIDs`. The namespace is `instant-stack-{slug}` (api migration 004: `slug TEXT UNIQUE NOT NULL -- short ID used in namespace, URLs`; `slug` is a SEPARATE column from the table's UUID `id` PK). But `fetchLiveStackIDs` ran `SELECT id::text FROM stacks` → the UUID set. The namespace token is the slug, so `liveStackIDs[slug]` was ALWAYS false → every live stack namespace was judged orphaned and `DeleteNamespace`'d on the next 15-min sweep. PASS 5 has no age grace (unlike PASS 3), so every stack (anonymous AND paid `/stacks/new`) was torn down within minutes of creation. The `stacks` DB row survived, so the status API kept reporting "healthy".

The fix
`fetchLiveStackIDs` → `fetchLiveStackSlugs`, query changed to `SELECT slug FROM stacks` keyset-paginated on `slug` (index `idx_stacks_slug`, migration 004). The liveness set is now keyed by the token the namespace actually carries. Every safety property is intact: fail-open on list error, skip-the-pass on DB error (never delete off an empty/errored set), per-namespace failure isolation, metrics/audit emissions. Behavior change is ONLY: the membership set is now correctly keyed by slug so live stacks are recognized and NOT deleted.

Coverage block (rule 17)
Local gate
- `go build ./...` clean, `go vet ./...` clean
- `go test ./... -short -count=1`: all 16 packages green (root pkg incl., no graceful-shutdown flake)
- `internal/jobs` coverage 97.7%; `fetchLiveStackSlugs` 100%; every changed executable production line verified covered by the full suite (manual per-line check vs coverprofile; diff-cover not installed locally; CI is authoritative)
- Regression test confirmed red by reverting the query to `id::text` and running the regression test (output showed both namespaces deleted), then restoring the fix.

Follow-ups (NOT fixed here; separate)
- `api/internal/handlers/stack.go` `Get()` reports "healthy" from the DB row without reconciling live k8s, so a torn-down stack still reads healthy. Needs a live-k8s reconcile (or a worker-driven status column) so the DB and cluster cannot diverge.
- PASS 5 has no age grace (unlike PASS 3; see `orphanNoDBRowGrace`). Consider an age grace so an in-flight stack provision whose `stacks` INSERT is momentarily behind the namespace create cannot be reaped on a race.

🤖 Generated with Claude Code