Skip to content

fix(noderesource): clear Status.StatefulSet when SyncStatefulSet deletes an impostor#348

Merged
bdchatham merged 1 commit into
mainfrom
fix/sts-impostor-clear-tracking
May 21, 2026
Merged

fix(noderesource): clear Status.StatefulSet when SyncStatefulSet deletes an impostor#348
bdchatham merged 1 commit into
mainfrom
fix/sts-impostor-clear-tracking

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

Closes a cross-reconcile churn loop in the impostor-recovery path introduced by #345. Cursor review caught it.

The bug

SyncStatefulSet's impostor branch deletes the live StatefulSet and returns (nil, nil), expecting the next reconcile to observe NotFound and Apply fresh. The controller wrapper sees nil and returns without updating Status.StatefulSet.

But reconcile N continues past the wrapper into ExecutePlan. The apply-statefulset task calls SyncStatefulSet again. Status.StatefulSet still tracks the stale UID, the live object is gone, the impostor check falls through (NotFound case), and Apply creates fresh with a new UID. End of reconcile N: live STS exists, but Status.StatefulSet still points at the original stale UID.

Reconcile N+1: wrapper sees stale UID ≠ new live UID, deletes the fresh STS as another impostor, the task creates another, loop. Brief pod outage on every cycle.

The fix

When SyncStatefulSet deletes an impostor, also clear node.Status.StatefulSet. Any subsequent SyncStatefulSet call — same reconcile or the next — routes through the adopt-on-observe path that already exists (Status.StatefulSet == nil → Get live STS → adopt → Apply idempotent → caller stamps Status). No misclassification, no loop.

~2 lines of code + a focused unit test.

Test plan

  • go build ./... clean
  • go test ./... — all unit tests pass
  • make test-integration — envtest 56s green
  • Extended TestSyncStatefulSet_ImpostorThenApply: asserts Status.StatefulSet == nil after the impostor delete
  • New TestSyncStatefulSet_ImpostorClearPreventsChurn: directly exercises the cross-reconcile scenario — impostor delete, fresh Apply within the same pass, then a third Sync call adopts rather than re-deleting

🤖 Generated with Claude Code

…tes an impostor

Cross-reconcile churn loop, caught by Cursor review of #345.

When SyncStatefulSet detects a UID mismatch between the tracked
Status.StatefulSet and the live object, it Deletes the live impostor
and returns (nil, nil) so the next reconcile re-Applies fresh. The
controller wrapper sees nil and returns without touching Status.

The bug: reconcile N continues past the impostor delete into
ExecutePlan, where the apply-statefulset task calls SyncStatefulSet
again. Status.StatefulSet still holds the stale UID, but the live
object is gone — so the call falls through the impostor check
(NotFound case) and Applies fresh with a new UID. Reconcile N ends
with the new live StatefulSet but Status.StatefulSet still pointing
at the original stale UID. Reconcile N+1 sees the mismatch, deletes
the fresh StatefulSet as another impostor, the task creates another
one, loop. Brief pod outage on every cycle.

Fix: when SyncStatefulSet deletes the impostor, also clear
node.Status.StatefulSet. Any subsequent SyncStatefulSet — in the same
reconcile or the next — routes through the adopt-on-observe path
instead of mis-classifying the fresh StatefulSet.

Coverage:
- TestSyncStatefulSet_ImpostorThenApply extended: asserts
  Status.StatefulSet is nil after reconcile 1's impostor delete.
- TestSyncStatefulSet_ImpostorClearPreventsChurn (new): directly
  exercises the cross-reconcile scenario — impostor delete in
  reconcile N, fresh Apply still in reconcile N, then a third Sync
  call adopts the fresh STS rather than re-deleting it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

PR Summary

Medium Risk
Medium risk because it changes reconciliation state handling for StatefulSet impostor recovery; bugs here can cause unintended deletes or reconcile loops, though the change is small and covered by targeted tests.

Overview
Prevents a cross-reconcile churn loop in SyncStatefulSet by clearing node.Status.StatefulSet when an impostor StatefulSet (UID mismatch) is deleted, so subsequent syncs take the adopt/apply path instead of repeatedly re-deleting newly applied objects.

Updates unit tests to assert the status ref is cleared after impostor deletion and adds a new test that simulates multiple SyncStatefulSet calls across and within reconciles to ensure the freshly applied StatefulSet is adopted rather than treated as another impostor.

Reviewed by Cursor Bugbot for commit bad5aaa. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham bdchatham merged commit 8701d65 into main May 21, 2026
5 checks passed
bdchatham added a commit that referenced this pull request May 21, 2026
Rolls forward the controller image pin from 7caf9ba to 8701d65 so the
platform's `?ref=main` pull picks up the SeiNode StatefulSet refactor
landed in #345 (typed SSA sync, Status.StatefulSet tracking,
pause-scales-to-zero) and the impostor-clear fix in #348.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant