fix(noderesource): clear Status.StatefulSet when SyncStatefulSet deletes an impostor#348
Conversation
…tes an impostor Cross-reconcile churn loop, caught by Cursor review of #345. When SyncStatefulSet detects a UID mismatch between the tracked Status.StatefulSet and the live object, it Deletes the live impostor and returns (nil, nil) so the next reconcile re-Applies fresh. The controller wrapper sees nil and returns without touching Status. The bug: reconcile N continues past the impostor delete into ExecutePlan, where the apply-statefulset task calls SyncStatefulSet again. Status.StatefulSet still holds the stale UID, but the live object is gone — so the call falls through the impostor check (NotFound case) and Applies fresh with a new UID. Reconcile N ends with the new live StatefulSet but Status.StatefulSet still pointing at the original stale UID. Reconcile N+1 sees the mismatch, deletes the fresh StatefulSet as another impostor, the task creates another one, loop. Brief pod outage on every cycle. Fix: when SyncStatefulSet deletes the impostor, also clear node.Status.StatefulSet. Any subsequent SyncStatefulSet — in the same reconcile or the next — routes through the adopt-on-observe path instead of mis-classifying the fresh StatefulSet. Coverage: - TestSyncStatefulSet_ImpostorThenApply extended: asserts Status.StatefulSet is nil after reconcile 1's impostor delete. - TestSyncStatefulSet_ImpostorClearPreventsChurn (new): directly exercises the cross-reconcile scenario — impostor delete in reconcile N, fresh Apply still in reconcile N, then a third Sync call adopts the fresh STS rather than re-deleting it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR SummaryMedium Risk Overview Updates unit tests to assert the status ref is cleared after impostor deletion and adds a new test that simulates multiple Reviewed by Cursor Bugbot for commit bad5aaa. Bugbot is set up for automated code reviews on this repo. Configure here. |
Rolls forward the controller image pin from 7caf9ba to 8701d65 so the platform's `?ref=main` pull picks up the SeiNode StatefulSet refactor landed in #345 (typed SSA sync, Status.StatefulSet tracking, pause-scales-to-zero) and the impostor-clear fix in #348. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Closes a cross-reconcile churn loop in the impostor-recovery path introduced by #345. Cursor review caught it.
The bug
SyncStatefulSet's impostor branch deletes the live StatefulSet and returns(nil, nil), expecting the next reconcile to observe NotFound and Apply fresh. The controller wrapper sees nil and returns without updatingStatus.StatefulSet.But reconcile N continues past the wrapper into
ExecutePlan. Theapply-statefulsettask callsSyncStatefulSetagain.Status.StatefulSetstill tracks the stale UID, the live object is gone, the impostor check falls through (NotFound case), andApplycreates fresh with a new UID. End of reconcile N: live STS exists, butStatus.StatefulSetstill points at the original stale UID.Reconcile N+1: wrapper sees stale UID ≠ new live UID, deletes the fresh STS as another impostor, the task creates another, loop. Brief pod outage on every cycle.
The fix
When
SyncStatefulSetdeletes an impostor, also clearnode.Status.StatefulSet. Any subsequentSyncStatefulSetcall — same reconcile or the next — routes through the adopt-on-observe path that already exists (Status.StatefulSet == nil→ Get live STS → adopt → Apply idempotent → caller stamps Status). No misclassification, no loop.~2 lines of code + a focused unit test.
Test plan
go build ./...cleango test ./...— all unit tests passmake test-integration— envtest 56s greenTestSyncStatefulSet_ImpostorThenApply: assertsStatus.StatefulSet == nilafter the impostor deleteTestSyncStatefulSet_ImpostorClearPreventsChurn: directly exercises the cross-reconcile scenario — impostor delete, fresh Apply within the same pass, then a third Sync call adopts rather than re-deleting🤖 Generated with Claude Code