feat(node): track StatefulSet on Status, switch to typed Get+Update sync, pause scales to 0#345
Conversation
…ync, pause scales to 0
The SeiNode controller now owns the StatefulSet lifecycle continuously
via a typed Get/Create/Update path, with the resulting object's
{Name, UID} recorded on Status.StatefulSet so subsequent reconciles
fetch and mutate the tracked identity directly rather than blindly
re-applying.
Lifecycle:
- First reconcile (Status.StatefulSet=nil): adopt an existing
StatefulSet with the expected name if one exists; otherwise create
one. Record {Name, UID}.
- Tracked reconciles: fetch by name; UID mismatch (out-of-band
recreation) triggers Delete+Create+re-track; NotFound (manual
delete) triggers Create+re-track.
- Replicas: Spec.Paused=true forces Replicas=0; the controller is the
authoritative writer.
The apply-statefulset plan task and the controller both call the new
noderesource.SyncStatefulSet helper, so there is one source of truth
for STS lifecycle. The task drops its server-side-apply pattern and
the corresponding nolint comment (which flagged the migration to a
typed model as a deferred effort).
Pause-scales-to-zero is exercised end-to-end: SeiNode.Spec.Paused
flips the live StatefulSet's replicas to 0; clearing the field
restores them.
Coverage:
- TestSeiNode_StatefulSetTrackedOnStatus — Status.StatefulSet
recorded with {Name, UID} on first reconcile.
- TestSeiNode_Paused_ScalesStatefulSetToZero — pause drops
StatefulSet.Spec.Replicas to 0, unpause restores to 1.
- TestSeiNode_StatefulSetRecreatedAfterDelete — out-of-band STS
deletion is reconciled and the new UID is re-recorded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR SummaryMedium Risk Overview The node controller now syncs the StatefulSet on every reconcile via new
Reviewed by Cursor Bugbot for commit dd2b4ba. Bugbot is set up for automated code reviews on this repo. Configure here. |
…recovery, expanded coverage The K8s expert review caught a defaulting-churn bug and converged on a cleaner impostor-branch shape. Defaulting churn: - The previous mutate path used apiequality.Semantic.DeepEqual on Spec.Template, which surfaced apiserver-applied PodSpec defaults (RestartPolicy, DNSPolicy, etc.) as divergence. Every reconcile then wrote nil-defaults, the apiserver re-defaulted, the STS bumped generation, and the rollout envtests deadlocked with ReadyReplicas=0. - Switched the writer to Server-Side Apply on the typed *appsv1.StatefulSet — defaulting-immune by design. SSA only compares fields the fieldManager owns, so apiserver defaults never appear as divergence. Same fieldOwner=seinode-controller used by sibling tasks. Impostor recovery (two reconciles): - Reconcile N: detect UID mismatch, Delete the impostor, return (nil, nil) so callers requeue without writing Status. - Reconcile N+1: Owns(StatefulSet) watch fires on the delete event, the reconciler observes NotFound, the Apply creates fresh, callers stamp the new UID onto Status.StatefulSet. - Background propagation (the default) is correct: PVCs from volumeClaimTemplates don't carry BlockOwnerDeletion ownerRefs back to the STS, and PersistentVolumeClaimRetentionPolicy defaults to WhenDeleted=Retain. Foreground would add a finalizer that envtest's apiserver can't drain (no kube-controller-manager). Other review items addressed: - client.FieldOwner on Create/Update so managedFields entries unify with sibling task fieldOwner usage rather than fragmenting. - Spec.Paused doc updated: scales to Replicas=0, no effect on PhaseFailed. - Drop the re-Get after Apply: client.Patch decodes the apiserver response into desired in-place; the prior re-Get hit a cache-lag NotFound on the first Apply. Coverage: - internal/noderesource/sync_test.go (new): 6 unit tests on a fake client covering the SyncStatefulSet branches (create, adopt, update, paused-scales-to-zero, NotFound recreate, impostor delete-then-defer). - envtest TestSeiNode_StatefulSetImpostorReplaced: end-to-end exercise of the two-reconcile pattern; uses Status().Update() to forge the UID mismatch because client.MergeFrom on a pointer-typed sub-struct produces an effectively-empty patch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI lint failures from PR #345: - goconst on sync_test.go: extract testNamespace, impostorName, steadyStateName, originalTestUID, impostorTestUID. - modernize: ptr.To(int32(0)) -> new(int32). Drops the unused k8s.io/utils/ptr import. - staticcheck SA1019: client.Apply patch type is deprecated. Re-add the same nolint the original task carried — migrating to typed ApplyConfiguration builders is a substantial separate refactor. verify-generated failure: - New StatefulSetRef type needs DeepCopy generation. Ran `make generate`; updates api/v1alpha1/zz_generated.deepcopy.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ae5aecc. Configure here.
| return fmt.Errorf("syncing statefulset: %w", err) | ||
| } | ||
| if sts == nil { | ||
| return nil |
There was a problem hiding this comment.
Impostor delete races with task creating orphaned StatefulSet
Medium Severity
When reconcileStatefulSet detects an impostor (UID mismatch), it deletes it and returns nil without updating Status.StatefulSet. The controller then continues into the plan executor. If the active task is apply-statefulset, it calls SyncStatefulSet again — this time the Get returns NotFound (impostor just deleted), so it falls through and creates a fresh StatefulSet. However, Status.StatefulSet still holds the original stale UID (only the controller's reconcileStatefulSet stamps it, and it skipped that on nil). On the next reconcile, the controller sees the task-created StatefulSet as another impostor and deletes it, causing unnecessary churn and a brief pod outage.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit ae5aecc. Configure here.
Column alignment in the const () block was off by one space. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tes an impostor (#348) Cross-reconcile churn loop, caught by Cursor review of #345. When SyncStatefulSet detects a UID mismatch between the tracked Status.StatefulSet and the live object, it Deletes the live impostor and returns (nil, nil) so the next reconcile re-Applies fresh. The controller wrapper sees nil and returns without touching Status. The bug: reconcile N continues past the impostor delete into ExecutePlan, where the apply-statefulset task calls SyncStatefulSet again. Status.StatefulSet still holds the stale UID, but the live object is gone — so the call falls through the impostor check (NotFound case) and Applies fresh with a new UID. Reconcile N ends with the new live StatefulSet but Status.StatefulSet still pointing at the original stale UID. Reconcile N+1 sees the mismatch, deletes the fresh StatefulSet as another impostor, the task creates another one, loop. Brief pod outage on every cycle. Fix: when SyncStatefulSet deletes the impostor, also clear node.Status.StatefulSet. Any subsequent SyncStatefulSet — in the same reconcile or the next — routes through the adopt-on-observe path instead of mis-classifying the fresh StatefulSet. Coverage: - TestSyncStatefulSet_ImpostorThenApply extended: asserts Status.StatefulSet is nil after reconcile 1's impostor delete. - TestSyncStatefulSet_ImpostorClearPreventsChurn (new): directly exercises the cross-reconcile scenario — impostor delete in reconcile N, fresh Apply still in reconcile N, then a third Sync call adopts the fresh STS rather than re-deleting it. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Rolls forward the controller image pin from 7caf9ba to 8701d65 so the platform's `?ref=main` pull picks up the SeiNode StatefulSet refactor landed in #345 (typed SSA sync, Status.StatefulSet tracking, pause-scales-to-zero) and the impostor-clear fix in #348. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>


Summary
Two changes paired:
Status.StatefulSet={Name, UID}. The controller and theapply-statefulsettask both callnoderesource.SyncStatefulSet, which fetches the live StatefulSet by name and either creates it, updates it field-by-field, or replaces it if an impostor (UID mismatch) is detected. Server-side apply is removed from this code path; the nolint flagging the deferred ApplyConfiguration migration goes away with it.Spec.Paused=true, the SeiNode controller'sreconcileStatefulSetrenders the desired STS withReplicas=0and the typed Update lands. Pods stop. Unpause restoresReplicas=1. The controller is the authoritative writer of replicas.Architecture
Lifecycle paths
Mutable fields covered
Spec.Replicas(the pause path),Spec.Template,Spec.UpdateStrategy,Spec.MinReadySeconds,Labels,Annotations. Immutable fields (Selector,ServiceName,VolumeClaimTemplates,PodManagementPolicy) are left alone so a deterministic-but-not-byte-identical render never trips the apiserver's immutability check.Test plan
go build ./...cleango test ./...— all unit tests passmake test-integration— envtest 55.9s green, no regression in existing rollout / supersession / stuck / pause testsseinode_statefulset_test.go:TestSeiNode_StatefulSetTrackedOnStatus—Status.StatefulSet={Name, UID}recorded after first reconcileTestSeiNode_Paused_ScalesStatefulSetToZero— pause flips live STS toReplicas=0, unpause restoresTestSeiNode_StatefulSetRecreatedAfterDelete— operator-deleted STS is recreated, new UID re-trackedOut of scope
apply-statefulsetplan task entirely. It still runs (now via the typed sync helper) and remains the plan executor's sync gate forreplace-pod. Its eventual removal needs a replacement sync mechanism for the rollout plan; tracking separately.🤖 Generated with Claude Code