Problem
updateStatus in internal/controller/nodedeployment/controller.go unconditionally rebuilds Replicas, ReadyReplicas, Nodes, PerPodServices, Endpoints, Phase, and NetworkingStatus on every reconcile, then patches against statusBase via client.MergeFromWithOptimisticLock. The merge patch carries no status changes (so the apiserver doesn't bump resourceVersion) only if every one of those fields serializes identically across reconciles. populatePerPodServices and composeEndpoints are reconstruction paths whose output stability isn't asserted by any test. A single slice-ordering change or map-iteration-order dependency in either function silently turns every 30s reconcile (the statusPollInterval) into a real status write. Discovery today is kubectl get snd -w plus eyeballing lastTransitionTime deltas; there is no automated guard.
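To illustrate the failure mode, here is a minimal sketch of how a slice built by ranging over a map can change order between calls, and how sorting makes it stable. The PerPodService type and both functions are hypothetical stand-ins, not the actual populatePerPodServices code.

```go
package main

import (
	"fmt"
	"sort"
)

// PerPodService is a hypothetical stand-in for one entry of status.PerPodServices.
type PerPodService struct {
	Pod     string
	Service string
}

// unstableServices ranges over a map directly, so the resulting slice order
// is unspecified and can differ between calls with identical input.
func unstableServices(byPod map[string]string) []PerPodService {
	out := make([]PerPodService, 0, len(byPod))
	for pod, svc := range byPod {
		out = append(out, PerPodService{Pod: pod, Service: svc})
	}
	return out
}

// stableServices sorts by pod name, so identical input always serializes
// identically and a merge patch against the previous status carries no diff.
func stableServices(byPod map[string]string) []PerPodService {
	out := unstableServices(byPod)
	sort.Slice(out, func(i, j int) bool { return out[i].Pod < out[j].Pod })
	return out
}

func main() {
	byPod := map[string]string{"node-2": "svc-2", "node-0": "svc-0", "node-1": "svc-1"}
	fmt.Println(stableServices(byPod)) // prints "[{node-0 svc-0} {node-1 svc-1} {node-2 svc-2}]"
}
```

Go leaves map iteration order unspecified, so the unsorted variant may pass casual inspection and still produce a different slice on the next reconcile.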
Impact
At fleet scale (~50 SNDs at a 30s poll), one slice-ordering regression takes the controller from 0 to ~1.7 PATCH/s (50 writes every 30s) of apiserver audit volume, plus kube_seinodedeployment_status_condition metric churn, plus event-stream noise that confuses on-call during incident response. The doctrine codified in #329 promises always-present conditions; for that promise to stay operationally invisible, the merge patch must carry no changes in steady state.
Proposed approach
Add a focused envtest TestReconcile_SteadyState_NoStatusPatch that:
- Creates an SND, waits for it to settle (Phase=Ready, all conditions stable)
- Records the SND's resourceVersion
- Asserts via Consistently over ~5-10s that the version stays put while the controller is reconciling on the 30s tick
A deliberate non-stable iteration order in populatePerPodServices or composeEndpoints should break the test on first run.
Acceptance criteria
- TestReconcile_SteadyState_NoStatusPatch exists and passes against current code
- A deliberately non-stable iteration order in populatePerPodServices or composeEndpoints causes the test to fail
- The test runs in the make test-integration lane
Out of scope
- Production counter distinguishing patch vs no-op via Prometheus. Filed separately so the operational signal can land independently of the build-time guard.
Relevant experts
- sre-engineer: surfaced this in the adversarial review of PR #333 (refactor(snd): fold GenesisCeremonyNeeded into GenesisCeremonyComplete; make spec.genesis immutable)
- kubernetes-specialist: envtest harness
References
- … GenesisCeremonyComplete, which raises the cost of any steady-state status regression)
- … "populatePerPodServices and composeEndpoints are reconstruction paths whose output stability isn't asserted by a test. One slice-ordering change and every 30s reconcile becomes a real patch. That test is the operational contract."