Add envtest asserting steady-state status patches are no-ops #335

@bdchatham

Problem

updateStatus in internal/controller/nodedeployment/controller.go unconditionally rebuilds Replicas, ReadyReplicas, Nodes, PerPodServices, Endpoints, Phase, and NetworkingStatus on every reconcile, then patches against statusBase via client.MergeFromWithOptimisticLock. The patch carries no status changes, and the apiserver doesn't bump resourceVersion, only if every one of those fields serializes identically across reconciles. populatePerPodServices and composeEndpoints are reconstruction paths whose output stability isn't asserted by any test. A single change in slice ordering or map iteration order in either function silently turns every 30s reconcile (the statusPollInterval) into a real status write. Today the only way to notice is kubectl get snd -w plus eyeballing lastTransitionTime deltas; there is no automated guard.
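To make the stability requirement concrete, here is a minimal stdlib-only sketch. The Endpoint type, composeStable, and sameJSON are hypothetical names for illustration and are not the controller's actual code. It assumes the rebuilt slice is derived from a map and shows one way to make its serialization repeatable: iterate in sorted-key order, then compare marshaled bytes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// Endpoint is a hypothetical stand-in for one entry in the SND status.
type Endpoint struct {
	Pod  string `json:"pod"`
	Addr string `json:"addr"`
}

// composeStable builds the endpoint slice in sorted-key order, so repeated
// calls over the same input serialize to identical bytes.
func composeStable(addrs map[string]string) []Endpoint {
	pods := make([]string, 0, len(addrs))
	for pod := range addrs {
		pods = append(pods, pod)
	}
	sort.Strings(pods)

	out := make([]Endpoint, 0, len(pods))
	for _, pod := range pods {
		out = append(out, Endpoint{Pod: pod, Addr: addrs[pod]})
	}
	return out
}

// sameJSON reports whether two values marshal to identical bytes, which is
// the condition under which a status patch has nothing to change.
func sameJSON(a, b any) bool {
	ja, errA := json.Marshal(a)
	jb, errB := json.Marshal(b)
	return errA == nil && errB == nil && bytes.Equal(ja, jb)
}

func main() {
	addrs := map[string]string{
		"node-2": "10.0.0.2",
		"node-0": "10.0.0.0",
		"node-1": "10.0.0.1",
	}
	// Two independent rebuilds of the same input compare equal.
	fmt.Println(sameJSON(composeStable(addrs), composeStable(addrs)))
}
```

If the sort were dropped and the slice appended straight from the map range, Go's randomized map iteration would make the two rebuilds differ on some runs, which is the regression this issue wants a test to catch.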

Impact

At fleet scale (~50 SNDs at a 30s poll), one slice-ordering regression takes the controller from 0 to ~1.7 PATCH/s of apiserver audit volume, and adds kube_seinodedeployment_status_condition metric churn and event-stream noise that confuses on-call during incident response. The doctrine codified in #329 promises always-present conditions; for that promise to be operationally invisible, the status patch must remain a no-op in steady state.
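The write rate follows directly from the fleet size and poll interval. As a rough back-of-the-envelope sketch (patchRate and writesPerDay are hypothetical helper names, and the model assumes every reconcile of every SND produces exactly one real PATCH):

```go
package main

import "fmt"

// patchRate returns status writes per second if every reconcile of every
// object results in a real PATCH.
func patchRate(objects int, pollSeconds float64) float64 {
	return float64(objects) / pollSeconds
}

// writesPerDay returns the same figure accumulated over 24 hours.
func writesPerDay(objects, pollSeconds int) int {
	return objects * 86400 / pollSeconds
}

func main() {
	fmt.Printf("%.2f PATCH/s\n", patchRate(50, 30)) // 1.67 PATCH/s
	fmt.Println(writesPerDay(50, 30))               // 144000
}
```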

Proposed approach

Add a focused envtest TestReconcile_SteadyState_NoStatusPatch that:

  • Creates an SND, waits for it to settle (Phase=Ready, all conditions stable)
  • Records the SND's resourceVersion
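  • Asserts via Consistently over ~5-10s that the resourceVersion does not change while the controller keeps reconciling (the window has to cover at least one reconcile, so with a 30s tick the test would need a shortened poll interval or an explicitly triggered reconcile)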

A deliberate non-stable iteration order in populatePerPodServices or composeEndpoints should break the test on first run.

Acceptance criteria

  • TestReconcile_SteadyState_NoStatusPatch exists and passes against current code
  • Deliberately introducing a non-stable map iteration in populatePerPodServices or composeEndpoints causes the test to fail
  • Test runs in the existing make test-integration lane

Out of scope

  • Production counter distinguishing patch vs no-op via Prometheus. Filed separately so the operational signal can land independently of the build-time guard.
