Emit controller-side counter distinguishing status patch vs no-op #336

@bdchatham

Problem

The SeiNodeDeployment controller calls updateStatus on every reconcile, which generates a merge patch via client.MergeFromWithOptimisticLock. In steady state this merge patch carries no field-level change (the apiserver short-circuits the write), but no Prometheus signal distinguishes "patch contained changes" from "patch was a no-op." On-call has no way to answer "is this SND being over-stamped right now?" except by running kubectl get -w and eyeballing lastTransitionTime deltas.

Impact

When a future regression turns steady-state reconciles into actual status writes — slice-ordering drift, a new condition that isn't latched correctly, an out-of-band CRD schema change — the signal is invisible until apiserver audit-log volume or kube_seinodedeployment_status_condition metric churn raises alarms hours later. With a controller-emitted counter, the SRE answer is a 30-second PromQL query. This is the operational mirror of the build-time guard in #335.

Relevant experts

  • opentelemetry-expert — instrumentation in controller code
  • sre-engineer — PromQL queries + alerting
  • observability-platform-engineer — recording rule if needed

Proposed approach

Instrument updateStatus in internal/controller/nodedeployment/controller.go to compute the merge-patch body, count by whether it's empty, and emit a Prometheus counter:

seinodedeployment_status_patches_total{namespace, name, result="noop|changed"}

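The patch body is computed by the Data() method of the client.Patch value that client.MergeFrom returns; the client calls it internally during Patch(). Call Data() ahead of the Patch() call so we can inspect the body. Because the optimistic-lock variant writes metadata.resourceVersion into every patch as a precondition, the check should ignore that field rather than rely on body length alone. The counter goes through whatever Prometheus registry the controller already uses (controller-runtime's built-in metrics endpoint).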
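A minimal sketch of the no-op check, using only the standard library. The helper name classifyPatch and the sample status field readyReplicas are hypothetical and not taken from the controller; the sketch assumes the patch body is a JSON merge patch and that metadata.resourceVersion, when present, is only the optimistic-lock precondition and should not count as a change.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// classifyPatch reports whether a JSON merge-patch body carries a real change.
// It returns "noop" or "changed", matching the proposed result label.
//
// Assumption: metadata.resourceVersion is only the optimistic-lock
// precondition, so a patch containing nothing else is treated as a no-op.
func classifyPatch(body []byte) (string, error) {
	// A zero-length (or whitespace-only) body cannot change anything.
	if len(bytes.TrimSpace(body)) == 0 {
		return "noop", nil
	}

	var patch map[string]any
	if err := json.Unmarshal(body, &patch); err != nil {
		return "", fmt.Errorf("decode merge patch: %w", err)
	}

	// Drop the resourceVersion precondition, then drop metadata if that
	// was the only key it held.
	if meta, ok := patch["metadata"].(map[string]any); ok {
		delete(meta, "resourceVersion")
		if len(meta) == 0 {
			delete(patch, "metadata")
		}
	}

	if len(patch) == 0 {
		return "noop", nil
	}
	return "changed", nil
}

func main() {
	// Illustrative patch bodies; the status field name is made up.
	samples := []string{
		`{}`,
		`{"metadata":{"resourceVersion":"12345"}}`,
		`{"metadata":{"resourceVersion":"12345"},"status":{"readyReplicas":3}}`,
	}
	for _, s := range samples {
		result, err := classifyPatch([]byte(s))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-7s %s\n", result, s)
	}
}
```

In updateStatus, the body would presumably come from calling Data() on the patch before handing the same patch to the status client, and the returned string would become the result label on the counter alongside namespace and name. Whether a decode error should be counted, logged, or ignored is left open here.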

Equivalent counter on the SeiNode controller is a natural sibling — defer until this lands and the pattern is proven.

Acceptance criteria

  • Counter increments on every reconcile that calls updateStatus, labeled by patch outcome
  • PromQL rate(seinodedeployment_status_patches_total{result="changed"}[5m]) per SND returns a stable low rate in steady state and a measurable spike during legitimate state transitions
  • Metric is registered with the controller-runtime Prometheus registry (no separate scrape endpoint)

Out of scope

  • Envtest assertion that steady-state patches are no-ops at the wire level. Filed as #335 ("Add envtest asserting steady-state status patches are no-ops"). That issue catches in-tree regressions at build time; this counter catches out-of-band drift in production.
  • Alert rules on the counter. Separate effort once the metric exists and we know what "abnormal" looks like in practice.
  • Equivalent counter for SeiNode controller. Follow-up after the pattern is validated here.
