fix(flux): spread controllers across workers to prevent GitOps deadlock #1659

Open
devantler wants to merge 1 commit into main from fix/flux-controller-topology-spread

@devantler
Contributor

Problem

The four Flux controllers (source-controller, kustomize-controller, helm-controller, notification-controller) are single-replica Deployments with no topology spread, so the scheduler is free to stack them on one worker.

During the 2026-05-28/29 prod instability, kustomize-controller and flux-operator were both on prod-worker-2 when that node's Cilium ClusterIP datapath degraded after an OOMKill. kustomize-controller then crash-looped on:

Get "https://10.96.0.1:443/api": dial tcp 10.96.0.1:443: i/o timeout

…and GitOps reconciliation stalled. Because reconciliation was down, the already-merged fix for the underlying OOM (#1649) could not be applied — Cilium/SPIRE stayed in BestEffort QoS, the cluster stayed broken, and the CD deploy failed its health gate. A single bad worker decapitated reconciliation: a deadlock GitOps cannot self-heal from.

Fix

Add a soft topologySpreadConstraint to every controller via the prod FluxInstance.spec.kustomize.patches:

  • maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway
  • labelSelector keyed on app.kubernetes.io/part-of=flux — since each controller is single-replica, keying on the shared label spreads the set across nodes (≈2/1/1 over three workers) rather than spreading replicas of one Deployment.

ScheduleAnyway makes the constraint soft: the scheduler treats the spread as a preference and never blocks scheduling on the capacity-constrained 3-worker cluster.
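
For reference, the patch described above might look roughly like the following. This is a sketch, not the exact diff: the metadata name and namespace are assumptions, unrelated FluxInstance fields (such as spec.distribution) are omitted, and the target selector mirrors the label named in this description.

```yaml
# Sketch only: metadata values are assumed and unrelated
# FluxInstance fields are left out for brevity.
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux              # assumed name
  namespace: flux-system  # assumed namespace
spec:
  kustomize:
    patches:
      # Select every Deployment labelled as part of Flux.
      - target:
          kind: Deployment
          labelSelector: app.kubernetes.io/part-of=flux
        # JSON6902 patch: add a soft spread constraint to the pod spec.
        patch: |
          - op: add
            path: /spec/template/spec/topologySpreadConstraints
            value:
              - maxSkew: 1
                topologyKey: kubernetes.io/hostname
                whenUnsatisfiable: ScheduleAnyway
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/part-of: flux
```

The constraint's labelSelector matches pod labels rather than Deployment labels, so this sketch assumes the controller pod templates also carry app.kubernetes.io/part-of=flux.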

Why not flux-operator too

flux-operator is also single-replica, but it already carries a chart-managed nodeAffinity (kubernetes.io/os=linux) that an affinity override would clobber, and its downtime does not stop the controllers from reconciling (it only reconciles the FluxInstance CR itself). Left out deliberately to keep this change focused and low-risk.

Validation

  • kubectl kustomize .../flux-instance/ builds; the rendered FluxInstance carries the patch.
  • Proved the JSON6902 patch actually injects topologySpreadConstraints by applying the identical patch to a sample part-of: flux Deployment in a standalone kustomize build (the FluxInstance's inner patches are applied by flux-operator at runtime, not by kubectl kustomize, so this was verified out-of-band).
  • kubectl kustomize k8s/clusters/local/ still builds (prod-only change).
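
The out-of-band check in the second bullet could be reproduced with a throwaway kustomization along these lines. The file names, the sample Deployment, and its image are hypothetical stand-ins, not files from this repository.

```yaml
# File 1: kustomization.yaml (in a hypothetical scratch directory)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - sample-deployment.yaml
patches:
  # Same target and JSON6902 operation as the FluxInstance inner patch.
  - target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
    patch: |
      - op: add
        path: /spec/template/spec/topologySpreadConstraints
        value:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app.kubernetes.io/part-of: flux
---
# File 2: sample-deployment.yaml (save separately; hypothetical workload)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-controller
  labels:
    app.kubernetes.io/part-of: flux
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-controller
  template:
    metadata:
      labels:
        app: sample-controller
        app.kubernetes.io/part-of: flux
    spec:
      containers:
        - name: manager
          image: registry.example/sample:latest  # placeholder image
```

Running kubectl kustomize against that directory should print the Deployment with topologySpreadConstraints present under spec.template.spec, which is the property the validation step relies on.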

Scope

Preventative (blast-radius reduction). Does not resolve the active outage on its own — that needs prod-worker-2's Cilium datapath rebuilt so reconciliation recovers first.

The four Flux controllers (source/kustomize/helm/notification) are
single-replica Deployments with no topology spread, so the scheduler can
stack them on one worker.

On 2026-05-28 kustomize-controller was running on prod-worker-2 when
that node's Cilium ClusterIP datapath degraded after an OOMKill; it then
crash-looped on "dial tcp 10.96.0.1:443: i/o timeout" and GitOps
reconciliation stalled, so the fix for the underlying OOM (#1649) could
not even be applied. A single bad worker decapitated reconciliation: a
deadlock GitOps cannot self-heal from.

Add a soft topologySpreadConstraint (maxSkew 1, ScheduleAnyway, keyed on
app.kubernetes.io/part-of=flux) to every controller via the prod
FluxInstance kustomize.patches, so the set spreads across the three
workers. Soft (ScheduleAnyway) so it never blocks scheduling on the
capacity-constrained cluster. Verified with a standalone kustomize build
that the JSON6902 patch injects the constraint as intended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a soft topology spread constraint to the four Flux controllers in the prod FluxInstance so they distribute across worker nodes, preventing a single bad worker from taking down GitOps reconciliation.

Changes:

  • Add JSON6902 patch in prod FluxInstance targeting Deployments with app.kubernetes.io/part-of=flux to set topologySpreadConstraints (maxSkew=1, hostname, ScheduleAnyway).

@devantler
Contributor Author

The 🧪 System Test failure here is an unrelated CI infrastructure flake, not this change:

Re-ran the failed job; no code change needed.

@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
Labels

None yet

Projects

Status: 🫴 Ready

2 participants