fix(flux): spread controllers across workers to prevent GitOps deadlock by devantler · Pull Request #1659 · devantler-tech/platform

devantler · 2026-05-29T13:44:03Z

Problem

The four Flux controllers (source-controller, kustomize-controller, helm-controller, notification-controller) are single-replica Deployments with no topology spread, so the scheduler is free to stack them on one worker.

During the 2026-05-28/29 prod instability, kustomize-controller and flux-operator were both on prod-worker-2 when that node's Cilium ClusterIP datapath degraded after an OOMKill. kustomize-controller then crash-looped on:

Get "https://10.96.0.1:443/api": dial tcp 10.96.0.1:443: i/o timeout

…and GitOps reconciliation stalled. Because reconciliation was down, the already-merged fix for the underlying OOM (#1649) could not be applied — Cilium/SPIRE stayed in BestEffort QoS, the cluster stayed broken, and the CD deploy failed its health gate. A single bad worker decapitated reconciliation: a deadlock GitOps cannot self-heal from.

Fix

Add a soft topologySpreadConstraint to every controller via the prod FluxInstance.spec.kustomize.patches:

maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway
labelSelector keyed on app.kubernetes.io/part-of=flux — since each controller is single-replica, keying on the shared label spreads the set across nodes (≈2/1/1 over three workers) rather than spreading replicas of one Deployment.

ScheduleAnyway makes the constraint soft: it expresses a preference and never blocks scheduling on the capacity-constrained 3-worker cluster.
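
For illustration, the patch described above could look roughly like the following. This is a sketch, not the PR's actual diff: it assumes the flux-operator `FluxInstance` API (`fluxcd.controlplane.io/v1`), the metadata values are assumed, and other required fields such as `spec.distribution` are omitted. Because a topology spread constraint's labelSelector matches pod labels, it also assumes the controller pods themselves carry app.kubernetes.io/part-of=flux.

```yaml
# Hypothetical excerpt of the prod FluxInstance; only the patch is shown.
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux              # assumed
  namespace: flux-system  # assumed
spec:
  kustomize:
    patches:
      # Target every Deployment carrying the shared Flux label.
      - target:
          kind: Deployment
          labelSelector: app.kubernetes.io/part-of=flux
        # JSON6902 patch: add a soft spread constraint to the pod template.
        patch: |
          - op: add
            path: /spec/template/spec/topologySpreadConstraints
            value:
              - maxSkew: 1
                topologyKey: kubernetes.io/hostname
                whenUnsatisfiable: ScheduleAnyway
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/part-of: flux
```

Keying the inner labelSelector on the shared label, rather than on a per-controller label, is what makes the scheduler count all four controller pods together when it computes skew per node.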

Why not flux-operator too

flux-operator is also single-replica, but it already carries a chart-managed nodeAffinity (kubernetes.io/os=linux) that an affinity override would clobber, and its downtime does not stop the controllers from reconciling (it only reconciles the FluxInstance CR itself). Left out deliberately to keep this change focused and low-risk.

Validation

kubectl kustomize .../flux-instance/ builds; the rendered FluxInstance carries the patch.
Proved the JSON6902 patch actually injects topologySpreadConstraints by applying the identical patch to a sample part-of: flux Deployment in a standalone kustomize build (the FluxInstance's inner patches are applied by flux-operator at runtime, not by kubectl kustomize, so this was verified out-of-band).
kubectl kustomize k8s/clusters/local/ still builds (prod-only change).
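
The out-of-band check can be sketched as a throwaway kustomization along these lines. The file names and the sample Deployment are hypothetical; the PR does not show the files that were actually used.

```yaml
# kustomization.yaml (hypothetical standalone check, not part of the repo)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Hypothetical sample Deployment labeled app.kubernetes.io/part-of: flux
  - sample-deployment.yaml
patches:
  # Same target and JSON6902 body as the FluxInstance inner patch.
  - target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
    patch: |-
      - op: add
        path: /spec/template/spec/topologySpreadConstraints
        value:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app.kubernetes.io/part-of: flux
```

Running kubectl kustomize in that directory should render the sample Deployment with topologySpreadConstraints under spec.template.spec if the patch applies as intended.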

Scope

Preventative (blast-radius reduction). Does not resolve the active outage on its own — that needs prod-worker-2's Cilium datapath rebuilt so reconciliation recovers first.

The four Flux controllers (source/kustomize/helm/notification) are single-replica Deployments with no topology spread, so the scheduler can stack them on one worker.

On 2026-05-28 kustomize-controller was on prod-worker-2 when that node's Cilium ClusterIP datapath degraded after an OOMKill; it then crash-looped on "dial tcp 10.96.0.1:443: i/o timeout" and GitOps reconciliation stalled, so the fix for the underlying OOM (#1649) could not even be applied. A single bad worker decapitated reconciliation: a deadlock GitOps cannot self-heal from.

Add a soft topologySpreadConstraint (maxSkew 1, ScheduleAnyway, keyed on app.kubernetes.io/part-of=flux) to every controller via the prod FluxInstance kustomize.patches, so the set spreads across the three workers. Soft (ScheduleAnyway) so it never blocks scheduling on the capacity-constrained cluster.

Verified with a standalone kustomize build that the JSON6902 patch injects the constraint as intended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a soft topology spread constraint to the four Flux controllers in the prod FluxInstance so they distribute across worker nodes, preventing a single bad worker from taking down GitOps reconciliation.

Changes:

Add JSON6902 patch in prod FluxInstance targeting Deployments with app.kubernetes.io/part-of=flux to set topologySpreadConstraints (maxSkew=1, hostname, ScheduleAnyway).

devantler · 2026-05-29T15:38:09Z

The 🧪 System Test failure here is an unrelated CI infrastructure flake, not this change:

Root cause: cilium-bd4qf hit ErrImagePull → ImagePullBackOff on a containerd content-store digest mismatch (unexpected commit digest … failed precondition) pulling quay.io/cilium/cilium. That node's CNI never came up, which cascaded into FailedMount secret/configmap cache-sync timeouts and Helm releases hitting context canceled at the reconcile deadline (dex/cdi/cnpg/cert-manager). There was also a transient local-local-registry … MANIFEST_UNKNOWN tag=dev during bring-up.
This change is hetzner-only (the prod FluxInstance); the local system test runs the docker provider's FluxInstance, which is untouched here.
The two sibling PRs, #1658 (fix(longhorn): run CSI attacher/provisioner/resizer with 2 replicas for HA) and #1660 (fix(cilium): keep spire-server off the Flux-controller node (soft anti-affinity)), passed the identical System Test on the same base commit; this PR (#1659) just drew the image-pull flake.

Re-ran the failed job; no code change needed.

Copilot AI review requested due to automatic review settings May 29, 2026 13:44

github-project-automation Bot added this to 🌊 Project Board May 29, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 29, 2026

Copilot started reviewing on behalf of devantler May 29, 2026 13:44

devantler had a problem deploying to ci May 29, 2026 13:44 — with GitHub Actions Failure

Copilot AI reviewed May 29, 2026

devantler mentioned this pull request May 29, 2026

fix(cilium): keep spire-server off the Flux-controller node (soft anti-affinity) #1660

Closed

devantler marked this pull request as ready for review May 29, 2026 13:55

devantler enabled auto-merge May 29, 2026 15:33

devantler temporarily deployed to ci May 29, 2026 15:38 — with GitHub Actions Inactive

devantler added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

devantler wants to merge 1 commit into main from fix/flux-controller-topology-spread