fix(flux): spread controllers across workers to prevent GitOps deadlock #1659
Open
devantler wants to merge 1 commit into
The four Flux controllers (source/kustomize/helm/notification) are single-replica Deployments with no topology spread, so the scheduler can stack them on one worker.

On 2026-05-28 kustomize-controller was running on prod-worker-2 when that node's Cilium ClusterIP datapath degraded after an OOMKill. It then crash-looped on "dial tcp 10.96.0.1:443: i/o timeout" and GitOps reconciliation stalled, so the fix for the underlying OOM (#1649) could not even be applied. A single bad worker decapitated reconciliation: a deadlock GitOps cannot self-heal from.

Add a soft topologySpreadConstraint (maxSkew 1, ScheduleAnyway, keyed on app.kubernetes.io/part-of=flux) to every controller via the prod FluxInstance kustomize.patches, so the set spreads across the three workers. The constraint is soft (ScheduleAnyway) so it never blocks scheduling on the capacity-constrained cluster.

Verified with a standalone kustomize build that the JSON6902 patch injects the constraint as intended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Pull request overview
Adds a soft topology spread constraint to the four Flux controllers in the prod FluxInstance so they distribute across worker nodes, preventing a single bad worker from taking down GitOps reconciliation.
Changes:
- Add JSON6902 patch in prod FluxInstance targeting Deployments with app.kubernetes.io/part-of=flux to set topologySpreadConstraints (maxSkew=1, hostname, ScheduleAnyway).
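As a rough illustration of the change summarized above, the patch might look like the sketch below. The actual diff is not reproduced on this page, so the metadata, the omitted spec fields, and the exact JSON6902 path are assumptions; only the target label and the constraint values (maxSkew, topologyKey, whenUnsatisfiable, labelSelector) are taken from the PR description.

```yaml
# Sketch only: metadata and layout are assumed, not copied from the PR diff.
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux              # assumed name
  namespace: flux-system  # assumed namespace
spec:
  # distribution, components, sync, etc. omitted for brevity
  kustomize:
    patches:
      # Select every Flux controller Deployment via the shared label.
      - target:
          kind: Deployment
          labelSelector: app.kubernetes.io/part-of=flux
        # JSON6902 "add" sets the field (it would replace an existing value).
        patch: |
          - op: add
            path: /spec/template/spec/topologySpreadConstraints
            value:
              - maxSkew: 1
                topologyKey: kubernetes.io/hostname
                # Soft constraint: a preference, never a scheduling blocker.
                whenUnsatisfiable: ScheduleAnyway
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/part-of: flux
```

Because the labelSelector matches the shared part-of label rather than a per-controller label, the scheduler counts all four controller pods together when computing skew, which is what spreads single-replica Deployments across nodes.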
Contributor
Author
The
Re-ran the failed job; no code change needed.
Problem
The four Flux controllers (source-controller, kustomize-controller, helm-controller, notification-controller) are single-replica Deployments with no topology spread, so the scheduler is free to stack them on one worker.

During the 2026-05-28/29 prod instability, kustomize-controller and flux-operator were both on prod-worker-2 when that node's Cilium ClusterIP datapath degraded after an OOMKill. kustomize-controller then crash-looped on:

dial tcp 10.96.0.1:443: i/o timeout

…and GitOps reconciliation stalled. Because reconciliation was down, the already-merged fix for the underlying OOM (#1649) could not be applied; Cilium/SPIRE stayed in BestEffort QoS, the cluster stayed broken, and the CD deploy failed its health gate. A single bad worker decapitated reconciliation: a deadlock GitOps cannot self-heal from.

Fix
Add a soft topologySpreadConstraint to every controller via the prod FluxInstance.spec.kustomize.patches:
- maxSkew: 1, topologyKey: kubernetes.io/hostname, whenUnsatisfiable: ScheduleAnyway
- labelSelector keyed on app.kubernetes.io/part-of=flux. Since each controller is single-replica, keying on the shared label spreads the set across nodes (≈2/1/1 over three workers) rather than spreading replicas of one Deployment.
- ScheduleAnyway (soft) means it expresses a preference and never blocks scheduling on the capacity-constrained 3-worker cluster.

Why not flux-operator too
flux-operator is also single-replica, but it already carries a chart-managed nodeAffinity (kubernetes.io/os=linux) that an affinity override would clobber, and its downtime does not stop the controllers from reconciling (it only reconciles the FluxInstance CR itself). Left out deliberately to keep this change focused and low-risk.

Validation
- kubectl kustomize .../flux-instance/ builds; the rendered FluxInstance carries the patch.
- Verified that the patch injects topologySpreadConstraints by applying the identical patch to a sample part-of: flux Deployment in a standalone kustomize build (the FluxInstance's inner patches are applied by flux-operator at runtime, not by kubectl kustomize, so this was verified out-of-band).
- kubectl kustomize k8s/clusters/local/ still builds (prod-only change).

Scope
Preventative (blast-radius reduction). Does not resolve the active outage on its own; that needs prod-worker-2's Cilium datapath rebuilt so reconciliation recovers first.
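
For readers who want to repeat the out-of-band check described under Validation, a standalone build could be set up roughly as follows. This is a minimal sketch, not the author's actual test fixture: the file names, the sample Deployment, and its image are hypothetical, and the two YAML documents below are meant to be saved as separate files in one directory.

```yaml
# file: kustomization.yaml (hypothetical file name)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - sample-deployment.yaml
patches:
  # Same target and JSON6902 operation as the FluxInstance inner patch.
  - target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
    patch: |
      - op: add
        path: /spec/template/spec/topologySpreadConstraints
        value:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app.kubernetes.io/part-of: flux
---
# file: sample-deployment.yaml (hypothetical stand-in for a Flux controller)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-controller
  labels:
    app.kubernetes.io/part-of: flux
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-controller
  template:
    metadata:
      labels:
        app: sample-controller
        app.kubernetes.io/part-of: flux
    spec:
      containers:
        - name: manager
          image: registry.example/sample-controller:latest  # placeholder image
```

Running kubectl kustomize against that directory should render the Deployment with topologySpreadConstraints under spec.template.spec if the target selector and path are correct. It only exercises the patch itself, not flux-operator's runtime application of it.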