fix(longhorn): run CSI attacher/provisioner/resizer with 2 replicas for HA #1658
Open
devantler wants to merge 1 commit into
…or HA

The Longhorn CSI control-plane sidecars (csi-attacher, csi-provisioner, csi-resizer) default to a single replica in the base HelmRelease, and the prod variables ConfigMap never overrode them, so each ran as a single point of failure.

On 2026-05-28 the sole csi-attacher/provisioner happened to sit on prod-worker-2, whose Cilium ClusterIP datapath had degraded after an OOMKill. With the only replica unreachable, volume attach/detach orchestration failed cluster-wide (FailedAttachVolume storms, pods stuck in ContainerCreating, and the CD deploy health gate tripping on "FailedAttachVolume ×107").

These sidecars are leader-elected, so a second replica is an idle warm standby on another node: cheap insurance that keeps CSI functioning through a single-node outage. Matches the existing "2 replicas for HA" pattern already applied to cert-manager, metrics-server, KEDA and external-secrets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
This PR reduces Longhorn CSI control-plane single points of failure in the prod cluster by overriding the default replica counts for the CSI sidecars to run with warm-standby redundancy.
Changes:
- Set longhorn_csi_attacher_replicas, longhorn_csi_provisioner_replicas, and longhorn_csi_resizer_replicas to "2" in the prod variables ConfigMap.
- Document the operational rationale for running these leader-elected sidecars with 2 replicas for HA.
Problem
The Longhorn CSI control-plane sidecars (csi-attacher, csi-provisioner, csi-resizer) default to 1 replica in the base HelmRelease (${longhorn_csi_*_replicas:=1}), and the prod variables ConfigMap never overrode them, so each is a single point of failure.

During the 2026-05-28/29 prod instability, the sole csi-attacher/csi-provisioner replicas happened to sit on prod-worker-2, whose Cilium ClusterIP datapath had degraded after an OOMKill. With the only replica unreachable, volume attach/detach orchestration failed cluster-wide:
- FailedAttachVolume ×107 (one of the two warnings that tripped the CD deploy health gate, failing the deploy of the QoS fix in "fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS" #1649)
- Pods stuck in ContainerCreating (e.g. kubescape/storage for 7h+)

Fix
Set longhorn_csi_attacher_replicas, longhorn_csi_provisioner_replicas, and longhorn_csi_resizer_replicas to "2" in the prod variables ConfigMap.

These sidecars are leader-elected, so the second replica is an idle warm standby on another node; it keeps CSI functioning through a single-node outage at near-zero cost. This matches the existing "2 replicas for HA" convention already applied to cert-manager, metrics-server, KEDA, and external-secrets in the same file.
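
For illustration, the override described above might look roughly like the following. This is a minimal sketch, not the actual diff: the ConfigMap name and namespace are hypothetical, and only the three variable names and the "2" values are taken from this PR.

```yaml
# Hypothetical excerpt of the prod variables ConfigMap.
# metadata.name and metadata.namespace are assumptions for illustration only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: variables          # assumed name
  namespace: flux-system   # assumed namespace
data:
  # Leader-elected CSI sidecars: one active replica plus one warm standby.
  # ConfigMap data values must be strings, hence the quotes.
  longhorn_csi_attacher_replicas: "2"
  longhorn_csi_provisioner_replicas: "2"
  longhorn_csi_resizer_replicas: "2"
```

When a variable is absent from the ConfigMap, the base HelmRelease falls back to its ${longhorn_csi_*_replicas:=1} default, which is why prod ran a single replica of each sidecar until now.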
csi-snapshotter is intentionally left at 0 (its VolumeSnapshot CRDs aren't installed).

Note on scope
This is preventative (blast-radius reduction). It does not by itself resolve the active outage; that requires rebuilding prod-worker-2's Cilium BPF datapath (operator action) so the already-merged QoS fix (#1649) can deploy. Like every other manifest change right now, this PR can only take effect once GitOps reconciliation on prod recovers.

Validation
- kubectl kustomize k8s/clusters/prod/variables/ builds; the three new vars render correctly alongside the existing HA vars.
- kubectl kustomize k8s/clusters/local/ and k8s/clusters/prod/ both build.