fix(longhorn): run CSI attacher/provisioner/resizer with 2 replicas for HA by devantler · Pull Request #1658 · devantler-tech/platform

devantler · 2026-05-29T13:39:54Z

Problem

The Longhorn CSI control-plane sidecars (csi-attacher, csi-provisioner, csi-resizer) default to 1 replica in the base HelmRelease (${longhorn_csi_*_replicas:=1}), and the prod variables ConfigMap never overrode them — so each is a single point of failure.
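
For illustration, the default described above could look roughly like the fragment below. This is a sketch under assumptions, not the repository's actual manifest: the metadata names and the apiVersion are hypothetical, and the chart/interval fields are omitted. The `csi.*ReplicaCount` keys are the Longhorn chart's values for these sidecars, and `${var:=default}` is Flux's post-build substitution syntax for a variable with a fallback.

```yaml
# Hypothetical fragment of the base HelmRelease (names and apiVersion are assumptions).
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: longhorn              # assumed name
  namespace: longhorn-system  # assumed namespace
spec:
  # chart, interval, etc. omitted for brevity
  values:
    csi:
      # Each value falls back to 1 when the cluster's variables
      # ConfigMap does not define the corresponding key.
      attacherReplicaCount: ${longhorn_csi_attacher_replicas:=1}
      provisionerReplicaCount: ${longhorn_csi_provisioner_replicas:=1}
      resizerReplicaCount: ${longhorn_csi_resizer_replicas:=1}
```

With no override in prod, every key resolves to its fallback, which is how each sidecar ended up with a single replica.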

During the 2026-05-28/29 prod instability, the sole csi-attacher/csi-provisioner replicas happened to sit on prod-worker-2, whose Cilium ClusterIP datapath had degraded after an OOMKill. With the only replica unreachable, volume attach/detach orchestration failed cluster-wide:

FailedAttachVolume ×107 (one of the two warnings that tripped the CD deploy health gate, which in turn failed the deploy of the QoS fix in #1649, "fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS")
pods stuck in ContainerCreating (e.g. kubescape/storage for 7h+)

Fix

Set longhorn_csi_attacher_replicas, longhorn_csi_provisioner_replicas, and longhorn_csi_resizer_replicas to "2" in the prod variables ConfigMap.
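
As a rough sketch, the override amounts to three entries in the ConfigMap's data. The ConfigMap name and namespace here are hypothetical placeholders; only the three keys and the value "2" come from this PR.

```yaml
# Hypothetical sketch of the prod variables ConfigMap (metadata is assumed).
apiVersion: v1
kind: ConfigMap
metadata:
  name: variables         # assumed name
  namespace: flux-system  # assumed namespace
data:
  # ConfigMap values must be strings, hence the quoted "2".
  longhorn_csi_attacher_replicas: "2"
  longhorn_csi_provisioner_replicas: "2"
  longhorn_csi_resizer_replicas: "2"
```

The values are quoted because ConfigMap data only accepts strings; the substitution step then injects them into the HelmRelease values as written.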

These sidecars are leader-elected, so the second replica is an idle warm standby on another node — it keeps CSI functioning through a single-node outage at near-zero cost. This matches the existing "2 replicas for HA" convention already applied to cert-manager, metrics-server, KEDA, and external-secrets in the same file.

csi-snapshotter is intentionally left at 0 (its VolumeSnapshot CRDs aren't installed).

Note on scope

This is preventative (blast-radius reduction). It does not by itself resolve the active outage — that requires rebuilding prod-worker-2's Cilium BPF datapath (operator action) so the already-merged QoS fix (#1649) can deploy. Like every other manifest change right now, this PR can only take effect once GitOps reconciliation on prod recovers.

Validation

kubectl kustomize k8s/clusters/prod/variables/ builds; the three new vars render correctly alongside the existing HA vars.
kubectl kustomize k8s/clusters/local/ and k8s/clusters/prod/ both build.

…or HA

The Longhorn CSI control-plane sidecars (csi-attacher, csi-provisioner, csi-resizer) default to a single replica in the base HelmRelease, and the prod variables ConfigMap never overrode them — so each ran as a single point of failure.

On 2026-05-28 the sole csi-attacher/provisioner happened to sit on prod-worker-2, whose Cilium ClusterIP datapath had degraded after an OOMKill. With the only replica unreachable, volume attach/detach orchestration failed cluster-wide (FailedAttachVolume storms, pods stuck in ContainerCreating, and the CD deploy health gate tripping on "FailedAttachVolume ×107").

These sidecars are leader-elected, so a second replica is an idle warm standby on another node — cheap insurance that keeps CSI functioning through a single-node outage. Matches the existing "2 replicas for HA" pattern already applied to cert-manager, metrics-server, KEDA and external-secrets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR reduces Longhorn CSI control-plane single points of failure in the prod cluster by overriding the default replica counts for the CSI sidecars to run with warm-standby redundancy.

Changes:

Set longhorn_csi_attacher_replicas, longhorn_csi_provisioner_replicas, and longhorn_csi_resizer_replicas to "2" in the prod variables ConfigMap.
Document the operational rationale for running these leader-elected sidecars with 2 replicas for HA.

Copilot AI review requested due to automatic review settings May 29, 2026 13:39

github-project-automation Bot added this to 🌊 Project Board May 29, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 29, 2026

devantler temporarily deployed to ci May 29, 2026 13:40 — with GitHub Actions Inactive

Copilot started reviewing on behalf of devantler May 29, 2026 13:40

Copilot AI reviewed May 29, 2026

devantler marked this pull request as ready for review May 29, 2026 13:55

devantler added this pull request to the merge queue May 29, 2026