fix(longhorn): run CSI attacher/provisioner/resizer with 2 replicas for HA #1658

Open
devantler wants to merge 1 commit into main from fix/longhorn-csi-controller-ha

@devantler
Contributor

Problem

The Longhorn CSI control-plane sidecars (csi-attacher, csi-provisioner, csi-resizer) default to 1 replica in the base HelmRelease (${longhorn_csi_*_replicas:=1}), and the prod variables ConfigMap never overrode them — so each is a single point of failure.
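For context, a minimal sketch of how a base HelmRelease might consume these variables. It assumes Flux-style post-build substitution (the `${var:=default}` syntax quoted above) and the Longhorn chart's `csi.*ReplicaCount` values; the resource name, namespace, and exact layout are illustrative assumptions, not the repo's actual manifest.

```yaml
# Sketch only: metadata and layout are assumed, not copied from the repo.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: longhorn              # hypothetical name
  namespace: longhorn-system  # hypothetical namespace
spec:
  values:
    csi:
      # Each count falls back to 1 when the cluster's variables
      # ConfigMap does not define the corresponding key.
      attacherReplicaCount: ${longhorn_csi_attacher_replicas:=1}
      provisionerReplicaCount: ${longhorn_csi_provisioner_replicas:=1}
      resizerReplicaCount: ${longhorn_csi_resizer_replicas:=1}
```

With no override in place, every cluster that uses the base inherits the single-replica default.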

During the 2026-05-28/29 prod instability, the sole csi-attacher and csi-provisioner replicas happened to sit on prod-worker-2, whose Cilium ClusterIP datapath had degraded after an OOMKill. With the only replica unreachable, volume attach/detach orchestration failed cluster-wide: FailedAttachVolume storms, pods stuck in ContainerCreating, and the CD deploy health gate tripping on "FailedAttachVolume ×107".

Fix

Set longhorn_csi_attacher_replicas, longhorn_csi_provisioner_replicas, and longhorn_csi_resizer_replicas to "2" in the prod variables ConfigMap.
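As a rough sketch, the override could look like the fragment below. The three keys and their values are the ones named in this PR; the ConfigMap name and namespace are placeholders, since the actual metadata is not shown here.

```yaml
# Sketch only: metadata values are hypothetical placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: variables         # hypothetical name
  namespace: flux-system  # hypothetical namespace
data:
  # ConfigMap data values must be strings, hence the quoted "2".
  longhorn_csi_attacher_replicas: "2"
  longhorn_csi_provisioner_replicas: "2"
  longhorn_csi_resizer_replicas: "2"
```

Because the base falls back to 1 only when a key is absent, adding these keys for prod leaves other clusters on their existing defaults.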

These sidecars are leader-elected, so the second replica is an idle warm standby on another node — it keeps CSI functioning through a single-node outage at near-zero cost. This matches the existing "2 replicas for HA" convention already applied to cert-manager, metrics-server, KEDA, and external-secrets in the same file.

csi-snapshotter is intentionally left at 0 (its VolumeSnapshot CRDs aren't installed).

Note on scope

This is preventative (blast-radius reduction). It does not by itself resolve the active outage — that requires rebuilding prod-worker-2's Cilium BPF datapath (operator action) so the already-merged QoS fix (#1649) can deploy. Like every other manifest change right now, this PR can only take effect once GitOps reconciliation on prod recovers.

Validation

  • kubectl kustomize k8s/clusters/prod/variables/ builds; the three new vars render correctly alongside the existing HA vars.
  • kubectl kustomize k8s/clusters/local/ and k8s/clusters/prod/ both build.

…or HA

The Longhorn CSI control-plane sidecars (csi-attacher, csi-provisioner,
csi-resizer) default to a single replica in the base HelmRelease, and the
prod variables ConfigMap never overrode them — so each ran as a single
point of failure.

On 2026-05-28 the sole csi-attacher/provisioner happened to sit on
prod-worker-2, whose Cilium ClusterIP datapath had degraded after an
OOMKill. With the only replica unreachable, volume attach/detach
orchestration failed cluster-wide (FailedAttachVolume storms, pods stuck
in ContainerCreating, and the CD deploy health gate tripping on
"FailedAttachVolume ×107").

These sidecars are leader-elected, so a second replica is an idle warm
standby on another node — cheap insurance that keeps CSI functioning
through a single-node outage. Matches the existing "2 replicas for HA"
pattern already applied to cert-manager, metrics-server, KEDA and
external-secrets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR reduces Longhorn CSI control-plane single points of failure in the prod cluster by overriding the default replica counts for the CSI sidecars to run with warm-standby redundancy.

Changes:

  • Set longhorn_csi_attacher_replicas, longhorn_csi_provisioner_replicas, and longhorn_csi_resizer_replicas to "2" in the prod variables ConfigMap.
  • Document the operational rationale for running these leader-elected sidecars with 2 replicas for HA.

@devantler devantler marked this pull request as ready for review May 29, 2026 13:55
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026