fix(longhorn): reduce guaranteed-instance-manager-cpu 12% → 6%#1665

Open
devantler wants to merge 1 commit into main from claude/longhorn-im-cpu

@devantler
Contributor

HOLD — do not merge until the worker-2 replacement (4th worker) is back and Longhorn has replica-rebuild headroom. Applying this restarts the Longhorn data plane (instance-managers). The cluster is currently at 3 workers with replica count 3 (no N+1) after the worker-2 incident, so a data-plane restart now risks volume availability. Kept as a draft for exactly this reason.

Summary

A live, read-only per-container measurement of prod (2026-05-29) shows 6 Longhorn instance-manager pods each reserving ~474m (12% of a cx33) while using only 8–30m, which leaves ~2.7 cores locked. Instance-managers are not VPA-managed, so the auto-vpa policy can't right-size them.

Change

guaranteedInstanceManagerCPU 12% → 6% in the Longhorn HR defaultSettings (v1 + v2). 6% ≈ 240m on a cx33 — still 8–16× headroom over observed usage — and frees ~1.4 cores, while leaving margin for the CPU spikes during Longhorn replica rebuilds.
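
The arithmetic behind these figures can be reproduced with a short sketch. It is a back-of-the-envelope check, not output from the cluster: the allocatable CPU per node is an assumption inferred from the measured 474m at 12%, "locked" is assumed to mean reserved minus used, and all variable and function names are illustrative. Only the lower headroom bound (against the 30m peak) is computed, since that is the binding case.

```python
# Back-of-the-envelope check of the CPU figures quoted in this PR.
# ASSUMPTION: node-allocatable CPU is inferred from the measurement
# (474m reserved at 12%), not read from the cluster.

PODS = 6                     # instance-manager pods measured
RESERVED_AT_12_PCT_M = 474   # millicores reserved per pod today
OBSERVED_MIN_M = 8           # lower end of observed usage
OBSERVED_PEAK_M = 30         # upper end of observed usage


def reserved_millicores(allocatable_m: float, percent: float) -> float:
    """CPU request implied by a percentage of node-allocatable CPU."""
    return allocatable_m * percent / 100


# Infer allocatable CPU per node from the current 12% reservation.
allocatable_m = RESERVED_AT_12_PCT_M / 0.12

# ASSUMPTION: "locked" = total reserved minus total observed usage.
total_reserved_cores = PODS * RESERVED_AT_12_PCT_M / 1000
locked_low = total_reserved_cores - PODS * OBSERVED_PEAK_M / 1000
locked_high = total_reserved_cores - PODS * OBSERVED_MIN_M / 1000

# Per-pod request and cluster-wide saving after dropping to 6%.
new_request_m = reserved_millicores(allocatable_m, 6)
freed_cores = PODS * (RESERVED_AT_12_PCT_M - new_request_m) / 1000

# Headroom of the new request over the highest observed usage.
headroom_at_peak = new_request_m / OBSERVED_PEAK_M

print(f"locked today: {locked_low:.2f} to {locked_high:.2f} cores")
print(f"new request:  {new_request_m:.0f}m per pod")
print(f"freed:        {freed_cores:.2f} cores")
print(f"headroom:     {headroom_at_peak:.1f}x over peak usage")
```

Under these assumptions the new request comes out at about 237m per pod and the saving at about 1.42 cores, in line with the rounded ~240m and ~1.4 cores above.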

Why a separate, held PR

The companion right-sizing change (#1662) is GitOps-safe to land anytime. This one is not: changing the guaranteed CPU restarts instance-managers (the volume data plane). It must wait until the cluster is back to 4 workers and stable.

Validation

kubectl kustomize k8s/clusters/prod/ builds. (Helm values aren't rendered by Kustomize; the guaranteedInstanceManagerCPU v1/v2 map form matches the 1.11.2 chart's longhorn.multiTypeSetting helper.)

🤖 Generated with Claude Code

Live measurement of prod (2026-05-29): 6 Longhorn instance-managers each
reserve ~474m (12% of a cx33) while using 8-30m -- ~2.7 cores locked, and
instance-managers are not VPA-managed. Dropping the guarantee to 6% (~240m)
keeps 8-16x headroom over observed usage and frees ~1.4 cores, while leaving
margin for replica-rebuild CPU spikes.

HOLD: applying this restarts the Longhorn data plane (instance-managers).
Do not merge until the 4th worker is restored and Longhorn has replica-rebuild
headroom -- the cluster is currently at 3 workers with replica count 3 (no
N+1) following the worker-2 incident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR adjusts Longhorn’s HelmRelease default settings for Hetzner to reduce the CPU guaranteed (reserved) per Longhorn instance-manager, based on observed production usage, with the goal of freeing up schedulable CPU capacity on worker nodes.

Changes:

  • Reduce defaultSettings.guaranteedInstanceManagerCPU from the Longhorn default (12%) to 6% for both v1 and v2 instance-managers.
  • Add inline rationale documenting the measured production usage and the expected reclaimed CPU headroom.

@devantler devantler marked this pull request as ready for review May 29, 2026 19:08