fix(longhorn): reduce guaranteed-instance-manager-cpu 12% → 6% by devantler · Pull Request #1665 · devantler-tech/platform

devantler · 2026-05-29T18:42:44Z

⛔ HOLD — do not merge until the worker-2 replacement (4th worker) is back and Longhorn has replica-rebuild headroom. Applying this restarts the Longhorn data plane (instance-managers). The cluster is currently at 3 workers with replica count 3 (no N+1) after the worker-2 incident, so a data-plane restart now risks volume availability. Kept as a draft for exactly this reason.

Summary

A live, read-only per-container measurement of prod (2026-05-29) shows 6 Longhorn instance-manager pods each reserving ~474m (12% of a cx33) while using only 8–30m. That locks up ~2.7 cores in total, and because instance-managers are not VPA-managed, the auto-vpa policy can't right-size them.

Change

Lower guaranteedInstanceManagerCPU from 12% to 6% in the Longhorn HelmRelease defaultSettings (v1 + v2). 6% ≈ 240m on a cx33, which still gives 8–16× headroom over observed usage and leaves margin for the CPU spikes during Longhorn replica rebuilds, while freeing ~1.4 cores.
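The figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes a cx33 exposes a nominal 4000m (4 vCPUs) and ignores the small difference between nominal and allocatable CPU, which is presumably why the measured reservation is ~474m rather than 480m. The constant and function names are illustrative and do not come from the repository.

```python
# Back-of-the-envelope check of the CPU figures quoted in this PR.
NODE_MILLICORES = 4000         # assumption: a cx33 exposes 4 vCPUs (4000m nominal)
POD_COUNT = 6                  # instance-manager pods measured in prod
MEASURED_RESERVATION_M = 474   # observed per-pod reservation at the 12% setting
PEAK_USAGE_M = 30              # top of the observed 8-30m usage range


def guaranteed_millicores(percent: int, node_millicores: int = NODE_MILLICORES) -> int:
    """CPU reserved per instance-manager for a given percentage of node CPU."""
    return node_millicores * percent // 100


new_guarantee_m = guaranteed_millicores(6)
freed_m = POD_COUNT * (MEASURED_RESERVATION_M - new_guarantee_m)
worst_case_headroom = new_guarantee_m / PEAK_USAGE_M

print(f"new guarantee per pod: {new_guarantee_m}m")
print(f"freed across {POD_COUNT} pods: {freed_m}m")
print(f"headroom at peak usage: {worst_case_headroom:.0f}x")
```

Under these assumptions the new per-pod guarantee comes out at 240m, the freed capacity at roughly 1.4 cores, and the headroom at the observed peak at about 8×.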

Why a separate, held PR

The companion right-sizing change (#1662) is GitOps-safe to land anytime. This one is not: changing the guaranteed CPU restarts instance-managers (the volume data plane). It must wait until the cluster is back to 4 workers and stable.
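The merge precondition can be expressed as a simple N+1 check. The sketch below assumes Longhorn places each replica of a volume on a different node, so rebuilding a lost replica elsewhere needs at least one worker beyond the replica count. The helper name is hypothetical and the check does not cover the "stable" part of the condition, which still needs a human look at volume health.

```python
def has_rebuild_headroom(worker_count: int, replica_count: int) -> bool:
    """Return True when at least one worker beyond the replica count exists (N+1)."""
    return worker_count > replica_count


# Current state after the worker-2 incident: 3 workers, replica count 3.
print(has_rebuild_headroom(worker_count=3, replica_count=3))

# Target state once the 4th worker is restored.
print(has_rebuild_headroom(worker_count=4, replica_count=3))
```

The first call reports no headroom and the second reports headroom, which matches the hold condition described above.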

Validation

kubectl kustomize k8s/clusters/prod/ builds successfully. (Kustomize does not render Helm values, so the build does not exercise the setting itself; the v1/v2 map form of guaranteedInstanceManagerCPU matches the 1.11.2 chart's longhorn.multiTypeSetting helper.)

🤖 Generated with Claude Code

Live measurement of prod (2026-05-29): 6 Longhorn instance-managers each reserve ~474m (12% of a cx33) while using 8-30m -- ~2.7 cores locked, and instance-managers are not VPA-managed. Dropping the guarantee to 6% (~240m) keeps 8-16x headroom over observed usage and frees ~1.4 cores, while leaving margin for replica-rebuild CPU spikes.

HOLD: applying this restarts the Longhorn data plane (instance-managers). Do not merge until the 4th worker is restored and Longhorn has replica-rebuild headroom -- the cluster is currently at 3 workers with replica count 3 (no N+1) following the worker-2 incident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR adjusts Longhorn’s HelmRelease default settings for Hetzner to reduce the CPU guaranteed (reserved) per Longhorn instance-manager, based on observed production usage, with the goal of freeing up schedulable CPU capacity on worker nodes.

Changes:

- Reduce defaultSettings.guaranteedInstanceManagerCPU from the Longhorn default (12%) to 6% for both v1 and v2 instance-managers.
- Add inline rationale documenting the measured production usage and the expected reclaimed CPU headroom.

Copilot AI review requested due to automatic review settings May 29, 2026 18:42

github-project-automation Bot added this to 🌊 Project Board May 29, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 29, 2026

Copilot started reviewing on behalf of devantler May 29, 2026 18:42

devantler temporarily deployed to ci May 29, 2026 18:42 — with GitHub Actions Inactive

Copilot AI reviewed May 29, 2026

devantler marked this pull request as ready for review May 29, 2026 19:08

devantler wants to merge 1 commit into main from claude/longhorn-im-cpu
