fix(resources): lower CPU request floor to 15m and bound VPA recommendations by devantler · Pull Request #1662 · devantler-tech/platform

devantler · 2026-05-29T18:40:05Z

Summary

Right-sizing changes driven by a live per-container measurement of prod (2026-05-29, read-only), plus autoscaling doc-accuracy fixes.

Headline finding: CPU is only 28% utilized (11.3 cores requested vs 3.2 used — ~8 cores reserved-unused), while memory runs ~89% (tight).
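The headline numbers can be re-derived from the two quoted figures. This is a minimal sketch using only the values stated above; the variable names are illustrative and not taken from the repository.

```python
# Figures quoted in the summary above, in cores.
requested_cores = 11.3
used_cores = 3.2

# Fraction of requested CPU that is actually consumed.
utilization = used_cores / requested_cores
# Cores that are requested (reserved on nodes) but sit idle.
reserved_unused = requested_cores - used_cores

print(f"utilization: {utilization:.0%}")
print(f"reserved-unused: {reserved_unused:.1f} cores")
```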

Changes

VPA CPU floor 50m → 15m — auto-vpa.yaml minAllowed.cpu + the add-ns-quota LimitRange defaultRequest.cpu. 87 containers were pinned at the 50m floor using avg 4.4m (71 use <5m) — VPA was floored, not the workloads. Frees ~3 cores of reserved-unused CPU. CPU is compressible → low risk.
VPA maxAllowed 3 CPU / 6Gi — bounds a runaway request from becoming unschedulable or forcing an autoscaler node. Confirmed nothing is near it today.
docs/node-autoscaling.md + Longhorn comment — corrected stale facts vs ksail.prod.yaml: control planes + static workers are cx33 (not cx23), 4 static workers, LeastWaste expander, autoscale-medium max 2, enabled: true.
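The effect of the floor and the cap can be sketched as a simple clamp. This is an illustration under stated assumptions, not how VPA computes its targets: it uses the measured average usage (4.4m) as a stand-in for the raw recommendation, and it assumes all 87 pinned containers settle at the new floor, so the result is an upper bound on the freed CPU. The function and constant names are hypothetical.

```python
def clamp_millicores(recommended_m, min_allowed_m, max_allowed_m):
    """Bound a raw CPU recommendation to [min_allowed, max_allowed], in millicores."""
    return max(min_allowed_m, min(recommended_m, max_allowed_m))

OLD_FLOOR_M = 50        # previous minAllowed.cpu
NEW_FLOOR_M = 15        # new minAllowed.cpu
MAX_ALLOWED_M = 3000    # new maxAllowed.cpu (3 CPU)
PINNED_CONTAINERS = 87  # containers that sat at the old floor
AVG_USAGE_M = 4.4       # their measured average usage (assumed stand-in for the recommendation)

old_request = clamp_millicores(AVG_USAGE_M, OLD_FLOOR_M, MAX_ALLOWED_M)
new_request = clamp_millicores(AVG_USAGE_M, NEW_FLOOR_M, MAX_ALLOWED_M)

# Upper bound on freed CPU requests if every pinned container drops to the new floor.
freed_cores = PINNED_CONTAINERS * (old_request - new_request) / 1000

print(f"request per container: {old_request}m -> {new_request}m")
print(f"freed: ~{freed_cores:.1f} cores")
```

Under these assumptions the result is about 3 cores, in line with the figure quoted in the PR. The same clamp shows why the cap matters: a runaway recommendation above 3000m would be held at 3000m.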

Deliberately not changed

Memory floors — memory runs ~89% hot with only ~2.3Gi slack, and the big consumers (cilium, mysql, prometheus, apiserver) already use more than they request. Per-worker memory pressure is addressed by restoring the 4th worker, not by trimming requests.

Validation

kubectl kustomize k8s/clusters/prod/ and .../local/ both build. No live apply — this is GitOps and lands on the next artifact push. The CPU down-resize is benign even during the current worker-2 incident.

Related

A companion change (Longhorn guaranteedInstanceManagerCPU 12% → 6%, ~1.4 more cores) is a separate draft PR held until the worker-2 replacement rejoins and Longhorn regains rebuild headroom. It restarts the Longhorn data plane and must not land mid-incident (the cluster is currently at 3 workers with replica count 3, i.e. no N+1).
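The N+1 condition mentioned above reduces to a one-line check. A minimal sketch, assuming one replica per worker and treating "N+1" as at least one storage worker beyond the replica count; the function name is hypothetical.

```python
def has_rebuild_headroom(storage_workers: int, replica_count: int) -> bool:
    """True if a spare worker remains to rebuild a replica after one worker is lost."""
    return storage_workers >= replica_count + 1

# During the worker-2 incident: 3 workers, replica count 3.
print(has_rebuild_headroom(storage_workers=3, replica_count=3))
# Once the 4th worker is restored: 4 workers, replica count 3.
print(has_rebuild_headroom(storage_workers=4, replica_count=3))
```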

🤖 Generated with Claude Code

…dations

From a live per-container measurement of prod (2026-05-29, read-only): CPU is only 28% utilized (~8 of 11.3 requested cores reserved-unused) while memory runs ~89% hot.

- auto-vpa: VPA minAllowed.cpu 50m -> 15m. 87 containers sat at the 50m floor using avg 4.4m -- VPA was floored, not the workloads. Frees ~3 cores of reserved-unused CPU requests. CPU is compressible, so low risk; memory floor kept at 64Mi.
- add-ns-quota: LimitRange defaultRequest.cpu 50m -> 15m to match.
- auto-vpa: add maxAllowed (3 CPU / 6Gi) so a runaway request cannot become unschedulable or force an autoscaler node (nothing is near it today).
- docs(node-autoscaling) + Longhorn comment: correct stale facts vs ksail.prod.yaml (cx33 not cx23, 4 static workers, LeastWaste, autoscale-medium max 2, enabled: true).

Memory left untouched (89% used, ~2.3Gi slack) -- worker memory pressure is addressed by restoring the 4th worker, not request cuts. Both overlays build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR right-sizes default CPU requests and VPA recommendations for the GitOps-managed Kubernetes platform, while updating autoscaling/topology documentation to match prod configuration.

Changes:

Lowers VPA CPU minAllowed and namespace LimitRange default CPU request from 50m to 15m.
Adds VPA maxAllowed bounds of 3 CPU / 6Gi.
Updates prod topology/autoscaling docs and Longhorn replica-count comments.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml` | Updates generated VPA resource policies with lower CPU floors and new upper bounds. |
| `k8s/bases/infrastructure/cluster-policies/kustomization.yaml` | Adjusts generated LimitRange default CPU requests. |
| `docs/node-autoscaling.md` | Corrects documented prod node sizes, worker count, autoscaler settings, and expander behavior. |
| `k8s/clusters/prod/variables/variables-cluster-config-map.yaml` | Updates Longhorn storage-node and replica-count comments for the 4-worker topology. |

Addresses review on #1662:

- add-resource-defaults: stamp requests.cpu 15m (was 50m) so the mutate policy matches the lowered VPA minAllowed + LimitRange defaultRequest. Otherwise pods are admitted at 50m before VPA has recommendations, inconsistent with the floor.
- auto-vpa: give the DaemonSet rule a cx23-sized maxAllowed (1 CPU / 1Gi) instead of 3 CPU / 6Gi. DaemonSets must schedule on every node including autoscale-small cx23 (2 vCPU / 4 GB); the cx33-sized cap could make generated DaemonSet pods unschedulable there. Deployments/StatefulSets keep 3 CPU / 6Gi (they land on cx33).

Both overlays build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
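The reasoning behind the separate DaemonSet cap can be checked against the node size quoted in this commit. A rough sketch, assuming a pod requesting the full cap must fit on a single node; it compares against raw node capacity (real allocatable capacity is lower) and treats the quoted 4 GB as 4 Gi for simplicity. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    memory_gi: float

def cap_fits_node(cap: Resources, node: Resources) -> bool:
    """True if a request at the cap could fit on an otherwise empty node."""
    return cap.cpu_cores <= node.cpu_cores and cap.memory_gi <= node.memory_gi

# cx23 size as stated in the commit message (2 vCPU / 4 GB, GB treated as Gi here).
CX23 = Resources(cpu_cores=2, memory_gi=4)

WORKLOAD_CAP = Resources(cpu_cores=3, memory_gi=6)   # Deployments/StatefulSets cap
DAEMONSET_CAP = Resources(cpu_cores=1, memory_gi=1)  # DaemonSet cap after this commit

print(cap_fits_node(WORKLOAD_CAP, CX23))
print(cap_fits_node(DAEMONSET_CAP, CX23))
```

The 3 CPU / 6Gi cap exceeds a cx23 on both dimensions, while 1 CPU / 1Gi leaves room for other pods on the same node.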

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Addresses review on #1662: the autoscaling-doc + configmap fixes left sibling docs contradictory.

- README: prod is 3x cx33 control planes + 4x cx33 workers (was 3x cx23 + 3x cx23); updated the cost-table row (CX33 x7 at EUR 6.49/mo) and total.
- TEMPLATING: default Hetzner server type cx33 (was cx23).
- rwx-storage: longhorn_replica_count default 3 (was 2); replica count is one fewer than the storage-worker count for N+1 rebuild headroom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
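The updated cost-table row follows from the node counts and unit price in this commit. A small sketch of that arithmetic only; the README total is not quoted here, so it is not reproduced, and the constant names are illustrative.

```python
from decimal import Decimal

CONTROL_PLANES = 3
WORKERS = 4
PRICE_PER_NODE_EUR = Decimal("6.49")  # CX33 monthly price quoted in the commit

node_count = CONTROL_PLANES + WORKERS
# Decimal avoids binary floating-point rounding in currency arithmetic.
row_subtotal_eur = node_count * PRICE_PER_NODE_EUR

print(f"CX33 x{node_count}: EUR {row_subtotal_eur}/mo")
```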

Copilot AI review requested due to automatic review settings May 29, 2026 18:40

github-project-automation Bot added this to 🌊 Project Board May 29, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 29, 2026

devantler had a problem deploying to ci May 29, 2026 18:40 — with GitHub Actions Failure

Copilot started reviewing on behalf of devantler May 29, 2026 18:40

devantler mentioned this pull request May 29, 2026

fix(longhorn): reduce guaranteed-instance-manager-cpu 12% → 6% #1665

Open

Copilot AI reviewed May 29, 2026


Comment thread k8s/bases/infrastructure/cluster-policies/kustomization.yaml

Comment thread k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml Outdated

devantler had a problem deploying to ci May 29, 2026 18:51 — with GitHub Actions Error

devantler marked this pull request as ready for review May 29, 2026 19:07

Copilot AI review requested due to automatic review settings May 29, 2026 19:07

Copilot started reviewing on behalf of devantler May 29, 2026 19:07

Copilot AI reviewed May 29, 2026


Comment thread docs/node-autoscaling.md

Comment thread k8s/clusters/prod/variables/variables-cluster-config-map.yaml

botantler Bot approved these changes May 29, 2026


botantler Bot enabled auto-merge May 29, 2026 19:16

devantler temporarily deployed to ci May 29, 2026 19:17 — with GitHub Actions Inactive

botantler Bot added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026


devantler wants to merge 3 commits into main from claude/rightsizing-cpu-requests


2 participants
