fix(resources): lower CPU request floor to 15m and bound VPA recommendations #1662
Open
devantler wants to merge 3 commits into
…dations

From a live per-container measurement of prod (2026-05-29, read-only): CPU is only 28% utilized (~8 of 11.3 requested cores reserved-unused) while memory runs ~89% hot.

- auto-vpa: VPA minAllowed.cpu 50m -> 15m. 87 containers sat at the 50m floor using avg 4.4m -- VPA was floored, not the workloads. Frees ~3 cores of reserved-unused CPU requests. CPU is compressible, so low risk; memory floor kept at 64Mi.
- add-ns-quota: LimitRange defaultRequest.cpu 50m -> 15m to match.
- auto-vpa: add maxAllowed (3 CPU / 6Gi) so a runaway request cannot become unschedulable or force an autoscaler node (nothing is near it today).
- docs(node-autoscaling) + Longhorn comment: correct stale facts vs ksail.prod.yaml (cx33 not cx23, 4 static workers, LeastWaste, autoscale-medium max 2, enabled: true).

Memory left untouched (89% used, ~2.3Gi slack) -- worker memory pressure is addressed by restoring the 4th worker, not request cuts.

Both overlays build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
This PR right-sizes default CPU requests and VPA recommendations for the GitOps-managed Kubernetes platform, while updating autoscaling/topology documentation to match prod configuration.
Changes:
- Lowers VPA CPU `minAllowed` and namespace LimitRange default CPU request from 50m to 15m.
- Adds VPA `maxAllowed` bounds of 3 CPU / 6Gi.
- Updates prod topology/autoscaling docs and Longhorn replica-count comments.
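As a rough illustration of the first bullet, a namespace LimitRange with the lowered default CPU request might look like the fragment below. This is a minimal sketch, not the repository's manifest: the PR generates the LimitRange from its `add-ns-quota` policy, and the `metadata` values here are hypothetical placeholders.

```yaml
# Sketch only: metadata values are hypothetical, not taken from the repository.
apiVersion: v1
kind: LimitRange
metadata:
  name: example-defaults   # hypothetical name
  namespace: example       # hypothetical namespace
spec:
  limits:
    - type: Container
      # Applied to containers that do not set their own CPU request.
      defaultRequest:
        cpu: 15m           # was 50m before this PR
```

Containers that declare an explicit CPU request are not affected by `defaultRequest`; it only fills in a value when none is given.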
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml` | Updates generated VPA resource policies with lower CPU floors and new upper bounds. |
| `k8s/bases/infrastructure/cluster-policies/kustomization.yaml` | Adjusts generated LimitRange default CPU requests. |
| `docs/node-autoscaling.md` | Corrects documented prod node sizes, worker count, autoscaler settings, and expander behavior. |
| `k8s/clusters/prod/variables/variables-cluster-config-map.yaml` | Updates Longhorn storage-node and replica-count comments for the 4-worker topology. |
Addresses review on #1662:

- add-resource-defaults: stamp requests.cpu 15m (was 50m) so the mutate policy matches the lowered VPA minAllowed + LimitRange defaultRequest. Otherwise pods are admitted at 50m before VPA has recommendations, inconsistent with the floor.
- auto-vpa: give the DaemonSet rule a cx23-sized maxAllowed (1 CPU / 1Gi) instead of 3 CPU / 6Gi. DaemonSets must schedule on every node including autoscale-small cx23 (2 vCPU / 4 GB); the cx33-sized cap could make generated DaemonSet pods unschedulable there. Deployments/StatefulSets keep 3 CPU / 6Gi (they land on cx33).

Both overlays build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses review on #1662. The autoscaling-doc + configmap fixes left sibling docs contradictory:

- README: prod is 3x cx33 control planes + 4x cx33 workers (was 3x cx23 + 3x cx23); updated the cost-table row (CX33 x7 at EUR 6.49/mo) and total.
- TEMPLATING: default Hetzner server type cx33 (was cx23).
- rwx-storage: longhorn_replica_count default 3 (was 2); replica count is one fewer than the storage-worker count for N+1 rebuild headroom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Right-sizing changes driven by a live per-container measurement of prod (2026-05-29, read-only), plus autoscaling doc-accuracy fixes.
Headline finding: CPU is only 28% utilized (11.3 cores requested vs 3.2 used — ~8 cores reserved-unused), while memory runs ~89% (tight).
Changes
- Lowered `minAllowed.cpu` in `auto-vpa.yaml` and the `add-ns-quota` LimitRange `defaultRequest.cpu` from 50m to 15m. 87 containers were pinned at the 50m floor while using 4.4m on average (71 use <5m): VPA was floored, not the workloads. Frees ~3 cores of reserved-unused CPU. CPU is compressible, so the risk is low.
- Added VPA `maxAllowed` of 3 CPU / 6Gi, which keeps a runaway request from becoming unschedulable or forcing an autoscaler node. Confirmed nothing is near it today.
- `docs/node-autoscaling.md` + Longhorn comment: corrected stale facts vs `ksail.prod.yaml`. Control planes + static workers are cx33 (not cx23), 4 static workers, LeastWaste expander, autoscale-medium max 2, `enabled: true`.
Deliberately not changed
Memory floors — memory runs ~89% hot with only ~2.3Gi slack, and the big consumers (cilium, mysql, prometheus, apiserver) already use more than they request. Per-worker memory pressure is addressed by restoring the 4th worker, not by trimming requests.
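For orientation, the resulting bounds (lowered CPU floor, unchanged 64Mi memory floor, new 3 CPU / 6Gi cap) would appear on a generated VPA roughly as sketched below. This is an assumption-laden illustration rather than the content of `auto-vpa.yaml`: the object name, namespace, and target are hypothetical, and the wildcard container policy is only one plausible way to apply the bounds.

```yaml
# Sketch only: names and target are hypothetical, not taken from the repository.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app        # hypothetical name
  namespace: example       # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment       # Deployments/StatefulSets get the 3 CPU / 6Gi cap
    name: example-app
  resourcePolicy:
    containerPolicies:
      - containerName: "*" # assumption: bounds applied to every container
        minAllowed:
          cpu: 15m         # lowered from 50m in this PR
          memory: 64Mi     # memory floor deliberately left unchanged
        maxAllowed:
          cpu: "3"         # new upper bound
          memory: 6Gi      # new upper bound
```

Per the follow-up commit, the DaemonSet rule uses a smaller cap (1 CPU / 1Gi) so that generated DaemonSet pods still fit on the cx23 autoscale-small nodes.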
Validation
`kubectl kustomize k8s/clusters/prod/` and `.../local/` both build. No live apply: this is GitOps and lands on the next artifact push. The CPU down-resize is benign even during the current worker-2 incident.
Related
A companion change (Longhorn `guaranteedInstanceManagerCPU` 12% → 6%, ~1.4 more cores) is a separate draft PR held until the worker-2 replacement rejoins and Longhorn regains rebuild headroom. It restarts the Longhorn data plane and must not land mid-incident (the cluster is currently at 3 workers with replica count 3, i.e. no N+1).

🤖 Generated with Claude Code