fix(resources): lower CPU request floor to 15m and bound VPA recommendations #1662

Open
devantler wants to merge 3 commits into main from claude/rightsizing-cpu-requests

Conversation

@devantler
Contributor

Summary

Right-sizing changes driven by a live per-container measurement of prod (2026-05-29, read-only), plus autoscaling doc-accuracy fixes.

Headline finding: CPU is only 28% utilized (11.3 cores requested vs 3.2 used — ~8 cores reserved-unused), while memory runs ~89% (tight).
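
The headline figures can be re-derived with a few lines of arithmetic. This is a minimal sketch using only the two numbers quoted above; the variable and function names are illustrative and do not come from the repository.

```python
# Figures quoted in the summary above, in cores.
requested_cores = 11.3
used_cores = 3.2


def utilization(used: float, requested: float) -> float:
    """Return the fraction of requested CPU that is actually used."""
    return used / requested


cpu_utilization = utilization(used_cores, requested_cores)
reserved_unused = requested_cores - used_cores

print(f"CPU utilization: {cpu_utilization:.0%}")
print(f"Reserved but unused: {reserved_unused:.1f} cores")
```

The result lands at roughly 28% utilization and about 8 cores reserved but unused, which matches the summary.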

Changes

  • VPA CPU floor 50m → 15m: auto-vpa.yaml minAllowed.cpu + the add-ns-quota LimitRange defaultRequest.cpu. 87 containers were pinned at the 50m floor while using 4.4m on average (71 use <5m); the VPA floor, not the workloads, set those requests. Frees ~3 cores of reserved-unused CPU. CPU is compressible → low risk.
  • VPA maxAllowed 3 CPU / 6Gi — bounds a runaway request from becoming unschedulable or forcing an autoscaler node. Confirmed nothing is near it today.
  • docs/node-autoscaling.md + Longhorn comment — corrected stale facts vs ksail.prod.yaml: control planes + static workers are cx33 (not cx23), 4 static workers, LeastWaste expander, autoscale-medium max 2, enabled: true.
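
The "~3 cores" estimate for the floor change can be sanity-checked as follows. This sketch assumes every one of the 87 floored containers drops all the way to the new 15m floor, which makes it an upper bound: containers whose VPA recommendation sits above 15m will free less. The names are illustrative.

```python
# Numbers taken from the change description above.
CONTAINERS_AT_FLOOR = 87
OLD_FLOOR_M = 50  # millicores
NEW_FLOOR_M = 15  # millicores


def freed_millicores(containers: int, old_floor_m: int, new_floor_m: int) -> int:
    """Upper bound on CPU requests freed if every floored container
    moves from the old floor to the new floor (an assumption)."""
    return containers * (old_floor_m - new_floor_m)


freed_m = freed_millicores(CONTAINERS_AT_FLOOR, OLD_FLOOR_M, NEW_FLOOR_M)
freed_cores = freed_m / 1000

print(f"Freed: {freed_m}m (about {freed_cores:.1f} cores)")
```

Under that assumption the change frees 3045m, consistent with the stated figure of roughly 3 cores.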

Deliberately not changed

Memory floors — memory runs ~89% hot with only ~2.3Gi slack, and the big consumers (cilium, mysql, prometheus, apiserver) already use more than they request. Per-worker memory pressure is addressed by restoring the 4th worker, not by trimming requests.

Validation

kubectl kustomize k8s/clusters/prod/ and .../local/ both build. No live apply — this is GitOps and lands on the next artifact push. The CPU down-resize is benign even during the current worker-2 incident.

Related

A companion change (Longhorn guaranteedInstanceManagerCPU 12% → 6%, ~1.4 more cores) is a separate draft PR held until the worker-2 replacement rejoins and Longhorn regains rebuild headroom — it restarts the Longhorn data plane and must not land mid-incident (the cluster is currently at 3 workers with replica count 3, i.e. no N+1).

🤖 Generated with Claude Code

…dations

From a live per-container measurement of prod (2026-05-29, read-only): CPU is
only 28% utilized (~8 of 11.3 requested cores reserved-unused) while memory
runs ~89% hot.

- auto-vpa: VPA minAllowed.cpu 50m -> 15m. 87 containers sat at the 50m floor
  using avg 4.4m -- VPA was floored, not the workloads. Frees ~3 cores of
  reserved-unused CPU requests. CPU is compressible, so low risk; memory floor
  kept at 64Mi.
- add-ns-quota: LimitRange defaultRequest.cpu 50m -> 15m to match.
- auto-vpa: add maxAllowed (3 CPU / 6Gi) so a runaway request cannot become
  unschedulable or force an autoscaler node (nothing is near it today).
- docs(node-autoscaling) + Longhorn comment: correct stale facts vs
  ksail.prod.yaml (cx33 not cx23, 4 static workers, LeastWaste, autoscale-
  medium max 2, enabled: true).

Memory left untouched (89% used, ~2.3Gi slack) -- worker memory pressure is
addressed by restoring the 4th worker, not request cuts. Both overlays build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR right-sizes default CPU requests and VPA recommendations for the GitOps-managed Kubernetes platform, while updating autoscaling/topology documentation to match prod configuration.

Changes:

  • Lowers VPA CPU minAllowed and namespace LimitRange default CPU request from 50m to 15m.
  • Adds VPA maxAllowed bounds of 3 CPU / 6Gi.
  • Updates prod topology/autoscaling docs and Longhorn replica-count comments.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml | Updates generated VPA resource policies with lower CPU floors and new upper bounds. |
| k8s/bases/infrastructure/cluster-policies/kustomization.yaml | Adjusts generated LimitRange default CPU requests. |
| docs/node-autoscaling.md | Corrects documented prod node sizes, worker count, autoscaler settings, and expander behavior. |
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Updates Longhorn storage-node and replica-count comments for the 4-worker topology. |

Comment thread k8s/bases/infrastructure/cluster-policies/kustomization.yaml
Comment thread k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml Outdated

Addresses review on #1662:
- add-resource-defaults: stamp requests.cpu 15m (was 50m) so the mutate policy
  matches the lowered VPA minAllowed + LimitRange defaultRequest. Otherwise pods
  are admitted at 50m before VPA has recommendations, inconsistent with the floor.
- auto-vpa: give the DaemonSet rule a cx23-sized maxAllowed (1 CPU / 1Gi) instead
  of 3 CPU / 6Gi. DaemonSets must schedule on every node including autoscale-small
  cx23 (2 vCPU / 4 GB); the cx33-sized cap could make generated DaemonSet pods
  unschedulable there. Deployments/StatefulSets keep 3 CPU / 6Gi (they land on cx33).

Both overlays build.
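
The reasoning behind the smaller DaemonSet cap can be sketched as a simple fit check. This is an illustration only: it compares the caps against the raw cx23 size quoted above (2 vCPU / 4 GB, treated here as 4096Mi), whereas real schedulable capacity is the node's allocatable, which is lower. The `Resources` type and `fits` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_m: int      # CPU in millicores
    memory_mi: int  # memory in MiB


def fits(request: Resources, node: Resources) -> bool:
    """True if a single pod with this request could fit on an empty node.
    Ignores system reservations and other pods (a simplification)."""
    return request.cpu_m <= node.cpu_m and request.memory_mi <= node.memory_mi


# cx23 size as quoted in the commit message; 4 GB approximated as 4096Mi.
CX23 = Resources(cpu_m=2000, memory_mi=4096)

DAEMONSET_CAP = Resources(cpu_m=1000, memory_mi=1024)  # 1 CPU / 1Gi
DEFAULT_CAP = Resources(cpu_m=3000, memory_mi=6144)    # 3 CPU / 6Gi

print("DaemonSet cap fits cx23:", fits(DAEMONSET_CAP, CX23))
print("Default cap fits cx23:", fits(DEFAULT_CAP, CX23))
```

A request at the 3 CPU / 6Gi cap exceeds the cx23 size on both dimensions, while the 1 CPU / 1Gi cap leaves room for other pods on the same node.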

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@devantler devantler marked this pull request as ready for review May 29, 2026 19:07
Copilot AI review requested due to automatic review settings May 29, 2026 19:07
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread docs/node-autoscaling.md
Comment thread k8s/clusters/prod/variables/variables-cluster-config-map.yaml

Addresses review on #1662 — the autoscaling-doc + configmap fixes left sibling docs contradictory:
- README: prod is 3x cx33 control planes + 4x cx33 workers (was 3x cx23 + 3x cx23); updated the cost-table row (CX33 x7 at EUR 6.49/mo) and total.
- TEMPLATING: default Hetzner server type cx33 (was cx23).
- rwx-storage: longhorn_replica_count default 3 (was 2); replica count is one fewer than the storage-worker count for N+1 rebuild headroom.
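
The replica-count rule in the last bullet can be expressed as a small helper. This is a sketch of the stated rule (replica count = storage workers minus one), not code from the repository; the function names are hypothetical.

```python
def n_plus_one_replica_count(storage_workers: int) -> int:
    """Replica count that leaves one spare storage worker for rebuilds."""
    if storage_workers < 2:
        raise ValueError("need at least 2 storage workers for N+1 headroom")
    return storage_workers - 1


def has_rebuild_headroom(storage_workers: int, replica_count: int) -> bool:
    """True if at least one worker is free to host a rebuilt replica."""
    return storage_workers > replica_count


print(n_plus_one_replica_count(4))    # 4 workers, as in the target topology
print(has_rebuild_headroom(4, 3))     # target topology
print(has_rebuild_headroom(3, 3))     # 3 workers with replica count 3
```

With 4 workers the rule gives a replica count of 3. With only 3 workers and a replica count of 3, there is no spare worker, which is the "no N+1" situation described in the PR description.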

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@botantler botantler Bot enabled auto-merge May 29, 2026 19:16
@botantler botantler Bot added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
Projects

Status: 🫴 Ready

2 participants