feat(node): graceful memory-pressure handling + QoS for all critical components by devantler · Pull Request #1667 · devantler-tech/platform

devantler · 2026-05-30T07:11:07Z

🤖 Generated by the Daily AI Assistant

Make the prod (Hetzner) cluster degrade gracefully under node memory pressure, and guarantee every cluster-critical component is last to be evicted.

1. Kubelet memory eviction — `talos/cluster/kubelet.yaml` (new)

Sets `machine.kubelet.extraConfig` so that a node sheds load early and keeps OS/kubelet headroom instead of hitting the kernel OOM killer:

| Setting | Value | Purpose |
| --- | --- | --- |
| `systemReserved.memory` / `kubeReserved.memory` | `256Mi` each | Reserve out of allocatable so pods can't starve the OS/kubelet. |
| `evictionSoft.memory.available` + grace | `500Mi`, `1m30s` | Evict lowest-priority pods cleanly before it's critical. |
| `evictionMaxPodGracePeriod` | `60` | Bound the soft-eviction drain. |
| `evictionMinimumReclaim.memory.available` | `200Mi` | Avoid immediate re-trigger. |
| `evictionHard.memory.available` | `100Mi` | Last-resort floor. |

Node-pressure eviction is priority-aware and skips critical pods → workload pods shed first.
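As a rough illustration, the settings in the table above could be expressed as a Talos machine-config patch like the one below. This is a sketch reconstructed from the table, not the file from the PR: the field names are standard `KubeletConfiguration` keys, and the assumption is that the soft-eviction grace period is set through `evictionSoftGracePeriod`.

```yaml
# Sketch of talos/cluster/kubelet.yaml, reconstructed from the table above.
# Assumption: the grace period for the soft threshold lives in
# evictionSoftGracePeriod (the standard KubeletConfiguration field).
machine:
  kubelet:
    extraConfig:
      # Memory held back for the OS and for Kubernetes node daemons.
      systemReserved:
        memory: 256Mi
      kubeReserved:
        memory: 256Mi
      # Soft threshold: evict once memory.available stays below 500Mi for 1m30s.
      evictionSoft:
        memory.available: 500Mi
      evictionSoftGracePeriod:
        memory.available: 1m30s
      # Upper bound (in seconds) on pod termination grace during soft eviction.
      evictionMaxPodGracePeriod: 60
      # Reclaim at least this much per eviction pass to avoid an immediate re-trigger.
      evictionMinimumReclaim:
        memory.available: 200Mi
      # Hard threshold: immediate eviction with no grace period.
      evictionHard:
        memory.available: 100Mi
```

With these values, node allocatable memory is capacity minus `kubeReserved`, `systemReserved`, and the hard eviction threshold, i.e. capacity minus 256Mi + 256Mi + 100Mi = 612Mi.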

2. QoS / priority for all critical components

Audited every component in scope (CNI, CSI, CDI, monitoring, kube-apiserver/etcd/scheduler/controller-manager, cloud-controller-manager, DNS). Most were already protected — this PR closes the real gaps.

| Component | Status | Priority class |
| --- | --- | --- |
| Control plane (apiserver/etcd/scheduler/controller-manager) | ✅ already | Talos static pods, inherently exempt from kubelet eviction |
| CNI: Cilium agent / operator | ✅ already | chart helper → `system-node-critical` / `system-cluster-critical` |
| CSI: hcloud-csi | ✅ this PR | controller → `system-cluster-critical`, node DaemonSet → `system-node-critical` (chart left both unset) |
| CSI: Longhorn | ✅ already | chart default → `longhorn-critical` (all components) |
| CCM: hcloud-ccm | ✅ already | chart default → `system-cluster-critical` |
| DNS: CoreDNS | ✅ already | docker: explicit `system-cluster-critical`; prod: Talos-managed default |
| metrics-server | ✅ already | chart default → `system-cluster-critical` |
| Monitoring: kube-prometheus-stack | ✅ this PR | ran at priority 0 → new `platform-critical` on Prometheus, Alertmanager, operator, node-exporter, kube-state-metrics |
| CDI | ✅ this PR | ran at priority 0 → CR `spec.priorityClass: kubevirt-cluster-critical` |
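For orientation, the two "this PR" rows for hcloud-csi and CDI might look roughly like the fragments below. This is a hedged sketch, not the manifests from the PR: the `controller.priorityClassName` and `node.priorityClassName` value paths are assumed from the chart's usual layout, the resource names and namespace are placeholders, and the required HelmRelease fields (chart reference, interval) are omitted.

```yaml
# Fragment only: other required HelmRelease fields are left out.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: hcloud-csi          # placeholder name
  namespace: kube-system    # assumed namespace
spec:
  values:
    controller:
      # Assumed value path; provisioning/attach control plane.
      priorityClassName: system-cluster-critical
    node:
      # Assumed value path; per-node DaemonSet handling mount/unmount.
      priorityClassName: system-node-critical
---
# CDI custom resource; spec.priorityClass is the field named in the table above.
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi                 # placeholder name
spec:
  priorityClass: kubevirt-cluster-critical
```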

New `platform-critical` PriorityClass (value `1000000000`)

For important platform add-ons (monitoring) that aren't core k8s/node infra. It sits above normal workloads, so its pods are evicted after them, but below the `system-*` classes (`2000000000` and up), so true infra still wins. It is deliberately not `system-cluster-critical`, which would make Prometheus eviction-exempt and able to drive the node into kernel OOM. It is a peer of the existing `kubevirt-cluster-critical` (`1000000000`). CDI reuses `kubevirt-cluster-critical` since it's a KubeVirt subproject and that class already exists.
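A minimal sketch of such a PriorityClass is shown below. Only the name and value come from this PR description; the `description` text and the explicit `globalDefault: false` are illustrative assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
# 1000000000 is the highest value allowed for a user-defined class;
# the built-in system-* classes start at 2000000000.
value: 1000000000
# Assumption: not the cluster-wide default, so only pods that opt in get it.
globalDefault: false
# Illustrative wording, not taken from the PR.
description: Important platform add-ons (e.g. monitoring) that are not core node or cluster infrastructure.
```

A workload opts in by setting `priorityClassName: platform-critical` in its pod spec.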

Scope / trade-offs

- Kubelet eviction config is prod only (`talos/`). The CI/Docker cluster shares host memory across node containers, so absolute eviction thresholds could evict pods during the system-test. Adding `talos-local/` parity later is trivial.
- Monitoring/CDI priority changes live in `k8s/bases/`, so they apply to local + prod and are exercised by the CI system-test.
- The `cdi-operator` singleton reconciler is intentionally left at default priority (not data-path; restarts harmlessly). The CDI control plane (apiserver/controller/uploadproxy) is covered.

⚠️ Kubelet config applies on the next `ksail --config ksail.prod.yaml cluster update`; the priority classes apply on the next Flux reconcile. Merging alone has no effect.

Validation

- `talosctl gen` → `machineconfig patch @talos/cluster/kubelet.yaml` → `validate -m cloud --strict` (controlplane + worker): valid.
- `ksail --config ksail.prod.yaml workload validate` and `ksail workload validate`: 257 files each.
- `kubectl kustomize` (hetzner controllers): `platform-critical` PriorityClass renders; all 5 monitoring `priorityClassName` refs + CDI `priorityClass` present.
- Full Talos+Docker system-test runs in CI on this PR.

Follow-ups (not here)

- Optional `talos-local/` kubelet parity.

…be-system pods

Two complementary changes so the prod (Hetzner) cluster degrades gracefully under node memory pressure instead of letting the kernel OOM-killer reap a node-critical daemon.

1. talos/cluster/kubelet.yaml (kubelet memory eviction config):
   - systemReserved/kubeReserved (256Mi each) carve headroom out of node-allocatable so pods can never starve the OS or the kubelet.
   - evictionSoft (memory.available<500Mi, 90s grace) sheds the lowest-priority pods *before* the hard floor, with evictionMaxPodGracePeriod=60 bounding the drain and evictionMinimumReclaim=200Mi avoiding immediate re-trigger.
   - evictionHard (memory.available<100Mi) stays as the last-resort floor.

   Node-pressure eviction is priority-aware and skips critical pods, so workload pods are always shed before the kube-system control/storage plane.

2. hcloud-csi HelmRelease: assign priority classes (chart default leaves both unset → priority 0, making them first eviction candidates):
   - controller -> system-cluster-critical (provisioning/attach control plane)
   - node (DaemonSet) -> system-node-critical (per-node volume mount/unmount)

   Critical pods are exempt from node-pressure eviction, so storage keeps working under memory pressure. (hcloud-ccm already defaults to system-cluster-critical, so no change there.)

Scope: prod only. The local/CI Docker cluster shares host memory across node containers, where absolute eviction thresholds could evict pods during the CI system-test; left out deliberately and can be added if desired.

Validation:
- talosctl gen + machineconfig patch + validate -m cloud --strict on both controlplane and worker: valid; extraConfig merges as expected.
- ksail --config ksail.prod.yaml workload validate: 256 files validated.
- ksail workload validate (local): 256 files validated (shared build intact).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Introduces production (Hetzner) safeguards so the cluster degrades more gracefully under node memory pressure: the kubelet evicts workload pods earlier, and storage-critical CSI pods become some of the last candidates for eviction.

Changes:

- Add Talos kubelet `extraConfig` to reserve memory and configure soft/hard eviction thresholds for `memory.available`.
- Set `priorityClassName` for hcloud-csi controller and node components to reduce their likelihood of being evicted under pressure.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `talos/cluster/kubelet.yaml` | Adds kubelet memory reservation + eviction threshold configuration for earlier, priority-aware eviction behavior. |
| `k8s/providers/hetzner/infrastructure/controllers/hcloud-csi/helm-release.yaml` | Assigns system-critical PriorityClasses to CSI controller/daemon to protect storage functionality during memory pressure. |

Thorough audit of every cluster-critical component's priority/QoS (CNI, CSI, CDI, monitoring, control plane, CCM, DNS).

Most were already covered by chart or platform defaults:
- Cilium agent -> system-node-critical, operator -> system-cluster-critical (chart helper fallback)
- Longhorn -> longhorn-critical (chart default on all components)
- hcloud-ccm + metrics-server -> system-cluster-critical (chart default)
- hcloud-csi -> set in the prior commit
- CoreDNS -> system-cluster-critical (docker: explicit; prod: Talos-managed)
- control plane (apiserver/etcd/scheduler/controller-manager) -> Talos static pods, inherently exempt from kubelet node-pressure eviction

Gaps closed here:
- Monitoring stack (kube-prometheus-stack) ran at priority 0. New `platform-critical` PriorityClass (value 1000000000: above workloads, below the system-* classes so true infra still wins; NOT eviction-exempt, so a runaway Prometheus is still reclaimable before kernel OOM) applied to Prometheus, Alertmanager, the operator, node-exporter, and kube-state-metrics.
- CDI control plane ran at priority 0. Set the CDI CR `spec.priorityClass` to the existing `kubevirt-cluster-critical` (CDI is a KubeVirt subproject; that class is already created by the KubeVirt operator).

Validation:
- ksail workload validate (local + prod): 257 files each.
- kubectl kustomize hetzner controllers build: PriorityClass renders; all 5 monitoring priorityClassName refs + CDI priorityClass present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

…s consumers

Address Copilot review feedback on #1667:
- Reword eviction comments: node-pressure eviction ranks pods by Priority (critical pods are the *last* eviction candidates), not a strict exemption. Fix the evictionSoft description: it triggers when memory.available stays below the soft threshold (500Mi) for the grace period (1m30s), not "~90s before the hard floor".
- Move priority-classes/ to the top of the controllers kustomization so the platform-critical PriorityClass is applied before the HelmReleases (e.g. kube-prometheus-stack) that reference it, avoiding a fresh-reconcile race.

No behavioural change to the manifests themselves (comments + resource order).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
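The resource-order change described in this commit can be pictured with a sketch of the controllers kustomization. Only `priority-classes/` and its position at the top come from the commit message; every other entry below is a hypothetical directory name used for illustration.

```yaml
# Sketch of k8s/bases/infrastructure/controllers/kustomization.yaml.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Listed first so the platform-critical PriorityClass is declared
  # ahead of the HelmReleases that reference it.
  - priority-classes/
  # Hypothetical entries standing in for the consumers of the class.
  - kube-prometheus-stack/
  - cdi/
```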

Copilot AI review requested due to automatic review settings May 30, 2026 07:11

github-project-automation Bot added this to 🌊 Project Board May 30, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 30, 2026

devantler temporarily deployed to ci May 30, 2026 07:11 — with GitHub Actions Inactive

Copilot started reviewing on behalf of devantler May 30, 2026 07:11

Copilot AI reviewed May 30, 2026

devantler temporarily deployed to ci May 30, 2026 08:37 — with GitHub Actions Inactive

devantler changed the title ~~feat(node): handle memory pressure gracefully and protect critical kube-system pods~~ feat(node): graceful memory-pressure handling + QoS for all critical components May 30, 2026

devantler marked this pull request as ready for review May 30, 2026 08:52

Copilot AI review requested due to automatic review settings May 30, 2026 08:52

Copilot started reviewing on behalf of devantler May 30, 2026 08:52

Copilot AI reviewed May 30, 2026

Comment thread k8s/bases/infrastructure/controllers/kustomization.yaml

Comment thread talos/cluster/kubelet.yaml Outdated

botantler Bot approved these changes May 30, 2026

botantler Bot enabled auto-merge May 30, 2026 09:06

devantler deployed to ci May 30, 2026 09:07 — with GitHub Actions Active

botantler Bot added this pull request to the merge queue May 30, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 30, 2026

devantler wants to merge 3 commits into main from claude/kubelet-eviction-qos

2 participants
