feat(node): graceful memory-pressure handling + QoS for all critical components#1667

Open
devantler wants to merge 3 commits into main from claude/kubelet-eviction-qos

Contributor

@devantler devantler commented May 30, 2026

🤖 Generated by the Daily AI Assistant

Make the prod (Hetzner) cluster degrade gracefully under node memory pressure, and guarantee every cluster-critical component is last to be evicted.

1. Kubelet memory eviction — talos/cluster/kubelet.yaml (new)

Sets machine.kubelet.extraConfig so that a node sheds load early and keeps OS/kubelet headroom instead of hitting the kernel OOM killer:

| Setting | Value | Purpose |
| --- | --- | --- |
| systemReserved.memory / kubeReserved.memory | 256Mi each | Reserve out of allocatable so pods can't starve the OS/kubelet. |
| evictionSoft.memory.available + grace | 500Mi, 1m30s | Evict lowest-priority pods cleanly before it's critical. |
| evictionMaxPodGracePeriod | 60 (seconds) | Bound the soft-eviction drain. |
| evictionMinimumReclaim.memory.available | 200Mi | Avoid immediate re-trigger. |
| evictionHard.memory.available | 100Mi | Last-resort floor. |

Node-pressure eviction is priority-aware and skips critical pods → workload pods shed first.
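
As a rough sketch, the settings in the table above could be expressed as a Talos machine-config patch along these lines. The field names are the standard KubeletConfiguration ones and the values come from the table, but the actual contents of talos/cluster/kubelet.yaml are not reproduced in this description, so treat the layout as an assumption.

```yaml
# Sketch only: reconstructed from the settings table, not copied from the PR's file.
machine:
  kubelet:
    extraConfig:
      # Reserve memory out of node allocatable for the OS and for Kubernetes daemons.
      systemReserved:
        memory: 256Mi
      kubeReserved:
        memory: 256Mi
      # Soft threshold: evict once memory.available stays below 500Mi for 1m30s.
      evictionSoft:
        memory.available: 500Mi
      evictionSoftGracePeriod:
        memory.available: 1m30s
      # Upper bound (seconds) on pod termination grace during soft eviction.
      evictionMaxPodGracePeriod: 60
      # Once eviction triggers, reclaim at least this much beyond the threshold.
      evictionMinimumReclaim:
        memory.available: 200Mi
      # Hard threshold: evict immediately, with no grace period.
      evictionHard:
        memory.available: 100Mi
```

With these numbers, node allocatable memory would be capacity minus 612Mi (256Mi + 256Mi + the 100Mi hard eviction threshold), since the kubelet subtracts the hard threshold as well as the two reservations.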

2. QoS / priority for all critical components

Audited every component in scope (CNI, CSI, CDI, monitoring, kube-apiserver/etcd/scheduler/controller-manager, cloud-controller-manager, DNS). Most were already protected — this PR closes the real gaps.

| Component | Status | Priority class |
| --- | --- | --- |
| Control plane (apiserver/etcd/scheduler/controller-manager) | ✅ already | Talos static pods; inherently exempt from kubelet eviction |
| CNI: Cilium agent / operator | ✅ already | chart helper → system-node-critical / system-cluster-critical |
| CSI: hcloud-csi | this PR | controller → system-cluster-critical, node DaemonSet → system-node-critical (chart left both unset) |
| CSI: Longhorn | ✅ already | chart default → longhorn-critical (all components) |
| CCM: hcloud-ccm | ✅ already | chart default → system-cluster-critical |
| DNS: CoreDNS | ✅ already | docker: explicit system-cluster-critical; prod: Talos-managed default |
| metrics-server | ✅ already | chart default → system-cluster-critical |
| Monitoring: kube-prometheus-stack | this PR | ran at priority 0 → new platform-critical on Prometheus, Alertmanager, operator, node-exporter, kube-state-metrics |
| CDI | this PR | ran at priority 0 → CR spec.priorityClass: kubevirt-cluster-critical |
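
For the hcloud-csi row, the change might look roughly like the following HelmRelease values fragment. The value keys (controller.priorityClassName, node.priorityClassName) are an assumption about the chart's values layout, and the rest of the HelmRelease is omitted; check the chart's values.yaml before relying on them.

```yaml
# Fragment of a Flux HelmRelease for hcloud-csi; key names are assumed, not confirmed.
spec:
  values:
    controller:
      # Provisioning/attach control plane.
      priorityClassName: system-cluster-critical
    node:
      # Per-node DaemonSet that mounts and unmounts volumes.
      priorityClassName: system-node-critical
```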

New platform-critical PriorityClass (value 1000000000)

For important platform add-ons (monitoring) that aren't core k8s/node infra. It sits above normal workloads, so those are evicted first, but below the system-* classes (2000000000 and above) so true infra still wins. It is deliberately not system-cluster-critical, which would make Prometheus eviction-exempt and able to drive the node into kernel OOM. It is a peer of the existing kubevirt-cluster-critical (1000000000). CDI reuses kubevirt-cluster-critical since it's a KubeVirt subproject and that class already exists.
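
A minimal sketch of what such a PriorityClass manifest could look like. Only the name and value come from this description; the globalDefault setting and the description text are illustrative assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
# Same value as kubevirt-cluster-critical; below system-cluster-critical (2000000000).
value: 1000000000
# Assumed: pods without a priorityClassName should not pick this class up.
globalDefault: false
description: "Platform add-ons (e.g. monitoring): above workloads, below system-* classes."
```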

Scope / trade-offs

  • Kubelet eviction config is prod only (talos/). On the CI/Docker cluster the node containers share host memory, so absolute eviction thresholds could evict pods during the system-test. Trivial to add talos-local/ parity later.
  • Monitoring/CDI priority changes live in k8s/bases/, so they apply to local + prod and are exercised by the CI system-test.
  • The cdi-operator singleton reconciler is intentionally left at default priority (not data-path; restarts harmlessly). The CDI control plane (apiserver/controller/uploadproxy) is covered.
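
To illustrate the monitoring and CDI changes, here is a hedged sketch of where the priority class could be set. The kube-prometheus-stack value paths (including the subchart keys) are assumptions about the chart's layout, and the CDI resource name is a placeholder; only spec.priorityClass and the class names come from this PR.

```yaml
# Fragment of kube-prometheus-stack Helm values; paths are assumed, not confirmed.
prometheus:
  prometheusSpec:
    priorityClassName: platform-critical
alertmanager:
  alertmanagerSpec:
    priorityClassName: platform-critical
prometheusOperator:
  priorityClassName: platform-critical
prometheus-node-exporter:
  priorityClassName: platform-critical
kube-state-metrics:
  priorityClassName: platform-critical
---
# CDI custom resource; metadata.name is a placeholder.
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi
spec:
  priorityClass: kubevirt-cluster-critical
```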

⚠️ Kubelet config applies on the next ksail --config ksail.prod.yaml cluster update; the priority classes apply on the next Flux reconcile. No effect from merging alone.

Validation

  • talosctl gen → machineconfig patch @talos/cluster/kubelet.yaml → validate -m cloud --strict (controlplane + worker): valid.
  • ksail --config ksail.prod.yaml workload validate and ksail workload validate: 257 files each.
  • kubectl kustomize (hetzner controllers): platform-critical PriorityClass renders; all 5 monitoring priorityClassName refs + CDI priorityClass present.
  • Full Talos+Docker system-test runs in CI on this PR.

Follow-ups (not here)

  • Optional talos-local/ kubelet parity.

…be-system pods

Two complementary changes so the prod (Hetzner) cluster degrades gracefully
under node memory pressure instead of letting the kernel OOM-killer reap a
node-critical daemon.

1. talos/cluster/kubelet.yaml — kubelet memory eviction config:
   - systemReserved/kubeReserved (256Mi each) carve headroom out of
     node-allocatable so pods can never starve the OS or the kubelet.
   - evictionSoft (memory.available<500Mi, 90s grace) sheds the lowest-priority
     pods *before* the hard floor, with evictionMaxPodGracePeriod=60 bounding
     the drain and evictionMinimumReclaim=200Mi avoiding immediate re-trigger.
   - evictionHard (memory.available<100Mi) stays as the last-resort floor.
   Node-pressure eviction is priority-aware and skips critical pods, so
   workload pods are always shed before the kube-system control/storage plane.

2. hcloud-csi HelmRelease — assign priority classes (chart default leaves both
   unset → priority 0, making them first eviction candidates):
   - controller -> system-cluster-critical (provisioning/attach control plane)
   - node (DaemonSet) -> system-node-critical (per-node volume mount/unmount)
   Critical pods are exempt from node-pressure eviction, so storage keeps
   working under memory pressure. (hcloud-ccm already defaults to
   system-cluster-critical, so no change there.)

Scope: prod only. The local/CI Docker cluster shares host memory across node
containers, where absolute eviction thresholds could evict pods during the CI
system-test; left out deliberately and can be added if desired.

Validation:
- talosctl gen + machineconfig patch + validate -m cloud --strict on both
  controlplane and worker: valid; extraConfig merges as expected.
- ksail --config ksail.prod.yaml workload validate: 256 files validated.
- ksail workload validate (local): 256 files validated (shared build intact).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Introduces production (Hetzner) safeguards to degrade more gracefully under node memory pressure by having the kubelet evict workload pods earlier and by ensuring storage-critical CSI pods are deprioritized for eviction.

Changes:

  • Add Talos kubelet extraConfig to reserve memory and configure soft/hard eviction thresholds for memory.available.
  • Set priorityClassName for hcloud-csi controller and node components to reduce their likelihood of being evicted under pressure.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| talos/cluster/kubelet.yaml | Adds kubelet memory reservation + eviction threshold configuration for earlier, priority-aware eviction behavior. |
| k8s/providers/hetzner/infrastructure/controllers/hcloud-csi/helm-release.yaml | Assigns system-critical PriorityClasses to CSI controller/daemon to protect storage functionality during memory pressure. |

Comment thread talos/cluster/kubelet.yaml Outdated
Comment thread talos/cluster/kubelet.yaml Outdated
Comment thread k8s/providers/hetzner/infrastructure/controllers/hcloud-csi/helm-release.yaml Outdated
Comment thread k8s/providers/hetzner/infrastructure/controllers/hcloud-csi/helm-release.yaml Outdated
Thorough audit of every cluster-critical component's priority/QoS (CNI, CSI,
CDI, monitoring, control plane, CCM, DNS). Most were already covered by chart
or platform defaults:
  - Cilium agent -> system-node-critical, operator -> system-cluster-critical
    (chart helper fallback)
  - Longhorn -> longhorn-critical (chart default on all components)
  - hcloud-ccm + metrics-server -> system-cluster-critical (chart default)
  - hcloud-csi -> set in the prior commit
  - CoreDNS -> system-cluster-critical (docker: explicit; prod: Talos-managed)
  - control plane (apiserver/etcd/scheduler/controller-manager) -> Talos static
    pods, inherently exempt from kubelet node-pressure eviction

Gaps closed here:
  - Monitoring stack (kube-prometheus-stack) ran at priority 0. New
    `platform-critical` PriorityClass (value 1000000000 — above workloads,
    below the system-* classes so true infra still wins; NOT eviction-exempt,
    so a runaway Prometheus is still reclaimable before kernel OOM) applied to
    Prometheus, Alertmanager, the operator, node-exporter, and kube-state-metrics.
  - CDI control plane ran at priority 0. Set the CDI CR `spec.priorityClass` to
    the existing `kubevirt-cluster-critical` (CDI is a KubeVirt subproject; that
    class is already created by the KubeVirt operator).

Validation:
- ksail workload validate (local + prod): 257 files each.
- kubectl kustomize hetzner controllers build: PriorityClass renders; all 5
  monitoring priorityClassName refs + CDI priorityClass present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@devantler devantler changed the title from "feat(node): handle memory pressure gracefully and protect critical kube-system pods" to "feat(node): graceful memory-pressure handling + QoS for all critical components" May 30, 2026
@devantler devantler marked this pull request as ready for review May 30, 2026 08:52
Copilot AI review requested due to automatic review settings May 30, 2026 08:52
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread k8s/bases/infrastructure/controllers/kustomization.yaml
Comment thread talos/cluster/kubelet.yaml Outdated
…s consumers

Address Copilot review feedback on #1667:
- Reword eviction comments: node-pressure eviction ranks pods by Priority
  (critical pods are the *last* eviction candidates), not a strict exemption.
  Fix the evictionSoft description — it triggers when memory.available stays
  below the soft threshold (500Mi) for the grace period (1m30s), not "~90s
  before the hard floor".
- Move priority-classes/ to the top of the controllers kustomization so the
  platform-critical PriorityClass is applied before the HelmReleases (e.g.
  kube-prometheus-stack) that reference it, avoiding a fresh-reconcile race.

No behavioural change to the manifests themselves (comments + resource order).
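
The reordering described above might look like this in the controllers kustomization. Only priority-classes/ is named in the commit message; the other entries are hypothetical placeholders.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Listed first, per the commit message, so the PriorityClass is applied
  # before the HelmReleases that reference it.
  - priority-classes/
  # Hypothetical entries standing in for the HelmRelease directories.
  - kube-prometheus-stack/
  - cdi/
```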

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@botantler botantler Bot enabled auto-merge May 30, 2026 09:06
@botantler botantler Bot added this pull request to the merge queue May 30, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 30, 2026