feat(node): graceful memory-pressure handling + QoS for all critical components #1667
Open
devantler wants to merge 3 commits into
…be-system pods
Two complementary changes so the prod (Hetzner) cluster degrades gracefully
under node memory pressure instead of letting the kernel OOM-killer reap a
node-critical daemon.
1. talos/cluster/kubelet.yaml (kubelet memory eviction config):
   - systemReserved/kubeReserved (256Mi each) carve headroom out of
     node-allocatable so pods can never starve the OS or the kubelet.
   - evictionSoft (memory.available<500Mi, 90s grace) sheds the lowest-priority
     pods *before* the hard floor, with evictionMaxPodGracePeriod=60 bounding
     the drain and evictionMinimumReclaim=200Mi avoiding immediate re-trigger.
   - evictionHard (memory.available<100Mi) stays as the last-resort floor.
   Node-pressure eviction is priority-aware and skips critical pods, so
   workload pods are always shed before the kube-system control/storage plane.
2. hcloud-csi HelmRelease: assign priority classes (chart default leaves both
   unset → priority 0, making them first eviction candidates):
   - controller -> system-cluster-critical (provisioning/attach control plane)
   - node (DaemonSet) -> system-node-critical (per-node volume mount/unmount)
   Critical pods are exempt from node-pressure eviction, so storage keeps
   working under memory pressure. (hcloud-ccm already defaults to
   system-cluster-critical, so no change there.)
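As an illustration of item 2, the HelmRelease values might look roughly like the fragment below. This is a sketch rather than the PR's actual diff: the `controller.priorityClassName` and `node.priorityClassName` keys are assumed from the hcloud-csi chart's values layout and should be checked against the chart version in use.

```yaml
# Fragment of the hcloud-csi HelmRelease (spec.values only).
# Key names are assumptions; verify them against the chart's values.yaml.
spec:
  values:
    controller:
      # Provisioning/attach control plane
      priorityClassName: system-cluster-critical
    node:
      # Per-node DaemonSet handling volume mount/unmount
      priorityClassName: system-node-critical
```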
Scope: prod only. The local/CI Docker cluster shares host memory across node
containers, where absolute eviction thresholds could evict pods during the CI
system-test; left out deliberately and can be added if desired.
Validation:
- talosctl gen + machineconfig patch + validate -m cloud --strict on both
controlplane and worker: valid; extraConfig merges as expected.
- ksail --config ksail.prod.yaml workload validate: 256 files validated.
- ksail workload validate (local): 256 files validated (shared build intact).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Introduces production (Hetzner) safeguards so nodes degrade more gracefully under memory pressure: the kubelet evicts workload pods earlier, and storage-critical CSI pods become some of the last candidates for eviction.
Changes:
- Add Talos kubelet `extraConfig` to reserve memory and configure soft/hard eviction thresholds for `memory.available`.
- Set `priorityClassName` for `hcloud-csi` controller and node components to reduce their likelihood of being evicted under pressure.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `talos/cluster/kubelet.yaml` | Adds kubelet memory reservation + eviction threshold configuration for earlier, priority-aware eviction behavior. |
| `k8s/providers/hetzner/infrastructure/controllers/hcloud-csi/helm-release.yaml` | Assigns system-critical PriorityClasses to CSI controller/daemon to protect storage functionality during memory pressure. |
Thorough audit of every cluster-critical component's priority/QoS (CNI, CSI,
CDI, monitoring, control plane, CCM, DNS). Most were already covered by chart
or platform defaults:
- Cilium agent -> system-node-critical, operator -> system-cluster-critical
  (chart helper fallback)
- Longhorn -> longhorn-critical (chart default on all components)
- hcloud-ccm + metrics-server -> system-cluster-critical (chart default)
- hcloud-csi -> set in the prior commit
- CoreDNS -> system-cluster-critical (docker: explicit; prod: Talos-managed)
- control plane (apiserver/etcd/scheduler/controller-manager) -> Talos static
  pods, inherently exempt from kubelet node-pressure eviction
Gaps closed here:
- Monitoring stack (kube-prometheus-stack) ran at priority 0. New
  `platform-critical` PriorityClass (value 1000000000: above workloads,
  below the system-* classes so true infra still wins; NOT eviction-exempt,
  so a runaway Prometheus is still reclaimable before kernel OOM) applied to
  Prometheus, Alertmanager, the operator, node-exporter, and kube-state-metrics.
- CDI control plane ran at priority 0. Set the CDI CR `spec.priorityClass` to
  the existing `kubevirt-cluster-critical` (CDI is a KubeVirt subproject; that
  class is already created by the KubeVirt operator).
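The two gaps described above could be closed with manifests along these lines. This is a minimal sketch under stated assumptions: the `description` text and the CDI resource name `cdi` are illustrative, not taken from the PR; only the class names, the value, and the `spec.priorityClass` field come from the commit message.

```yaml
# Sketch of the new PriorityClass; the description text is illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000000   # above ordinary workloads, below the system-* classes
globalDefault: false
description: "Platform add-ons (e.g. monitoring) that should outlive ordinary workloads."
---
# Fragment of the CDI custom resource; metadata.name is an assumption.
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi
spec:
  priorityClass: kubevirt-cluster-critical
```

Note that 1000000000 is the highest value Kubernetes accepts for a user-defined PriorityClass; larger values are reserved for the built-in system classes.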
Validation:
- ksail workload validate (local + prod): 257 files each.
- kubectl kustomize hetzner controllers build: PriorityClass renders; all 5
monitoring priorityClassName refs + CDI priorityClass present.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s consumers
Address Copilot review feedback on #1667:
- Reword eviction comments: node-pressure eviction ranks pods by Priority
  (critical pods are the *last* eviction candidates), not a strict exemption.
  Fix the evictionSoft description: it triggers when memory.available stays
  below the soft threshold (500Mi) for the grace period (1m30s), not "~90s
  before the hard floor".
- Move priority-classes/ to the top of the controllers kustomization so the
  platform-critical PriorityClass is applied before the HelmReleases (e.g.
  kube-prometheus-stack) that reference it, avoiding a fresh-reconcile race.
No behavioural change to the manifests themselves (comments + resource order).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
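The reordering described in this commit might look like the kustomization sketched below. Only `priority-classes/` is named in the commit message; the other entries are hypothetical placeholders for the directories that hold the HelmReleases.

```yaml
# Sketch of the controllers kustomization. Entries other than
# priority-classes/ are hypothetical placeholders.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - priority-classes/        # defines platform-critical; listed first
  - kube-prometheus-stack/   # hypothetical: references platform-critical
  - cdi/                     # hypothetical
```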
Make the prod (Hetzner) cluster degrade gracefully under node memory pressure, and guarantee every cluster-critical component is last to be evicted.
1. Kubelet memory eviction: `talos/cluster/kubelet.yaml` (new)
`machine.kubelet.extraConfig` so a node sheds load early and keeps OS/kubelet headroom instead of hitting the kernel OOM-killer:
| Setting | Value |
|---|---|
| `systemReserved.memory` / `kubeReserved.memory` | `256Mi` each |
| `evictionSoft.memory.available` + grace | `500Mi`, `1m30s` |
| `evictionMaxPodGracePeriod` | `60` |
| `evictionMinimumReclaim.memory.available` | `200Mi` |
| `evictionHard.memory.available` | `100Mi` |
Node-pressure eviction is priority-aware and skips critical pods → workload pods shed first.
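Expressed as a Talos machine-config patch, the settings above might look like the following. This is a sketch assembled from the values listed in this description, not a copy of the file in the PR; the field names are the standard `KubeletConfiguration` ones, passed through `machine.kubelet.extraConfig`.

```yaml
# Sketch of talos/cluster/kubelet.yaml, reconstructed from the listed values.
machine:
  kubelet:
    extraConfig:
      # Headroom carved out of node-allocatable for the OS and the kubelet
      systemReserved:
        memory: 256Mi
      kubeReserved:
        memory: 256Mi
      # Soft threshold: evict once memory.available stays below 500Mi for 1m30s
      evictionSoft:
        memory.available: 500Mi
      evictionSoftGracePeriod:
        memory.available: 1m30s
      # Upper bound (seconds) on pod termination grace during soft eviction
      evictionMaxPodGracePeriod: 60
      # Reclaim at least this much to avoid immediately re-triggering
      evictionMinimumReclaim:
        memory.available: 200Mi
      # Hard threshold: last-resort floor, evicts without grace
      evictionHard:
        memory.available: 100Mi
```

Under the usual node-allocatable calculation (capacity minus reservations minus the hard eviction threshold), these values subtract 256Mi + 256Mi + 100Mi = 612Mi of memory from what each node advertises as allocatable.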
2. QoS / priority for all critical components
Audited every component in scope (CNI, CSI, CDI, monitoring, kube-apiserver/etcd/scheduler/controller-manager, cloud-controller-manager, DNS). Most were already protected — this PR closes the real gaps.
| Component | Priority class |
|---|---|
| Cilium (agent / operator) | `system-node-critical` / `system-cluster-critical` |
| hcloud-csi | controller → `system-cluster-critical`, node DaemonSet → `system-node-critical` (chart left both unset) |
| Longhorn | `longhorn-critical` (all components) |
| hcloud-ccm | `system-cluster-critical` |
| CoreDNS | docker: explicit `system-cluster-critical`; prod: Talos-managed default |
| metrics-server | `system-cluster-critical` |
| Monitoring (kube-prometheus-stack) | `platform-critical` on Prometheus, Alertmanager, operator, node-exporter, kube-state-metrics |
| CDI | `spec.priorityClass: kubevirt-cluster-critical` |
New `platform-critical` PriorityClass (value `1000000000`)
For important platform add-ons (monitoring) that aren't core k8s/node infra. Sits above normal workloads (evicted last) but below the `system-*` classes (`2000000000` and up) so true infra still wins, and deliberately not `system-cluster-critical`, which would make Prometheus eviction-exempt and able to drive the node into kernel OOM. Peer of the existing `kubevirt-cluster-critical` (`1000000000`). CDI reuses `kubevirt-cluster-critical` since it's a KubeVirt subproject and that class already exists.
Scope / trade-offs
- …`talos/`); the memory-shared CI/Docker cluster could evict pods during the system-test. Trivial to add `talos-local/` parity later.
- …`k8s/bases/`, so they apply to local + prod and are exercised by the CI system-test.
- `cdi-operator` singleton reconciler is intentionally left at default priority (not data-path; restarts harmlessly). The CDI control plane (apiserver/controller/uploadproxy) is covered.
Validation
- `talosctl gen` → `machineconfig patch @talos/cluster/kubelet.yaml` → `validate -m cloud --strict` (controlplane + worker): valid.
- `ksail --config ksail.prod.yaml workload validate` and `ksail workload validate`: 257 files each.
- `kubectl kustomize` (hetzner controllers): `platform-critical` PriorityClass renders; all 5 monitoring `priorityClassName` refs + CDI `priorityClass` present.
Follow-ups (not here)
- `talos-local/` kubelet parity.