From ca099624268b27a724fd45ab0bf871925bf3ba3d Mon Sep 17 00:00:00 2001
From: Nikolai Emil Damm
Date: Fri, 29 May 2026 08:03:40 +0200
Subject: [PATCH 1/2] fix(cilium,spire): set resource requests to promote
 critical agents out of BestEffort QoS
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cilium agent, operator, envoy, and the embedded SPIRE server/agent run
in BestEffort QoS by default — the upstream chart leaves `resources:`
empty everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request
commitment) OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP
routing on that node degraded, then got stuck retrying
`SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist`
because the spire-agent DaemonSet pod for the node was also BestEffort
and crash-looping. ~13 workloads cascaded into i/o timeout against
`10.96.0.1:443` and `10.96.193.18:8081` (cert-manager-cainjector,
virt-handler, all spire-agents, spire-server, kustomize-controller,
flux-operator, fleet, keda http external scaler, kube-state-metrics,
trust-manager, origin-ca-issuer, csi-provisioner).

Add explicit requests to:
- `resources` (agent DaemonSet) — 200m / 512Mi (observed steady-state
  ~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet) — 50m / 128Mi
- `operator.resources` — 100m / 256Mi
- `authentication.mutual.spire.install.server.resources` — 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources` — 50m / 128Mi

All five pods are now Burstable instead of BestEffort, so they're no
longer first in line for kubelet eviction / OOMKill under node memory
pressure. Limits intentionally unset — Cilium recommends against capping
the agent.

Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated
reasons (VPA recommendations + workload density). Adding ~768Mi of new
DaemonSet requests per node will tip it further; a follow-up rebalance
or worker scale-up is likely needed. Flagged in PR body.

Recovery action (separate from this PR): once Flux has reconciled the
new resources, restart the wedged agent with
`kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`.
If prod-worker-2 doesn't recover within ~5 min, reboot the node via
talosctl / Hetzner console.

Validated with:
- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/ — clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/ — clean

Refs: #1636

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../controllers/cilium/helm-release.yaml | 37 +++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml b/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml
index e8184ccf3..3c57fc71d 100644
--- a/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml
+++ b/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml
@@ -79,6 +79,10 @@ spec:
       enabled: false
     operator:
       replicas: ${cilium_replicas:=2}
+      resources:
+        requests:
+          cpu: 100m
+          memory: 256Mi
       podDisruptionBudget:
         enabled: true
         minAvailable: 1
@@ -94,6 +98,24 @@ spec:
     ipam:
       mode: kubernetes
     kubeProxyReplacement: true
+    # ------------------------------------------------------------------
+    # Resource requests for the agent DaemonSet and the standalone
+    # cilium-envoy DaemonSet. These promote the pods out of BestEffort
+    # QoS so they survive node memory pressure; an OOMKilled cilium-agent
+    # leaves BPF state degraded and the node loses ClusterIP routing
+    # (observed cascading into ~13 workload crash-loops on prod-worker-2,
+    # 2026-05-28). Limits intentionally unset — Cilium recommends against
+    # capping the agent (https://docs.cilium.io/en/stable/operations/performance/).
+    # ------------------------------------------------------------------
+    resources:
+      requests:
+        cpu: 200m
+        memory: 512Mi
+    envoy:
+      resources:
+        requests:
+          cpu: 50m
+          memory: 128Mi
     # Transparent WireGuard encryption for all pod-to-pod and node-to-node
     # traffic. KubeSpan (Talos-layer WireGuard between nodes) is not
     # enabled in this cluster, so without this setting inter-node pod
@@ -118,8 +140,23 @@ spec:
           install:
             namespace: kube-system
             existingNamespace: true
+            # Resource requests promote spire-server and spire-agent pods
+            # out of BestEffort QoS. cilium-agent's SPIRE Delegate API
+            # client relies on the per-node spire-agent admin socket — if
+            # the agent is evicted/OOMKilled the cilium-agent on that
+            # node stays stuck retrying "SPIRE admin socket does not
+            # exist" and ClusterIP routing degrades alongside it.
+            agent:
+              resources:
+                requests:
+                  cpu: 50m
+                  memory: 128Mi
             # TODO: Remove workaround when SPIRE no longer fails to start (https://github.com/cilium/cilium/issues/40533)
             server:
+              resources:
+                requests:
+                  cpu: 50m
+                  memory: 128Mi
               initContainers:
                 - command:
                     - /bin/sh

From 0bad5534c606bb04aeb02faa242b796252e25812 Mon Sep 17 00:00:00 2001
From: Nikolai Emil Damm
Date: Fri, 29 May 2026 08:12:34 +0200
Subject: [PATCH 2/2] ci: retrigger after transient ksail-workload-validate EOF
 on bases/apps/headlamp
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previous run (26621107142) failed in System Test at the kubeconform step
with `validation failed: EOF` for `bases/apps/headlamp` — a schema-fetch
network blip, not a content failure. Headlamp is untouched by this PR
(diff is the Cilium HelmRelease only), the manifest validates cleanly
locally on this branch, and the previous successful main-line run
validated headlamp from the same files.

Empty commit to retrigger CI.

Co-Authored-By: Claude Opus 4.7 (1M context)
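
Reviewer note on patch 1/2: the QoS reasoning and the "~768Mi per node" figure can be checked with a minimal Python sketch. It is not part of the change. `qos_class` is a hypothetical helper that follows the documented Kubernetes QoS rules in simplified form (quantities are compared as plain strings, not normalized). The per-node total assumes that only the agent, cilium-envoy, and spire-agent pods run on every node; that assumption is inferred from the commit message, not stated in it.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources (simplified sketch).

    Each container is a dict with optional "requests" and "limits" maps.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Memory requests (MiB) from the patch for the pods assumed to run per node.
PER_NODE_MIB = {"cilium-agent": 512, "cilium-envoy": 128, "spire-agent": 128}

if __name__ == "__main__":
    before = qos_class([{}])  # chart default: empty resources
    after = qos_class([{"requests": {"cpu": "200m", "memory": "512Mi"}}])
    print(before, "->", after)
    print("new per-node memory requests:", sum(PER_NODE_MIB.values()), "MiB")
```

Because the patch sets requests without limits, the pods land in Burstable rather than Guaranteed, which matches the commit message.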