From ca099624268b27a724fd45ab0bf871925bf3ba3d Mon Sep 17 00:00:00 2001
From: Nikolai Emil Damm <nikolaiemildamm@icloud.com>
Date: Fri, 29 May 2026 08:03:40 +0200
Subject: [PATCH 1/2] fix(cilium,spire): set resource requests to promote
 critical agents out of BestEffort QoS
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in
BestEffort QoS by default because the upstream chart leaves `resources:`
empty everywhere. On 2026-05-28, prod-worker-2 (at 98% memory request
commitment) OOMKilled `cilium-2z7fv`. The restarted agent left ClusterIP
routing on that node degraded, then got stuck retrying `SPIRE admin
socket (/run/spire/sockets/admin.sock) does not exist` because the
spire-agent DaemonSet pod for the node was also BestEffort and
crash-looping. As a result, ~13 workloads hit i/o timeouts against
`10.96.0.1:443` and `10.96.193.18:8081` (cert-manager-cainjector,
virt-handler, all spire-agents, spire-server, kustomize-controller,
flux-operator, fleet, keda http external scaler, kube-state-metrics,
trust-manager, origin-ca-issuer, csi-provisioner).

Add explicit requests (CPU / memory) to:
- `resources` (agent DaemonSet) — 200m / 512Mi (observed steady-state
  ~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet) — 50m / 128Mi
- `operator.resources` — 100m / 256Mi
- `authentication.mutual.spire.install.server.resources` — 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources` — 50m / 128Mi

Pods of all five workloads are now Burstable instead of BestEffort, so
they're no longer first in line for kubelet eviction / OOMKill under node
memory pressure. Limits are intentionally left unset: Cilium recommends
against capping the agent.
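
For reference, the QoS promotion described above can be sketched as a
simplified classifier. This is an illustration, not kubelet code:
`qos_class` is a hypothetical helper, it compares quantities as plain
strings, and it ignores init containers and any resource other than CPU
and memory.

```python
from typing import Dict, List

# One entry per container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Approximate the Kubernetes QoS class for a pod (simplified)."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for key in ("cpu", "memory"):
            if key in requests or key in limits:
                any_set = True
            # Guaranteed needs a limit for each resource, and the request
            # (which defaults to the limit when omitted) must equal it.
            # Assumption: quantities are compared as strings, so "1" and
            # "1000m" would be treated as different here.
            if key not in limits or requests.get(key, limits[key]) != limits[key]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Chart default (empty resources) versus the agent values in this patch.
print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": "200m", "memory": "512Mi"}}]))
```

With requests set and limits unset, the second call lands in Burstable,
which is the outcome this patch relies on.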

Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated
reasons (VPA recommendations + workload density). Adding ~768Mi of new
DaemonSet requests per node (agent 512Mi + envoy 128Mi + spire-agent
128Mi) will push it further; a follow-up rebalance or worker scale-up is
likely needed. Flagged in PR body.
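
The ~768Mi figure can be reproduced from the values in this patch. A
minimal sketch, assuming the agent, envoy, and spire-agent run as one pod
per node while the operator and spire-server do not (the `ADDED` table
and its per-node flags are illustrative, not read from the chart):

```python
# Requests added by this patch: (cpu millicores, memory MiB, one pod per node?)
# Assumption: the per-node flags reflect how each workload is deployed.
ADDED = {
    "cilium-agent": (200, 512, True),
    "cilium-envoy": (50, 128, True),
    "cilium-operator": (100, 256, False),
    "spire-server": (50, 128, False),
    "spire-agent": (50, 128, True),
}


def per_node_requests(added):
    """Sum the CPU and memory requests of workloads that run on every node."""
    cpu = sum(c for c, _, per_node in added.values() if per_node)
    mem = sum(m for _, m, per_node in added.values() if per_node)
    return cpu, mem


cpu_m, mem_mi = per_node_requests(ADDED)
print(f"{cpu_m}m CPU, {mem_mi}Mi memory per node")  # 300m CPU, 768Mi memory per node
```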

Recovery action (separate from this PR): once Flux has reconciled the
new resources, restart the wedged agent with
`kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`.
If prod-worker-2 doesn't recover within ~5 min, reboot the node via
talosctl / Hetzner console.

Validated with:
- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/ — clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/ — clean

Refs: #1636

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../controllers/cilium/helm-release.yaml      | 37 +++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml b/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml
index e8184ccf3..3c57fc71d 100644
--- a/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml
+++ b/k8s/bases/infrastructure/controllers/cilium/helm-release.yaml
@@ -79,6 +79,10 @@ spec:
           enabled: false
     operator:
       replicas: ${cilium_replicas:=2}
+      resources:
+        requests:
+          cpu: 100m
+          memory: 256Mi
       podDisruptionBudget:
         enabled: true
         minAvailable: 1
@@ -94,6 +98,24 @@ spec:
     ipam:
       mode: kubernetes
     kubeProxyReplacement: true
+    # ------------------------------------------------------------------
+    # Resource requests for the agent DaemonSet and the standalone
+    # cilium-envoy DaemonSet. These promote the pods out of BestEffort
+    # QoS so they survive node memory pressure; an OOMKilled cilium-agent
+    # leaves BPF state degraded and the node loses ClusterIP routing
+    # (observed cascading into ~13 workload crash-loops on prod-worker-2,
+    # 2026-05-28). Limits intentionally unset — Cilium recommends against
+    # capping the agent (https://docs.cilium.io/en/stable/operations/performance/).
+    # ------------------------------------------------------------------
+    resources:
+      requests:
+        cpu: 200m
+        memory: 512Mi
+    envoy:
+      resources:
+        requests:
+          cpu: 50m
+          memory: 128Mi
     # Transparent WireGuard encryption for all pod-to-pod and node-to-node
     # traffic.  KubeSpan (Talos-layer WireGuard between nodes) is not
     # enabled in this cluster, so without this setting inter-node pod
@@ -118,8 +140,23 @@ spec:
           install:
             namespace: kube-system
             existingNamespace: true
+            # Resource requests promote spire-server and spire-agent pods
+            # out of BestEffort QoS. cilium-agent's SPIRE Delegate API
+            # client relies on the per-node spire-agent admin socket; if
+            # spire-agent is evicted/OOMKilled, the cilium-agent on that
+            # node stays stuck retrying "SPIRE admin socket does not
+            # exist" and ClusterIP routing degrades alongside it.
+            agent:
+              resources:
+                requests:
+                  cpu: 50m
+                  memory: 128Mi
             # TODO: Remove workaround when SPIRE no longer fails to start (https://github.com/cilium/cilium/issues/40533)
             server:
+              resources:
+                requests:
+                  cpu: 50m
+                  memory: 128Mi
               initContainers:
               - command:
                 - /bin/sh

From 0bad5534c606bb04aeb02faa242b796252e25812 Mon Sep 17 00:00:00 2001
From: Nikolai Emil Damm <nikolaiemildamm@icloud.com>
Date: Fri, 29 May 2026 08:12:34 +0200
Subject: [PATCH 2/2] ci: retrigger after transient ksail-workload-validate EOF
 on bases/apps/headlamp
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previous run (26621107142) failed in System Test at the kubeconform
step with `validation failed: EOF` for `bases/apps/headlamp` — a
schema-fetch network blip, not a content failure. Headlamp is
untouched by this PR (diff is the Cilium HelmRelease only), the
manifest validates cleanly locally on this branch, and the previous
successful main-line run validated headlamp from the same files.
Empty commit to retrigger CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>