fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS by devantler · Pull Request #1649 · devantler-tech/platform

devantler · 2026-05-29T06:04:26Z

🤖 Generated by the Daily AI Assistant

Why

The prod cluster is in a cascading-failure state that is blocking the merge queue (see #1636 for the read-only diagnosis). On 2026-05-28T22:07:32Z, cilium-2z7fv on prod-worker-2 was OOMKilled (lastState.terminated.reason=Error, exitCode=137). After the agent restarted, ClusterIP routing on that node never fully recovered, and the cilium-agent is now stuck retrying:

SPIRE Delegate API Client failed to init watcher, retrying:
SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist

because the per-node spire-agent (which creates that socket) is itself crash-looping. Net result: 13 workloads on prod-worker-2 / prod-worker-1 are hard-looping with dial tcp 10.96.0.1:443: i/o timeout and dial tcp 10.96.193.18:8081: i/o timeout:

| Workload | Restarts |
| --- | --- |
| `cert-manager/cert-manager-cainjector` | 86 |
| `kubevirt/virt-handler` (prod-worker-2) | 91 |
| All 6× `kube-system/spire-agent-*` | 87–93 each |
| `kube-system/spire-server-0` | (notifier failures, can't write `kube-system/spire-bundle` ConfigMap) |
| `flux-system/kustomize-controller` | 132 |
| `flux-system/flux-operator` | 114 |
| `fleetdm/fleet` (prod-worker-1) | 157 |
| `keda/keda-add-ons-http-external-scaler` (×2) | 151 each |
| `monitoring/kube-prometheus-stack-kube-state-metrics` | 146 |
| `cert-manager/trust-manager` | 87 |
| `cert-manager/origin-ca-issuer` | 87 |
| `longhorn-system/csi-provisioner` | 89 |

Restart concentration by node: prod-worker-2 = 1072, prod-worker-1 = 400, everyone else ≈ 90 (spire-agent only).

Root cause

The upstream Cilium chart ships every component with resources: {}. With no requests or limits set on any container, Kubernetes assigns all of these pods the BestEffort QoS class:

- cilium-2z7fv (the agent itself)
- 6× cilium-envoy-* DaemonSet
- 2× cilium-operator-*
- 6× spire-agent-*
- spire-server-0

prod-worker-2 sits at 98% memory request commitment (7.45 GB allocatable, 72% actual usage). BestEffort is first in line for OOMKill — the agent gets killed, BPF state on the node never converges, the spire-agent on the node can't recover its admin socket, and cilium-agent is stuck in the SPIRE init-watcher retry loop. Everything that talks to a ClusterIP via that node ends up in CrashLoopBackOff.
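
As a rough illustration of why adding requests changes the QoS class, the fragments below show three container `resources` blocks and the class Kubernetes would assign to a single-container pod in each case. The values are placeholders chosen for illustration, not taken from the PR diff.

```yaml
# Illustrative container-level fragments (hypothetical values).

# Case 1: no requests and no limits on any container -> QoS class BestEffort.
# This is what an empty chart default produces.
resources: {}
---
# Case 2: at least one request or limit set, but not the Guaranteed
# shape below -> QoS class Burstable.
resources:
  requests:
    cpu: 100m
    memory: 64Mi
---
# Case 3: every container sets CPU and memory limits equal to its
# requests -> QoS class Guaranteed. Shown only for contrast.
resources:
  requests:
    cpu: 100m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 64Mi
```

Under node memory pressure, BestEffort pods are the first candidates for eviction and OOM kill, which is why Case 1 is the risky default for a node-critical agent.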

What this PR does

Adds explicit requests to the five components so they get Burstable QoS instead of BestEffort. Limits intentionally unset — Cilium recommends against capping the agent.

| Component | CPU req | Mem req | Notes |
| --- | --- | --- | --- |
| `resources` (agent DaemonSet) | 200m | 512Mi | Observed steady-state ≈ 165m / 340Mi |
| `envoy.resources` (cilium-envoy DaemonSet) | 50m | 128Mi | Modest |
| `operator.resources` | 100m | 256Mi | 2 replicas, control-plane-pinned |
| `authentication.mutual.spire.install.server.resources` | 50m | 128Mi | Single replica |
| `authentication.mutual.spire.install.agent.resources` | 50m | 128Mi | DaemonSet |
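
For readers who want to see the shape of the change, here is a sketch of the Helm values these five paths would produce. The key paths and quantities come from the component list above; the actual diff is not reproduced here, and where this fragment sits in the repository (presumably under the HelmRelease's `spec.values`) is an assumption.

```yaml
# Sketch of Cilium chart values; requests only, limits deliberately omitted.
resources:                  # cilium-agent DaemonSet
  requests:
    cpu: 200m
    memory: 512Mi
envoy:
  resources:                # cilium-envoy DaemonSet
    requests:
      cpu: 50m
      memory: 128Mi
operator:
  resources:                # cilium-operator (2 replicas)
    requests:
      cpu: 100m
      memory: 256Mi
authentication:
  mutual:
    spire:
      install:
        server:
          resources:        # spire-server (single replica)
            requests:
              cpu: 50m
              memory: 128Mi
        agent:
          resources:        # spire-agent DaemonSet
            requests:
              cpu: 50m
              memory: 128Mi
```

Because no `limits` key appears anywhere, each pod should land in Burstable rather than Guaranteed, matching the stated intent not to cap the agent.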

Per-node delta from this PR: ~768Mi memory and 300m CPU in DaemonSet requests on every worker (agent 512Mi/200m + envoy 128Mi/50m + spire-agent 128Mi/50m), plus the operator and spire-server requests on the nodes that host those pods.

What this PR does NOT do

Manual operator step required after merge — Flux applying the new requests does NOT clear the wedged BPF state on prod-worker-2. After the new HelmRelease reconciles:

kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv

The DaemonSet will recreate it with the new resource block and rebuild BPF state cleanly. If prod-worker-2 doesn't recover within ~5 minutes (no improvement in restart counts on workloads scheduled there), reboot the node via Hetzner console / talosctl reboot.

Follow-up risk to flag for the maintainer

prod-worker-2 was already at 98% memory request commitment before this PR. Adding ~768Mi/node of new DaemonSet requests will push it past 100%, which is likely to leave one or more pods unschedulable until either:

- VPA recommendations get re-evaluated (some may be over-recommending), or
- the worker pool is scaled up / a worker is rebalanced.

This is explicitly out of scope for this fix per the task brief — flagging here so the maintainer can decide whether a cluster update to expand the pool needs to ride with this PR. The status quo (BestEffort, periodic OOMKill, cluster-wide cascade) is materially worse than "tight scheduling headroom" until that's resolved.

Validation

Static-only (no cluster mutation per AGENTS.md):

- ksail workload validate → ✔ 256 files validated
- ksail --config ksail.prod.yaml workload validate → ✔ 256 files validated
- kubectl kustomize k8s/clusters/local/ → ok
- kubectl kustomize k8s/clusters/prod/ → ok
- kubectl kustomize k8s/providers/hetzner/infrastructure/controllers/ → ok (rendered Cilium HelmRelease has all five resource blocks at the expected paths)
- kubectl kustomize k8s/providers/docker/infrastructure/controllers/ → ok (SPIRE resources rendered but inert because spire.enabled: false in the docker overlay; agent/envoy/operator requests apply identically)

Refs

- Diagnosis: fix(apps): add startupProbe to homepage, headlamp, actual-budget #1636
- Memory: cilium-besteffort-besteffort-oom-cascade (private)
- Upstream guidance: https://docs.cilium.io/en/stable/operations/performance/

…ut of BestEffort QoS

Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in BestEffort QoS by default: the upstream chart leaves `resources:` empty everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request commitment) OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP routing on that node degraded, then got stuck retrying `SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist` because the spire-agent DaemonSet pod for the node was also BestEffort and crash-looping. ~13 workloads cascaded into i/o timeout against `10.96.0.1:443` and `10.96.193.18:8081` (cert-manager-cainjector, virt-handler, all spire-agents, spire-server, kustomize-controller, flux-operator, fleet, keda http external scaler, kube-state-metrics, trust-manager, origin-ca-issuer, csi-provisioner).

Add explicit requests to:
- `resources` (agent DaemonSet): 200m / 512Mi (observed steady-state ~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet): 50m / 128Mi
- `operator.resources`: 100m / 256Mi
- `authentication.mutual.spire.install.server.resources`: 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources`: 50m / 128Mi

All five pods are now Burstable instead of BestEffort, so they're no longer first in line for kubelet eviction / OOMKill under node memory pressure. Limits intentionally unset: Cilium recommends against capping the agent.

Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated reasons (VPA recommendations + workload density). Adding ~768Mi of new DaemonSet requests per node will tip it further; a follow-up rebalance or worker scale-up is likely needed. Flagged in PR body.

Recovery action (separate from this PR): once Flux has reconciled the new resources, restart the wedged agent with `kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`. If prod-worker-2 doesn't recover within ~5 min, reboot the node via talosctl / Hetzner console.

Validated with:
- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/: clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/: clean

Refs: #1636

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the shared Cilium HelmRelease to add resource requests for Cilium and bundled SPIRE components, moving critical networking/authentication agents out of BestEffort QoS to reduce OOM-related cascading failures.

Changes:

- Adds CPU/memory requests for the Cilium operator, agent, and cilium-envoy.
- Adds CPU/memory requests for bundled SPIRE agent and server.
- Keeps limits unset, consistent with the stated intent to avoid capping Cilium agent performance.

…ps/headlamp

Previous run (26621107142) failed in System Test at the kubeconform step with `validation failed: EOF` for `bases/apps/headlamp`, which was a schema-fetch network blip, not a content failure. Headlamp is untouched by this PR (diff is the Cilium HelmRelease only), the manifest validates cleanly locally on this branch, and the previous successful main-line run validated headlamp from the same files. Empty commit to retrigger CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

botantler · 2026-05-29T12:58:55Z

🎉 This PR is included in version 1.12.4 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Copilot AI review requested due to automatic review settings May 29, 2026 06:04

github-project-automation Bot added this to 🌊 Project Board May 29, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 29, 2026

Copilot started reviewing on behalf of devantler May 29, 2026 06:04

devantler had a problem deploying to ci May 29, 2026 06:04 — with GitHub Actions Failure

devantler marked this pull request as ready for review May 29, 2026 06:05

devantler enabled auto-merge May 29, 2026 06:05

Copilot AI reviewed May 29, 2026

botantler Bot approved these changes May 29, 2026

devantler temporarily deployed to ci May 29, 2026 06:12 — with GitHub Actions Inactive

devantler added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

devantler merged commit f281197 into main May 29, 2026
9 checks passed

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 29, 2026

devantler deleted the claude/suspicious-turing-869383 branch May 29, 2026 12:58

botantler Bot added the released label May 29, 2026

devantler merged 2 commits into main from claude/suspicious-turing-869383
