fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS #1649
Merged
…ut of BestEffort QoS
Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in
BestEffort QoS by default — the upstream chart leaves `resources:` empty
everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request commitment)
OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP routing on
that node degraded, then got stuck retrying `SPIRE admin socket
(/run/spire/sockets/admin.sock) does not exist` because the spire-agent
DaemonSet pod for the node was also BestEffort and crash-looping. ~13
workloads cascaded into i/o timeout against `10.96.0.1:443` and
`10.96.193.18:8081` (cert-manager-cainjector, virt-handler, all
spire-agents, spire-server, kustomize-controller, flux-operator, fleet,
keda http external scaler, kube-state-metrics, trust-manager,
origin-ca-issuer, csi-provisioner).
Add explicit requests to:
- `resources` (agent DaemonSet) — 200m / 512Mi (observed steady-state
~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet) — 50m / 128Mi
- `operator.resources` — 100m / 256Mi
- `authentication.mutual.spire.install.server.resources` — 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources` — 50m / 128Mi
All five pods are now Burstable instead of BestEffort, so they're no
longer first in line for kubelet eviction / OOMKill under node memory
pressure. Limits intentionally unset — Cilium recommends against capping
the agent.
Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated
reasons (VPA recommendations + workload density). Adding ~768Mi of new
DaemonSet requests per node will tip it further; a follow-up rebalance
or worker scale-up is likely needed. Flagged in PR body.
Recovery action (separate from this PR): once Flux has reconciled the
new resources, restart the wedged agent with
`kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`.
If prod-worker-2 doesn't recover within ~5 min, reboot the node via
talosctl / Hetzner console.
Validated with:
- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/ — clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/ — clean
Refs: #1636
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the shared Cilium HelmRelease to add resource requests for Cilium and bundled SPIRE components, moving critical networking/authentication agents out of BestEffort QoS to reduce OOM-related cascading failures.
Changes:
- Adds CPU/memory requests for the Cilium operator, agent, and cilium-envoy.
- Adds CPU/memory requests for bundled SPIRE agent and server.
- Keeps limits unset, consistent with the stated intent to avoid capping Cilium agent performance.
…ps/headlamp
Previous run (26621107142) failed in System Test at the kubeconform step with `validation failed: EOF` for `bases/apps/headlamp`, which was a schema-fetch network blip rather than a content failure. Headlamp is untouched by this PR (the diff is the Cilium HelmRelease only), the manifest validates cleanly locally on this branch, and the previous successful main-line run validated headlamp from the same files.
Empty commit to retrigger CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🎉 This PR is included in version 1.12.4 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
This was referenced May 29, 2026
Why
Prod cluster is in a cascading-failure state that's blocking the merge queue (see #1636 for the read-only diagnosis). On 2026-05-28T22:07:32Z, `cilium-2z7fv` on `prod-worker-2` was OOMKilled (`lastState.terminated.reason=Error exitCode=137`). After the agent restart, ClusterIP routing on that node never fully recovered. The cilium-agent is now stuck retrying `SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist` because the per-node `spire-agent` (which creates that socket) is itself crash-looping. Net result: 13 workloads on prod-worker-2 / prod-worker-1 are hard-looping with `dial tcp 10.96.0.1:443: i/o timeout` and `dial tcp 10.96.193.18:8081: i/o timeout`:
- `cert-manager/cert-manager-cainjector`
- `kubevirt/virt-handler` (prod-worker-2)
- `kube-system/spire-agent-*`
- `kube-system/spire-server-0` (`kube-system/spire-bundle` ConfigMap)
- `flux-system/kustomize-controller`
- `flux-system/flux-operator`
- `fleetdm/fleet` (prod-worker-1)
- `keda/keda-add-ons-http-external-scaler` (×2)
- `monitoring/kube-prometheus-stack-kube-state-metrics`
- `cert-manager/trust-manager`
- `cert-manager/origin-ca-issuer`
- `longhorn-system/csi-provisioner`
Restart concentration by node: prod-worker-2 = 1072, prod-worker-1 = 400, everyone else ≈ 90 (spire-agent only).
Root cause
The upstream Cilium chart ships every component with `resources: {}`. With no requests, the kubelet places all of these in BestEffort QoS:
- `cilium-2z7fv` (the agent itself)
- `cilium-envoy-*` DaemonSet
- `cilium-operator-*`
- `spire-agent-*`
- `spire-server-0`
prod-worker-2 sits at 98% memory request commitment (7.45 GB allocatable, 72% actual usage). BestEffort is first in line for OOMKill: the agent gets killed, BPF state on the node never converges, the spire-agent on the node can't recover its admin socket, and cilium-agent is stuck in the SPIRE init-watcher retry loop. Everything that talks to a ClusterIP via that node ends up in CrashLoopBackOff.
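To make the QoS rule behind this root cause concrete, here is a minimal sketch of two pods. The pod names and images are hypothetical and not taken from this repo. The request values in the second pod simply reuse the envoy/SPIRE figures from this PR.

```yaml
# Pod A: no container sets any CPU/memory request or limit.
# Kubernetes classifies the pod as BestEffort (status.qosClass: BestEffort).
apiVersion: v1
kind: Pod
metadata:
  name: example-besteffort            # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example/app:1.0 # hypothetical image
      resources: {}
---
# Pod B: a request is set but no matching limits.
# The pod is neither BestEffort nor Guaranteed, so it is Burstable.
apiVersion: v1
kind: Pod
metadata:
  name: example-burstable             # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example/app:1.0 # hypothetical image
      resources:
        requests:
          cpu: 50m
          memory: 128Mi
```

The resulting class is reported in the pod's `status.qosClass` field, so it can be checked on the live pods once the change has rolled out.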
What this PR does
Adds explicit `requests` to the five components so they get Burstable QoS instead of BestEffort. Limits intentionally unset, since Cilium recommends against capping the agent. Requests are CPU / memory:
- `resources` (agent DaemonSet): 200m / 512Mi
- `envoy.resources` (cilium-envoy DaemonSet): 50m / 128Mi
- `operator.resources`: 100m / 256Mi
- `authentication.mutual.spire.install.server.resources`: 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources`: 50m / 128Mi
Per-node delta from this PR: ~768Mi memory + 300m CPU in DaemonSet requests on every worker (agent + envoy + spire-agent), plus operator/spire-server on a couple of nodes.
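As a rough illustration, the five value paths listed above could be laid out like this in the HelmRelease. This is a sketch, not the actual diff: it assumes the values are set inline under `spec.values` (the repo may wire them differently), and it uses the request figures from the commit message.

```yaml
# Sketch of the Cilium HelmRelease values; assumes inline spec.values.
# Only requests are set; limits are deliberately left out.
spec:
  values:
    resources:                  # cilium agent DaemonSet
      requests:
        cpu: 200m
        memory: 512Mi
    envoy:
      resources:                # standalone cilium-envoy DaemonSet
        requests:
          cpu: 50m
          memory: 128Mi
    operator:
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
    authentication:
      mutual:
        spire:
          install:
            server:
              resources:
                requests:
                  cpu: 50m
                  memory: 128Mi
            agent:
              resources:
                requests:
                  cpu: 50m
                  memory: 128Mi
```

The three DaemonSet entries (agent, envoy, spire-agent) account for the per-node figure: 512Mi + 128Mi + 128Mi = 768Mi and 200m + 50m + 50m = 300m.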
What this PR does NOT do
Manual operator step required after merge: Flux applying the new requests does NOT clear the wedged BPF state on prod-worker-2. After the new HelmRelease reconciles, delete the wedged agent pod with `kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`. The DaemonSet will recreate it with the new resource block and rebuild BPF state cleanly. If prod-worker-2 doesn't recover within ~5 minutes (no improvement in restart counts on workloads scheduled there), reboot the node via Hetzner console / `talosctl reboot`.
Follow-up risk to flag for the maintainer
prod-worker-2 was already at 98% memory request commitment before this PR. Adding ~768Mi/node of new DaemonSet requests will push it past 100%, which is likely to leave one or more pods unschedulable until either workloads are rebalanced or the worker pool is scaled up.
This is explicitly out of scope for this fix per the task brief. It is flagged here so the maintainer can decide whether a `cluster update` to expand the pool needs to ride with this PR. The status quo (BestEffort, periodic OOMKill, cluster-wide cascade) is materially worse than "tight scheduling headroom" until that's resolved.
Validation
Static-only (no cluster mutation per AGENTS.md):
- `ksail workload validate` → ✔ 256 files validated
- `ksail --config ksail.prod.yaml workload validate` → ✔ 256 files validated
- `kubectl kustomize k8s/clusters/local/` → ok
- `kubectl kustomize k8s/clusters/prod/` → ok
- `kubectl kustomize k8s/providers/hetzner/infrastructure/controllers/` → ok (rendered Cilium HelmRelease has all five resource blocks at the expected paths)
- `kubectl kustomize k8s/providers/docker/infrastructure/controllers/` → ok (SPIRE resources rendered but inert because `spire.enabled: false` in the docker overlay; agent/envoy/operator requests apply identically)
Refs
cilium-besteffort-besteffort-oom-cascade (private)