fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS #1649

Merged: devantler merged 2 commits into main from claude/suspicious-turing-869383 on May 29, 2026

@devantler

🤖 Generated by the Daily AI Assistant

Why

Prod cluster is in a cascading-failure state that's blocking the merge queue (see #1636 for the read-only diagnosis). On 2026-05-28T22:07:32Z cilium-2z7fv on prod-worker-2 was OOMKilled (lastState.terminated.reason=Error exitCode=137). After the agent restart, ClusterIP routing on that node never fully recovered — the cilium-agent is now stuck retrying:

SPIRE Delegate API Client failed to init watcher, retrying:
SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist

because the per-node spire-agent (which creates that socket) is itself crash-looping. Net result: 13 workloads on prod-worker-2 / prod-worker-1 are hard-looping with dial tcp 10.96.0.1:443: i/o timeout and dial tcp 10.96.193.18:8081: i/o timeout:

| Workload | Restarts |
| --- | --- |
| cert-manager/cert-manager-cainjector | 86 |
| kubevirt/virt-handler (prod-worker-2) | 91 |
| All 6× kube-system/spire-agent-* | 87–93 each |
| kube-system/spire-server-0 | (notifier failures, can't write kube-system/spire-bundle ConfigMap) |
| flux-system/kustomize-controller | 132 |
| flux-system/flux-operator | 114 |
| fleetdm/fleet (prod-worker-1) | 157 |
| keda/keda-add-ons-http-external-scaler (×2) | 151 each |
| monitoring/kube-prometheus-stack-kube-state-metrics | 146 |
| cert-manager/trust-manager | 87 |
| cert-manager/origin-ca-issuer | 87 |
| longhorn-system/csi-provisioner | 89 |

Restart concentration by node: prod-worker-2 = 1072, prod-worker-1 = 400, everyone else ≈ 90 (spire-agent only).

Root cause

The upstream Cilium chart ships every component with resources: {}. With neither requests nor limits set, all of these pods fall into the BestEffort QoS class:

  • cilium-2z7fv (the agent itself)
  • cilium-envoy-* DaemonSet
  • cilium-operator-*
  • spire-agent-*
  • spire-server-0

prod-worker-2 sits at 98% memory request commitment (7.45 GB allocatable, 72% actual usage). BestEffort is first in line for OOMKill — the agent gets killed, BPF state on the node never converges, the spire-agent on the node can't recover its admin socket, and cilium-agent is stuck in the SPIRE init-watcher retry loop. Everything that talks to a ClusterIP via that node ends up in CrashLoopBackOff.
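As a rough illustration of why an empty `resources: {}` lands a pod in BestEffort while requests-only values make it Burstable, here is a minimal Python sketch of the QoS rules as described in the Kubernetes documentation. The function name `qos_class` and the dict layout are hypothetical, and the sketch is simplified: it compares quantities as plain strings and ignores details such as init containers.

```python
def qos_class(containers):
    """Classify a pod from a list of per-container `resources` dicts (simplified)."""
    has_any = False
    guaranteed = True
    for res in containers:
        requests = res.get("requests", {})
        limits = res.get("limits", {})
        if requests or limits:
            has_any = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # A limit with no explicit request defaults the request to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Chart default: empty resources block.
print(qos_class([{}]))  # BestEffort
# Requests only, no limits (the shape this PR uses for the agent).
print(qos_class([{"requests": {"cpu": "200m", "memory": "512Mi"}}]))  # Burstable
```

Only pods whose containers all have CPU and memory limits equal to their requests reach Guaranteed, which is why leaving limits unset yields Burstable rather than Guaranteed.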

What this PR does

Adds explicit requests to the five components so they get Burstable QoS instead of BestEffort. Limits intentionally unset — Cilium recommends against capping the agent.

| Component | CPU req | Mem req | Notes |
| --- | --- | --- | --- |
| `resources` (agent DaemonSet) | 200m | 512Mi | Observed steady-state ≈ 165m / 340Mi |
| `envoy.resources` (cilium-envoy DaemonSet) | 50m | 128Mi | Modest |
| `operator.resources` | 100m | 256Mi | 2 replicas, control-plane-pinned |
| `authentication.mutual.spire.install.server.resources` | 50m | 128Mi | Single replica |
| `authentication.mutual.spire.install.agent.resources` | 50m | 128Mi | DaemonSet |

Per-node delta from this PR: ~ 768Mi memory + 300m CPU in DaemonSet requests on every worker (agent + envoy + spire-agent), plus operator/spire-server on a couple of nodes.
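The per-node figure can be reproduced from the requests listed for each Helm value path. The following is a small Python sketch; the names `REQUESTS` and `PER_WORKER` are illustrative and not part of the chart or the repository.

```python
# Requests added by this PR as (CPU millicores, memory MiB), keyed by Helm value path.
REQUESTS = {
    "resources": (200, 512),        # cilium agent DaemonSet
    "envoy.resources": (50, 128),   # cilium-envoy DaemonSet
    "operator.resources": (100, 256),
    "authentication.mutual.spire.install.server.resources": (50, 128),
    "authentication.mutual.spire.install.agent.resources": (50, 128),
}

# Only the three DaemonSets are scheduled on every worker.
PER_WORKER = (
    "resources",
    "envoy.resources",
    "authentication.mutual.spire.install.agent.resources",
)

cpu_m = sum(REQUESTS[path][0] for path in PER_WORKER)
mem_mi = sum(REQUESTS[path][1] for path in PER_WORKER)
print(f"{cpu_m}m CPU, {mem_mi}Mi memory per worker")  # 300m CPU, 768Mi memory per worker
```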

What this PR does NOT do

Manual operator step required after merge — Flux applying the new requests does NOT clear the wedged BPF state on prod-worker-2. After the new HelmRelease reconciles:

kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv

The DaemonSet will recreate it with the new resource block and rebuild BPF state cleanly. If prod-worker-2 doesn't recover within ~5 minutes (no improvement in restart counts on workloads scheduled there), reboot the node via Hetzner console / talosctl reboot.

Follow-up risk to flag for the maintainer

prod-worker-2 was already at 98% memory request commitment before this PR. Adding ~768Mi/node of new DaemonSet requests will push it past 100%, which is likely to leave one or more pods unschedulable until either:

  • VPA recommendations get re-evaluated (some may be over-recommending), or
  • the worker pool is scaled up / a worker is rebalanced.

This is explicitly out of scope for this fix per the task brief — flagging here so the maintainer can decide whether a cluster update to expand the pool needs to ride with this PR. The status quo (BestEffort, periodic OOMKill, cluster-wide cascade) is materially worse than "tight scheduling headroom" until that's resolved.
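The "past 100%" estimate can be checked with rough arithmetic. This Python sketch assumes the 7.45 GB allocatable figure is in GiB (a decimal-GB reading gives a slightly higher result) and counts only the three DaemonSet requests, not any operator or spire-server replica that may land on the node.

```python
ALLOCATABLE_GIB = 7.45    # from the PR text; reading it as GiB is an assumption
COMMITTED_BEFORE = 0.98   # memory request commitment before this PR
ADDED_MI = 768            # new DaemonSet memory requests per worker

allocatable_mi = ALLOCATABLE_GIB * 1024
projected = COMMITTED_BEFORE + ADDED_MI / allocatable_mi
print(f"projected commitment: {projected:.0%}")  # projected commitment: 108%
```

Under these assumptions the new requests alone exceed the remaining headroom on prod-worker-2, consistent with the expectation that some pods there will become unschedulable.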

Validation

Static-only (no cluster mutation per AGENTS.md):

  • ksail workload validate → ✔ 256 files validated
  • ksail --config ksail.prod.yaml workload validate → ✔ 256 files validated
  • kubectl kustomize k8s/clusters/local/ → ok
  • kubectl kustomize k8s/clusters/prod/ → ok
  • kubectl kustomize k8s/providers/hetzner/infrastructure/controllers/ → ok (rendered Cilium HelmRelease has all five resource blocks at the expected paths)
  • kubectl kustomize k8s/providers/docker/infrastructure/controllers/ → ok (SPIRE resources rendered but inert because spire.enabled: false in the docker overlay; agent/envoy/operator requests apply identically)

Refs

…ut of BestEffort QoS

Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in
BestEffort QoS by default — the upstream chart leaves `resources:` empty
everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request commitment)
OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP routing on
that node degraded, then got stuck retrying `SPIRE admin socket
(/run/spire/sockets/admin.sock) does not exist` because the spire-agent
DaemonSet pod for the node was also BestEffort and crash-looping. ~13
workloads cascaded into i/o timeout against `10.96.0.1:443` and
`10.96.193.18:8081` (cert-manager-cainjector, virt-handler, all
spire-agents, spire-server, kustomize-controller, flux-operator, fleet,
keda http external scaler, kube-state-metrics, trust-manager,
origin-ca-issuer, csi-provisioner).

Add explicit requests to:
- `resources` (agent DaemonSet) — 200m / 512Mi (observed steady-state
  ~165m / 340Mi)
- `envoy.resources` (standalone cilium-envoy DaemonSet) — 50m / 128Mi
- `operator.resources` — 100m / 256Mi
- `authentication.mutual.spire.install.server.resources` — 50m / 128Mi
- `authentication.mutual.spire.install.agent.resources` — 50m / 128Mi

All five pods are now Burstable instead of BestEffort, so they're no
longer first in line for kubelet eviction / OOMKill under node memory
pressure. Limits intentionally unset — Cilium recommends against capping
the agent.

Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated
reasons (VPA recommendations + workload density). Adding ~768Mi of new
DaemonSet requests per node will tip it further; a follow-up rebalance
or worker scale-up is likely needed. Flagged in PR body.

Recovery action (separate from this PR): once Flux has reconciled the
new resources, restart the wedged agent with
`kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`.
If prod-worker-2 doesn't recover within ~5 min, reboot the node via
talosctl / Hetzner console.

Validated with:
- ksail workload validate (256 files ok)
- ksail --config ksail.prod.yaml workload validate (256 files ok)
- kubectl kustomize k8s/clusters/{local,prod}/ — clean
- kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/ — clean

Refs: #1636

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 06:04
@devantler devantler marked this pull request as ready for review May 29, 2026 06:05
@devantler devantler enabled auto-merge May 29, 2026 06:05

Copilot AI left a comment

Pull request overview

This PR updates the shared Cilium HelmRelease to add resource requests for Cilium and bundled SPIRE components, moving critical networking/authentication agents out of BestEffort QoS to reduce OOM-related cascading failures.

Changes:

  • Adds CPU/memory requests for the Cilium operator, agent, and cilium-envoy.
  • Adds CPU/memory requests for bundled SPIRE agent and server.
  • Keeps limits unset, consistent with the stated intent to avoid capping Cilium agent performance.

…ps/headlamp

Previous run (26621107142) failed in System Test at the kubeconform
step with `validation failed: EOF` for `bases/apps/headlamp` — a
schema-fetch network blip, not a content failure. Headlamp is
untouched by this PR (diff is the Cilium HelmRelease only), the
manifest validates cleanly locally on this branch, and the previous
successful main-line run validated headlamp from the same files.
Empty commit to retrigger CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@devantler devantler added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@devantler devantler merged commit f281197 into main May 29, 2026
9 checks passed
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 29, 2026
@devantler devantler deleted the claude/suspicious-turing-869383 branch May 29, 2026 12:58
@botantler

botantler Bot commented May 29, 2026

🎉 This PR is included in version 1.12.4 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
