fix(apps): add startupProbe to homepage, headlamp, actual-budget#1636
fix(apps): add startupProbe to homepage, headlamp, actual-budget#1636devantler wants to merge 4 commits into
Conversation
Each of these charts hardcodes liveness/readiness with Kubernetes defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3) and does not template a startupProbe. On a cold start the container takes ~10–13s to begin serving HTTP, so each pod creation logs 1–3 'Unhealthy' Warning events and leaves only ~17s of headroom before the liveness restart fires. Add a strategic-merge startupProbe via the existing postRenderer block (60s startup window, 2s period/timeout) so liveness/readiness are gated until the container is actually serving. No change to liveness/readiness — once startup succeeds the tight defaults are fine on a warm pod. Observed (prod, 14:59–15:44 UTC on 2026-05-28): - homepage: 3 Unhealthy events per rollout pod (5 pods affected) - actual-budget: 2 events at cold start - headlamp: 1 event per KEDA scale-from-zero Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR reduces noisy cold-start Unhealthy events for selected applications by adding startupProbe patches through existing Flux HelmRelease post-renderers.
Changes:
- Adds a 60s HTTP startup window for
homepage. - Adds the same startup gating for KEDA-scaled
headlamp. - Adds startup gating for
actual-budgetwhile leaving existing liveness/readiness behavior unchanged.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
k8s/bases/apps/homepage/helm-release.yaml |
Adds a JSON6902 post-renderer patch for the main container startup probe. |
k8s/bases/apps/headlamp/helm-release.yaml |
Adds a startup probe to the main Headlamp container without affecting the plugin sidecar. |
k8s/bases/apps/actual-budget/helm-release.yaml |
Adds a startup probe to the rendered Actual Budget Deployment. |
The original probe (periodSeconds: 2, failureThreshold: 30, no initialDelay) silenced cold-start liveness/readiness *restarts* but not the underlying "Unhealthy" Warning events — kubelet emits the same event for startup, liveness, and readiness probe failures, and the 2s period generates 5-7 failures during the ~13s cold start instead of the chart-default 1-3 (periodSeconds: 10). Merge-queue deploy of #1636 failed the check-event-warnings action, which records a marker post-reconcile and fails if any Warning event has lastTimestamp within a 90s settle window. The rollout these patches force created new pods during that window; their startup probes fired every 2s during cold start; their events landed past the marker. Set initialDelaySeconds: 20 (past the observed ~13s cold start) and periodSeconds: 5 so the first probe lands on a serving container. Zero failure events on a normal rollout; failureThreshold: 12 leaves 60s of grace if a container is unusually slow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge-queue failure analysis & fixThe merge-queue's Root cause — own goal:
Fix (19fb2b3): startupProbe:
httpGet: { path: /, port: http }
initialDelaySeconds: 20 # past the observed ~13s cold start
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 12 # 60s grace beyond initialDelay = 80s totalThe first probe lands on a serving container, so the probe never fails on a normal rollout — zero events in the settle window. Trade-off: pods are NotReady for ~20 s instead of ~13 s on cold start (rollouts take 7 s longer per pod), since
|
# Conflicts: # k8s/bases/apps/homepage/helm-release.yaml
Merge-queue failure analysis: prod cluster degraded, NOT a PR regressionPR #1636 only modifies three app The three flagged events:
What's actually brokenRoot pattern: pods crash-loop trying to reach the kube-apiserver ClusterIP ( Smoking gun:
|
| Node | Total restarts (>10) |
|---|---|
| prod-worker-2 | 1072 |
| prod-worker-1 | 400 |
| prod-control-plane-{1,2,3} / prod-worker-3 | ~90 each |
— and cilium-2z7fv on prod-worker-2 was OOMKilled at 2026-05-28T22:07:32Z (exit 137). Spire-server's apiserver-notifier failures begin at exactly that timestamp. After the Cilium agent restart, ClusterIP routing on prod-worker-2 hasn't fully recovered — pods on that node keep timing out on Service IPs.
Why Cilium got OOMed
prod-worker-2 resource state:
memory 7453078434 (98%) 33132Mi (457%) ← requests / limits
cpu 3921m (99%) 32150m (813%)
usage 5289Mi (72%) — top node
98% memory requests committed, no slack. And 25 critical infra pods are QoS: BestEffort — including all 6 spire-agents, the SPIRE server, both Cilium operators, all 6 cilium-envoy DaemonSet pods, and the HCloud CSI controller/nodes. BestEffort is first to evict under memory pressure, and Cilium itself ships with resources: {} in the HelmRelease values. When prod-worker-2 hit memory pressure, the kernel killed the highest-RSS BestEffort process — cilium-agent — and the BPF state hasn't fully reconciled since.
Cascade
prod-worker-2 saturated → cilium-2z7fv OOM → BPF maps degraded
→ pods on prod-worker-2 can't reach Service ClusterIPs (10.96.0.1, 10.96.193.18)
→ spire-agent on prod-worker-2 can't attest → no /run/spire/sockets/admin.sock on host
→ cilium-2z7fv now logging continuously: "SPIRE Delegate API Client failed to init"
→ cert-manager / kubevirt / fleet / flux all crash-loop their apiserver-bound clients
What this PR should do: nothing
The probe changes here are sound and the merge-queue check is correctly surfacing a real prod incident. The right place to address this is a dedicated PR (see follow-up task spawned in the parent session) that:
- Sets explicit
requests(and optionallimits) on Cilium agent, operator, envoy, and on SPIRE agent/server — promote them out of BestEffort QoS so the kernel doesn't pick them when prod-worker-2 saturates. - Investigates prod-worker-2's high allocation (98% requests committed) and rebalances if needed.
- Operator follow-up to clear the stuck BPF state on prod-worker-2:
kubectl delete pod -n kube-system cilium-2z7fv(Cilium DaemonSet recreates it) or a node reboot.
I will not modify PR #1636 to add those fixes — mixing scopes would make the rollback story for a probe-tuning change much worse if any of the Cilium changes turn out to need iteration.
…ut of BestEffort QoS (#1649) * fix(cilium,spire): set resource requests to promote critical agents out of BestEffort QoS Cilium agent, operator, envoy, and the embedded SPIRE server/agent run in BestEffort QoS by default — the upstream chart leaves `resources:` empty everywhere. On 2026-05-28 prod-worker-2 (at 98% memory request commitment) OOMKilled `cilium-2z7fv`; the restarted agent left ClusterIP routing on that node degraded, then got stuck retrying `SPIRE admin socket (/run/spire/sockets/admin.sock) does not exist` because the spire-agent DaemonSet pod for the node was also BestEffort and crash-looping. ~13 workloads cascaded into i/o timeout against `10.96.0.1:443` and `10.96.193.18:8081` (cert-manager-cainjector, virt-handler, all spire-agents, spire-server, kustomize-controller, flux-operator, fleet, keda http external scaler, kube-state-metrics, trust-manager, origin-ca-issuer, csi-provisioner). Add explicit requests to: - `resources` (agent DaemonSet) — 200m / 512Mi (observed steady-state ~165m / 340Mi) - `envoy.resources` (standalone cilium-envoy DaemonSet) — 50m / 128Mi - `operator.resources` — 100m / 256Mi - `authentication.mutual.spire.install.server.resources` — 50m / 128Mi - `authentication.mutual.spire.install.agent.resources` — 50m / 128Mi All five pods are now Burstable instead of BestEffort, so they're no longer first in line for kubelet eviction / OOMKill under node memory pressure. Limits intentionally unset — Cilium recommends against capping the agent. Out of scope: prod-worker-2 sits at 98% memory commitment for unrelated reasons (VPA recommendations + workload density). Adding ~768Mi of new DaemonSet requests per node will tip it further; a follow-up rebalance or worker scale-up is likely needed. Flagged in PR body. Recovery action (separate from this PR): once Flux has reconciled the new resources, restart the wedged agent with `kubectl --context=admin@prod delete pod -n kube-system cilium-2z7fv`. If prod-worker-2 doesn't recover within ~5 min, reboot the node via talosctl / Hetzner console. Validated with: - ksail workload validate (256 files ok) - ksail --config ksail.prod.yaml workload validate (256 files ok) - kubectl kustomize k8s/clusters/{local,prod}/ — clean - kubectl kustomize k8s/providers/{docker,hetzner}/infrastructure/controllers/ — clean Refs: #1636 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger after transient ksail-workload-validate EOF on bases/apps/headlamp Previous run (26621107142) failed in System Test at the kubeconform step with `validation failed: EOF` for `bases/apps/headlamp` — a schema-fetch network blip, not a content failure. Headlamp is untouched by this PR (diff is the Cilium HelmRelease only), the manifest validates cleanly locally on this branch, and the previous successful main-line run validated headlamp from the same files. Empty commit to retrigger CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background
Prod-cluster warning-event investigation surfaced 14
Unhealthyevents in a recent window, all of which mapped to cold-start probe failures on container creation — not shutdown races, not real downtime. Each affected chart hardcodeslivenessProbe/readinessProbewith Kubernetes defaults (timeoutSeconds: 1,periodSeconds: 10,failureThreshold: 3) and ships nostartupProbe. Containers take ~10–13 s to begin serving, so each pod creation logs 1–3Unhealthywarnings and leaves only ~17 s of headroom before the liveness restart would fire.Observed (prod, 14:59–15:44 UTC 2026-05-28):
homepage: 3Unhealthyevents per pod, 5 pods affected during one rolloutactual-budget: 2 events at cold startheadlamp: 1 event per KEDA scale-from-zero (recurring)Fix
Add a
startupProbeto each Deployment via the existingpostRendererblock:Total startup budget before liveness takes over: 20 s + 12 × 5 s = 80 s. Liveness/readiness untouched — once startup succeeds the tight chart defaults are fine on a warm pod. None of the three charts expose
startupProbeas a values knob, so all three use a strategic-merge/JSON patch on the rendered Deployment.Why
initialDelaySeconds, not faster pollingA first revision used
periodSeconds: 2, failureThreshold: 30(noinitialDelaySeconds). That failed the merge-queuecheck-event-warningsaction (commit 19fb2b3 fixes it). kubelet emits the samereason=UnhealthyWarning event regardless of whether startup, liveness, or readiness failed — so a startupProbe alone doesn't silence the noise. AndperiodSeconds: 2actually made it worse: a ~13 s cold start produced 5–7 failed-probe events per pod (vs. 1–3 with the chart-defaultperiodSeconds: 10). The action records a marker post-reconcile and fails on any Warning withlastTimestampwithin a 90 s settle window; this PR's rollouts now created cold-start failures inside that window.The current revision sets
initialDelaySeconds: 20so the first probe lands on an already-serving container — zero failed probes on a normal rollout, and the chart-default liveness/readiness only start after startup succeeds (i.e. once HTTP is healthy).Trade-offs
headlamp/actual-budgetthe patch lands oncontainers[0](the main app container); theheadlamp-pluginsidecar has no probes and is unaffected.Validation
ksail workload validate→256 files validated(local)ksail --config ksail.prod.yaml workload validate→256 files validatedkubectl kustomize k8s/providers/{docker,hetzner}/apps/both succeed; all threestartupProbeblocks resolve into the rendered HelmRelease post-renderer.