ci: discount Unhealthy probe warnings on deleted pods in the deploy gate #1646
Closed
devantler wants to merge 1 commit into
> 🤖 Generated by the Daily AI Assistant

The merge-queue "Deploy to Prod" gate (check-event-warnings) fails on any Warning event firing in its 90s settle window. During a rollout, kubelet fires one last liveness/readiness probe ~1s after the CNI (Cilium) tears down a terminating pod's route, emitting:

    Unhealthy: …/healthz: connect: no route to host

against a pod that is already deleted. That is a teardown artifact, not a steady-state fault, but the gate counted it and failed the deploy. It hit OpenCost in PR #1637 and is a recurring source of flaky merge-queue failures for any PR that rolls a pod.

Fix: snapshot live pods alongside the events, and split off (do not count) "Unhealthy" warnings whose involved Pod no longer exists.

Precise and safe:
- Only reason == "Unhealthy" on a Pod that is absent from the live snapshot is discounted. Real crash loops and persistent probe failures occur on pods that STILL EXIST, so they stay counted and still fail the gate.
- Other warning reasons (BackOff, FailedMount, …) are never discounted, even on deleted pods.
- Fail-safe: if the pod snapshot can't be fetched, the filter keeps every warning (never silently hides warnings because the snapshot broke).
- Transparency: discounted warnings are printed in a report-only group so a masked issue always leaves a trace in the log.
- Both event shapes handled (core/v1 .involvedObject, events.k8s.io/v1 .regarding).

Validated:
- jq partitioning unit-tested across all cases (deleted-pod Unhealthy → dropped; live-pod Unhealthy, non-Unhealthy on deleted pods, non-Pod warnings, pre-marker events → correct; null snapshot → keep all).
- shellcheck clean on the run script; actionlint clean; yq parses.
Contributor
Pull request overview
Updates the check-event-warnings composite action to discount one-shot Unhealthy probe warnings on pods that no longer exist (rollout teardown artifacts) while still counting real crash loops and persistent probe failures. This addresses the recurring flaky deploy-gate failure described in #1637.
Changes:
- Snapshot live pods alongside events and pass the namespaced names into the jq filter as `--argjson livePods`.
- Partition warnings into `kept` (counted) and `dropped` (`Unhealthy` on a Pod absent from the snapshot), with a fail-safe that keeps all warnings if the snapshot can't be obtained.
- Print discounted warnings in a report-only `::group::` so nothing is masked without a trace, and update the action's description accordingly.
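
The partitioning described in this overview can be sketched as follows. This is a minimal Python illustration of the logic only, not the action's actual jq filter: the function names, the `namespace/name` key format for the live-pod snapshot, and the example pod names are all assumptions made for the sketch.

```python
from typing import Any, Iterable, Optional


def involved_object(event: dict[str, Any]) -> dict[str, Any]:
    """Return the object an event refers to, for either event API shape."""
    # core/v1 events use .involvedObject; events.k8s.io/v1 events use .regarding.
    return event.get("involvedObject") or event.get("regarding") or {}


def partition_warnings(
    events: Iterable[dict[str, Any]],
    live_pods: Optional[Iterable[str]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split warnings into (kept, dropped).

    `live_pods` is assumed to hold "namespace/name" strings (hypothetical
    format). Passing None models a snapshot that could not be fetched.
    """
    events = list(events)
    # Fail-safe: without a snapshot, keep every warning and drop nothing.
    if live_pods is None:
        return events, []

    live = set(live_pods)
    kept, dropped = [], []
    for event in events:
        obj = involved_object(event)
        key = f"{obj.get('namespace')}/{obj.get('name')}"
        # Only an Unhealthy warning on a Pod missing from the snapshot is dropped.
        is_teardown_artifact = (
            event.get("reason") == "Unhealthy"
            and obj.get("kind") == "Pod"
            and key not in live
        )
        (dropped if is_teardown_artifact else kept).append(event)
    return kept, dropped


# Hypothetical events: one per case of interest.
example_events = [
    {"reason": "Unhealthy",
     "involvedObject": {"kind": "Pod", "namespace": "opencost", "name": "old-pod"}},
    {"reason": "Unhealthy",
     "regarding": {"kind": "Pod", "namespace": "opencost", "name": "new-pod"}},
    {"reason": "BackOff",
     "involvedObject": {"kind": "Pod", "namespace": "opencost", "name": "old-pod"}},
]
kept, dropped = partition_warnings(example_events, ["opencost/new-pod"])
print(len(kept), len(dropped))  # 2 1
```

Only the `Unhealthy` event on the deleted `old-pod` is dropped; the `Unhealthy` on the live pod and the `BackOff` on the deleted pod both stay counted. The dropped list is returned rather than discarded so the caller can still print it in a report-only group.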
Contributor
Author
I do not want this. It is a hacky workaround.
Follow-up to #1637 (the deterministic fix for the rollout teardown race discussed there).
Problem
The merge-queue Deploy to Prod gate (`.github/actions/check-event-warnings`) fails on any `Warning` event in its 90s settle window. During a rollout, kubelet fires one last liveness/readiness probe ~1s after Cilium tears down a terminating pod's route, emitting `Unhealthy: …/healthz: connect: no route to host`.

That event is on a pod that is already deleted (a teardown artifact, not a steady-state fault), yet the gate counted it and failed the deploy. It hit OpenCost in #1637 and is a recurring flaky-failure source for any PR that rolls a pod. The gate's own description says it targets warnings "still firing at steady state"; a one-shot probe on a gone pod isn't that.
Fix
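Snapshot live pods alongside the events, and partition off (don't count) `Unhealthy` warnings whose involved Pod is absent from the snapshot.

- Discounted: `reason == "Unhealthy"` on a `Pod` not in the live snapshot.
- Still counted: other warning reasons (`BackOff`, `FailedMount`, …) on deleted pods.
- Transparency: discounted warnings are printed in a report-only `::group::` so nothing is masked without a trace.
- Event shapes handled: core/v1 (`.involvedObject`) and events.k8s.io/v1 (`.regarding`).

Why this is safe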
The gate's purpose is to catch problems still firing at steady state. By definition those occur on pods that exist: a crash-looping pod is in `CrashLoopBackOff`, a broken-readiness pod is `Running`-but-unready. Only the artifacts of pods that have ceased to exist are discounted, and even then only the probe-failure (`Unhealthy`) reason. Everything else is untouched, and the discounted set is logged.

Validation
- jq partitioning unit-tested across all cases: deleted-pod `Unhealthy` → dropped; live-pod `Unhealthy`, non-`Unhealthy` on deleted pods, non-Pod warnings, pre-marker events → correct; `null` snapshot → keep all.
- `shellcheck` clean on the embedded run script; `actionlint` clean; `yq` parses.

Note: this changes CI tooling only (no platform manifests), so it's a `ci:` change and shouldn't trigger a platform release.

🤖 Generated with Claude Code