ci: discount Unhealthy probe warnings on deleted pods in the deploy gate by devantler · Pull Request #1646 · devantler-tech/platform

devantler · 2026-05-28T22:13:35Z

🤖 Generated by the Daily AI Assistant

Follow-up to #1637 (the deterministic fix for the rollout teardown race discussed there).

Problem

The merge-queue Deploy to Prod gate (.github/actions/check-event-warnings) fails on any Warning event in its 90s settle window. During a rollout, kubelet fires one last liveness/readiness probe ~1s after Cilium tears down a terminating pod's route:

Unhealthy: …:9003/healthz: connect: no route to host

That event is on a pod that is already deleted — a teardown artifact, not a steady-state fault — yet the gate counted it and failed the deploy. It hit OpenCost in #1637 and is a recurring flaky-failure source for any PR that rolls a pod. The gate's own description says it targets warnings "still firing at steady state"; a one-shot probe on a gone pod isn't that.

Fix

Snapshot live pods alongside the events, and partition off (don't count) Unhealthy warnings whose involved Pod is absent from the snapshot.

| Property | Behaviour |
|---|---|
| Scope | only `reason == "Unhealthy"` on a `Pod` not in the live snapshot |
| Real crash-loops / persistent probe failures | still counted, since those pods exist (CrashLoopBackOff, etc.) |
| Other reasons (`BackOff`, `FailedMount`, …) on deleted pods | still counted (not discounted) |
| Snapshot unavailable | fail-safe: keep every warning (never silently hide because the snapshot broke) |
| Transparency | discounted warnings printed in a report-only `::group::` so nothing is masked without a trace |
| Event shapes | both core/v1 (`.involvedObject`) and events.k8s.io/v1 (`.regarding`) |
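The scoping rules above can be sketched in a few lines. This is only an illustration in Python, not the action's actual jq filter: the function names `involved_object` and `partition_warnings` and the `"namespace/name"` key format are assumptions made for the example.

```python
from typing import Iterable, Optional


def involved_object(event: dict) -> dict:
    """Return the object an event refers to, for either event API shape."""
    # core/v1 events use `involvedObject`; events.k8s.io/v1 uses `regarding`.
    return event.get("involvedObject") or event.get("regarding") or {}


def partition_warnings(events: Iterable[dict], live_pods: Optional[set]):
    """Split warning events into (kept, dropped).

    `live_pods` is a set of "namespace/name" strings (hypothetical format),
    or None when the pod snapshot could not be taken. With None, nothing is
    dropped: that is the fail-safe path.
    """
    kept, dropped = [], []
    for event in events:
        obj = involved_object(event)
        key = f"{obj.get('namespace', '')}/{obj.get('name', '')}"
        discount = (
            live_pods is not None                   # snapshot available
            and event.get("reason") == "Unhealthy"  # probe failures only
            and obj.get("kind") == "Pod"            # pods only
            and key not in live_pods                # pod no longer exists
        )
        (dropped if discount else kept).append(event)
    return kept, dropped
```

Only the `kept` list would count toward failing the gate. A `BackOff` event on a deleted pod, or an `Unhealthy` event on a pod that is still in the snapshot, fails one of the four conditions and therefore stays in `kept`.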

Why this is safe

The gate's purpose is to catch problems still firing at steady state. By definition those occur on pods that exist — a crash-looping pod is in CrashLoopBackOff, a broken-readiness pod is Running-but-unready. Only the artifacts of pods that have ceased to exist are discounted, and even then only the probe-failure (Unhealthy) reason. Everything else is untouched, and the discounted set is logged.

Validation

jq partitioning unit-tested across every case: deleted-pod Unhealthy → dropped; live-pod Unhealthy, non-Unhealthy on deleted pods, non-Pod warnings, pre-marker events → correct; null snapshot → keep all.
shellcheck clean on the embedded run script; actionlint clean; yq parses.

Note: this changes CI tooling only (no platform manifests), so it's a ci: change and shouldn't trigger a platform release.

🤖 Generated with Claude Code

> 🤖 Generated by the Daily AI Assistant
>
> The merge-queue "Deploy to Prod" gate (check-event-warnings) fails on any Warning event firing in its 90s settle window. During a rollout, kubelet fires one last liveness/readiness probe ~1s after the CNI (Cilium) tears down a terminating pod's route, emitting:
>
> Unhealthy: …/healthz: connect: no route to host
>
> against a pod that is already deleted. That is a teardown artifact, not a steady-state fault, but the gate counted it and failed the deploy. It hit OpenCost in PR #1637 and is a recurring source of flaky merge-queue failures for any PR that rolls a pod.
>
> Fix: snapshot live pods alongside the events, and split off (do not count) "Unhealthy" warnings whose involved Pod no longer exists.
>
> Precise and safe:
> - Only reason == "Unhealthy" on a Pod that is absent from the live snapshot is discounted. Real crash loops and persistent probe failures occur on pods that STILL EXIST, so they stay counted and still fail the gate.
> - Other warning reasons (BackOff, FailedMount, …) are never discounted, even on deleted pods.
> - Fail-safe: if the pod snapshot can't be fetched, the filter keeps every warning (never silently hides warnings because the snapshot broke).
> - Transparency: discounted warnings are printed in a report-only group so a masked issue always leaves a trace in the log.
> - Both event shapes handled (core/v1 .involvedObject, events.k8s.io/v1 .regarding).
>
> Validated: jq partitioning unit-tested across all cases (deleted-pod Unhealthy → dropped; live-pod Unhealthy, non-Unhealthy on deleted pods, non-Pod warnings, pre-marker events → correct; null snapshot → keep all). shellcheck clean on the run script; actionlint clean; yq parses.

Copilot

Pull request overview

Updates the check-event-warnings composite action to discount one-shot Unhealthy probe warnings on pods that no longer exist (rollout teardown artifacts) while still counting real crash loops and persistent probe failures. This addresses the recurring flaky deploy-gate failure described in #1637.

Changes:

Snapshot live pods alongside events and pass the namespaced names into the jq filter as --argjson livePods.
Partition warnings into kept (counted) and dropped (Unhealthy on a Pod absent from the snapshot), with a fail-safe that keeps all warnings if the snapshot can't be obtained.
Print discounted warnings in a report-only ::group:: so nothing is masked without a trace, and update the action's description accordingly.
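The first and last changes listed above (taking the snapshot and reporting what was discounted) might look roughly like this. It is a sketch in Python rather than the action's shell and jq. The helper names `live_pod_names` and `report_discounted` and the log line format are assumptions. The input is assumed to be a Kubernetes PodList JSON document, and `::group::` / `::endgroup::` are the GitHub Actions workflow commands for collapsible log sections.

```python
import json
from typing import Optional


def live_pod_names(pod_list_json: str) -> Optional[set]:
    """Parse a PodList JSON document into a set of "namespace/name" strings.

    Returns None when the input cannot be parsed, so a caller can fall back
    to keeping every warning instead of hiding any.
    """
    try:
        items = json.loads(pod_list_json)["items"]
        return {
            f"{pod['metadata']['namespace']}/{pod['metadata']['name']}"
            for pod in items
        }
    except (ValueError, KeyError, TypeError):
        return None


def report_discounted(dropped: list) -> list:
    """Build report-only log lines for discounted warnings."""
    if not dropped:
        return []
    lines = [f"::group::Discounted warnings ({len(dropped)}, report-only)"]
    for event in dropped:
        # Handle both core/v1 and events.k8s.io/v1 event shapes.
        obj = event.get("involvedObject") or event.get("regarding") or {}
        lines.append(
            f"{obj.get('namespace')}/{obj.get('name')}: "
            f"{event.get('reason')}: {event.get('message', '')}"
        )
    lines.append("::endgroup::")
    return lines
```

Returning `None` rather than an empty set on a failed snapshot matters: an empty set would mark every pod as deleted and discount every `Unhealthy` warning, which is the opposite of the fail-safe behaviour described.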

devantler · 2026-05-28T22:26:17Z

I do not want this. It is a hacky workaround

Copilot AI review requested due to automatic review settings May 28, 2026 22:13

github-project-automation Bot added this to 🌊 Project Board May 28, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 28, 2026

Copilot started reviewing on behalf of devantler May 28, 2026 22:13

Copilot AI reviewed May 28, 2026

devantler marked this pull request as ready for review May 28, 2026 22:24

devantler closed this May 28, 2026

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 28, 2026

devantler deleted the claude/repo-assist-gate-ignore-deleted-pod-warnings branch May 28, 2026 22:26

devantler wants to merge 1 commit into main from claude/repo-assist-gate-ignore-deleted-pod-warnings
