ci: discount Unhealthy probe warnings on deleted pods in the deploy gate#1646

Closed. devantler wants to merge 1 commit into main from claude/repo-assist-gate-ignore-deleted-pod-warnings

@devantler
Contributor

🤖 Generated by the Daily AI Assistant

Follow-up to #1637 (the deterministic fix for the rollout teardown race discussed there).

Problem

The merge-queue Deploy to Prod gate (.github/actions/check-event-warnings) fails on any Warning event in its 90s settle window. During a rollout, kubelet fires one last liveness/readiness probe ~1s after Cilium tears down a terminating pod's route:

Unhealthy: …:9003/healthz: connect: no route to host

That event is on a pod that is already deleted — a teardown artifact, not a steady-state fault — yet the gate counted it and failed the deploy. It hit OpenCost in #1637 and is a recurring flaky-failure source for any PR that rolls a pod. The gate's own description says it targets warnings "still firing at steady state"; a one-shot probe on a gone pod isn't that.

Fix

Snapshot live pods alongside the events, and partition off (don't count) Unhealthy warnings whose involved Pod is absent from the snapshot.

| property | behaviour |
| --- | --- |
| Scope | only reason == "Unhealthy" on a Pod not in the live snapshot |
| Real crash-loops / persistent probe failures | still counted: those pods exist (CrashLoopBackOff, etc.) |
| Other reasons (BackOff, FailedMount, …) on deleted pods | still counted (not discounted) |
| Snapshot unavailable | fail-safe: keep every warning (never silently hide because the snapshot broke) |
| Transparency | discounted warnings printed in a report-only ::group:: so nothing is masked without a trace |
| Event shapes | both core/v1 (.involvedObject) and events.k8s.io/v1 (.regarding) |
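
The rules in the table can be sketched as a small partition function. This is an illustrative Python rendering only, not the action's actual jq filter: the function names, the `namespace/name` key format, and the `opencost` namespace and pod names in the usage lines are assumptions made for the example.

```python
from typing import Optional

def involved_object(event: dict) -> dict:
    # core/v1 events use .involvedObject; events.k8s.io/v1 events use .regarding.
    return event.get("involvedObject") or event.get("regarding") or {}

def partition_warnings(events: list, live_pods: Optional[set]) -> tuple:
    """Split warnings into (kept, dropped).

    live_pods is a set of "namespace/name" strings (assumed key format),
    or None when the pod snapshot could not be obtained.
    """
    # Fail-safe: without a snapshot, keep every warning.
    if live_pods is None:
        return list(events), []

    kept, dropped = [], []
    for ev in events:
        obj = involved_object(ev)
        key = f"{obj.get('namespace', '')}/{obj.get('name', '')}"
        # Only Unhealthy warnings on a Pod absent from the snapshot are discounted.
        is_teardown_artifact = (
            ev.get("reason") == "Unhealthy"
            and obj.get("kind") == "Pod"
            and key not in live_pods
        )
        (dropped if is_teardown_artifact else kept).append(ev)
    return kept, dropped

# Hypothetical events: one deleted-pod probe failure, one live-pod probe
# failure, and one BackOff on the deleted pod.
events = [
    {"reason": "Unhealthy", "involvedObject": {"kind": "Pod", "namespace": "opencost", "name": "old-pod"}},
    {"reason": "Unhealthy", "regarding": {"kind": "Pod", "namespace": "opencost", "name": "new-pod"}},
    {"reason": "BackOff", "involvedObject": {"kind": "Pod", "namespace": "opencost", "name": "old-pod"}},
]
kept, dropped = partition_warnings(events, {"opencost/new-pod"})
```

With this input, only the first event lands in `dropped`; the live-pod probe failure and the BackOff on the deleted pod both stay in `kept` and would still fail the gate.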

Why this is safe

The gate's purpose is to catch problems still firing at steady state. By definition those occur on pods that exist — a crash-looping pod is in CrashLoopBackOff, a broken-readiness pod is Running-but-unready. Only the artifacts of pods that have ceased to exist are discounted, and even then only the probe-failure (Unhealthy) reason. Everything else is untouched, and the discounted set is logged.

Validation

  • jq partitioning unit-tested across every case: deleted-pod Unhealthy → dropped; live-pod Unhealthy, non-Unhealthy on deleted pods, non-Pod warnings, pre-marker events → correct; null snapshot → keep all.
  • shellcheck clean on the embedded run script; actionlint clean; yq parses.

Note: this changes CI tooling only (no platform manifests), so it's a ci: change and shouldn't trigger a platform release.

🤖 Generated with Claude Code

> 🤖 Generated by the Daily AI Assistant

The merge-queue "Deploy to Prod" gate (check-event-warnings) fails on any
Warning event firing in its 90s settle window. During a rollout, kubelet
fires one last liveness/readiness probe ~1s after the CNI (Cilium) tears
down a terminating pod's route, emitting:

  Unhealthy: …/healthz: connect: no route to host

against a pod that is already deleted. That is a teardown artifact, not a
steady-state fault — but the gate counted it and failed the deploy. It hit
OpenCost in PR #1637 and is a recurring source of flaky merge-queue failures
for any PR that rolls a pod.

Fix: snapshot live pods alongside the events, and split off (do not count)
"Unhealthy" warnings whose involved Pod no longer exists. Precise and safe:

  - Only reason == "Unhealthy" on a Pod that is absent from the live snapshot
    is discounted. Real crash loops and persistent probe failures occur on
    pods that STILL EXIST, so they stay counted and still fail the gate.
  - Other warning reasons (BackOff, FailedMount, …) are never discounted,
    even on deleted pods.
  - Fail-safe: if the pod snapshot can't be fetched, the filter keeps every
    warning (never silently hides warnings because the snapshot broke).
  - Transparency: discounted warnings are printed in a report-only group so a
    masked issue always leaves a trace in the log.
  - Both event shapes handled (core/v1 .involvedObject, events.k8s.io/v1
    .regarding).

Validated: jq partitioning unit-tested across all cases (deleted-pod
Unhealthy → dropped; live-pod Unhealthy, non-Unhealthy on deleted pods,
non-Pod warnings, pre-marker events → correct; null snapshot → keep all).
shellcheck clean on the run script; actionlint clean; yq parses.

Copilot AI review requested due to automatic review settings May 28, 2026 22:13
Contributor

Copilot AI left a comment

Pull request overview

Updates the check-event-warnings composite action to discount one-shot Unhealthy probe warnings on pods that no longer exist (rollout teardown artifacts) while still counting real crash loops and persistent probe failures. This addresses the recurring flaky deploy-gate failure described in #1637.

Changes:

  • Snapshot live pods alongside events and pass the namespaced names into the jq filter as --argjson livePods.
  • Partition warnings into kept (counted) and dropped (Unhealthy on a Pod absent from the snapshot), with a fail-safe that keeps all warnings if the snapshot can't be obtained.
  • Print discounted warnings in a report-only ::group:: so nothing is masked without a trace, and update the action's description accordingly.
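
The snapshot step in the first bullet could look roughly like the following. It is a sketch under stated assumptions: it supposes the snapshot is the JSON printed by `kubectl get pods -A -o json`, that namespaced names take the form `namespace/name`, and that the helper name `live_pod_names` is hypothetical. The PR text does not show how the action actually fetches or encodes the snapshot.

```python
import json
from typing import Optional

def live_pod_names(pod_list_json: Optional[str]) -> Optional[set]:
    """Turn a PodList JSON document into a set of "namespace/name" strings.

    Returns None (not an empty set) when the snapshot is missing or
    malformed, so a caller can tell "no live pods" apart from "snapshot
    broke" and keep every warning in the latter case.
    """
    if not pod_list_json:
        return None
    try:
        doc = json.loads(pod_list_json)
        return {
            f"{pod['metadata']['namespace']}/{pod['metadata']['name']}"
            for pod in doc["items"]
        }
    except (ValueError, KeyError, TypeError):
        # Unparsable JSON or an unexpected shape: treat as unavailable.
        return None

# Hypothetical snapshot with a single live pod.
snapshot = json.dumps(
    {"items": [{"metadata": {"namespace": "opencost", "name": "new-pod"}}]}
)
names = live_pod_names(snapshot)
```

Distinguishing `None` from an empty set matters here: an empty set would mean every Unhealthy pod warning is discounted, which is exactly what the fail-safe is meant to prevent when the snapshot could not be read.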

@devantler devantler marked this pull request as ready for review May 28, 2026 22:24
@devantler devantler closed this May 28, 2026
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 28, 2026
@devantler devantler deleted the claude/repo-assist-gate-ignore-deleted-pod-warnings branch May 28, 2026 22:26
@devantler
Contributor Author

I do not want this. It is a hacky workaround.
