[Kubernetes] Update OOMKill OOTB Dashboards #23266

Open
triviajon wants to merge 1 commit into master from triviajon/CONTINT-5192/oom-kill-graph

triviajon commented Apr 9, 2026

What does this PR do?

Updates the "Containers OOM Killed" widget on the Kubernetes Pods and Kubernetes Clusters OOTB dashboards. The previously graphed metric is replaced with clamp_min(diff(restarts), 0) * last_state.terminated{reason:oomkilled}: restarts acts as the event trigger, and last_state.terminated acts as an "OOM filter".
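To illustrate how the new expression behaves, here is a minimal Python sketch of the same arithmetic on one container's time series. The helper names (`clamp_min`, `diff`, `oom_kills`), the sample values, and the way each diff is aligned with the later point's flag are assumptions made for illustration; they are not the dashboard query engine's actual implementation.

```python
def diff(values):
    """Difference between each point and the previous one."""
    return [b - a for a, b in zip(values, values[1:])]


def clamp_min(values, floor):
    """Raise every value below `floor` up to `floor`."""
    return [max(v, floor) for v in values]


def oom_kills(restarts, oom_flag):
    """Count restart increments that coincide with an OOM-terminated last state.

    restarts: cumulative restart count for one container (hypothetical samples).
    oom_flag: 1 when last_state.terminated{reason:oomkilled} is set, else 0.
    """
    # Negative diffs (for example a counter reset) are clamped to 0.
    deltas = clamp_min(diff(restarts), 0)
    # Assumption: each delta is paired with the flag at the later point.
    return [d * f for d, f in zip(deltas, oom_flag[1:])]


# Hypothetical series: one OOM restart, a counter reset, then a non-OOM restart.
restarts = [2, 2, 3, 3, 0, 1]
oom_flag = [0, 0, 1, 1, 0, 0]
result = oom_kills(restarts, oom_flag)
print(result)
```

In this sketch only the restart that coincides with the OOM flag is counted; the counter reset and the non-OOM restart both contribute zero.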

This also means the query must group by both pod_name and kube_container_name. Otherwise, a pod with many containers can show massive overcounting for a single OOM kill (n OOM kills can be overcounted to n(n-1)/2).

The fractional values that are sometimes visible are a known artifact: the kubelet check does not report restarts during a container's restart window, which leaves a small gap in the metric. Taking the diff/derivative across that gap spreads the increment in restarts over multiple points.
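The fractional artifact can be reproduced with a small Python sketch. It assumes the gap is filled by linear interpolation before the diff is taken; that fill strategy, the helper name `linear_fill`, and the sample values are assumptions for illustration, not a confirmed description of how the dashboard backend handles missing points.

```python
def linear_fill(points):
    """Fill None gaps by linear interpolation between the nearest known points.

    Leading or trailing gaps are left untouched in this sketch.
    """
    filled = list(points)
    known = [i for i, v in enumerate(points) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = points[left] + frac * (points[right] - points[left])
    return filled


# Hypothetical samples: restarts goes 3 -> 4, with two points missing
# while the container is restarting.
reported = [3, None, None, 4, 4]
filled = linear_fill(reported)
deltas = [b - a for a, b in zip(filled, filled[1:])]
print(deltas)
```

Under these assumptions the single restart appears as several fractional increments that still sum to one, which matches the shape of the artifact described above.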

Motivation

CONTINT-5192, FSCONS-1115

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged
