Delete driver pod in update-clusterpolicy#2136

Open
JunAr7112 wants to merge 1 commit into NVIDIA:main from JunAr7112:update_driver_delete

Conversation

@JunAr7112
Contributor

Description

Fix flaky failures in the update-clusterpolicy test. The issue here is that when the NVIDIADriver CR is patched with new labels, the pod template in the DaemonSet is updated with the new labels, but the existing pod isn't automatically restarted.
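
As a minimal sketch of why that happens (assuming the namespace variable and pod label used in the test change below), the driver DaemonSet's OnDelete update strategy can be inspected directly:

# With an OnDelete updateStrategy, a patched pod template only takes effect
# once the existing pod is deleted, so the running pod keeps its old labels.
# TEST_NAMESPACE and the app label are assumptions taken from the test snippet below.
kubectl get daemonset -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[*].spec.updateStrategy.type}{"\n"}'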

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Signed-off-by: Arjun <agadiyar@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@JunAr7112
Contributor Author

/ok-to-test b9ace21

@rahulait
Contributor

@JunAr7112 is the description correct? I see you mentioned "The issue here is that when the NVIDIADriver CR is patched", but it looks like the test is patching the ClusterPolicy, and there is no NVIDIADriver CR here.


# Delete driver pod to force recreation with updated labels. Existing pods are
# not automatically restarted due to the DaemonSet OnDelete updateStrategy.
echo "Deleting driver pod to trigger recreation with updated labels..."
kubectl delete pod -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset --ignore-not-found
Contributor

The upgrade-controller is responsible for deleting the pod; we shouldn't be deleting it manually in the tests. When the labels for the DaemonSet change, the DaemonSet hash changes. The upgrade-controller detects that the hash has changed and labels the node as "upgrade-required". It then gracefully terminates the driver pod, and a new one gets created after that.
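
For reference, one rough way to observe that flow while the test runs (a sketch only; the node label key is an assumption about the upgrade-controller's state label, and the pod label matches the test snippet above):

# Watch the upgrade state label the upgrade-controller sets on the node, and
# the driver pod being gracefully terminated and recreated.
# The nvidia.com/gpu-driver-upgrade-state key is an assumption, not verified here.
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state --watch &
kubectl get pods -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset --watch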

If it's sometimes failing to delete, we need to gather more logs to understand why; maybe it just takes some extra time to catch up. So I would first try to understand the failure, and then see whether increasing the test timeout (e.g. from 10 minutes to 15 minutes) helps.
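
If extra time turns out to be all that's needed, a poll along these lines could replace the manual delete (a sketch; the patched label foo=bar is a hypothetical placeholder, and the 15-minute budget mirrors the suggestion above):

# Give the upgrade-controller up to 15 minutes to replace the driver pod,
# polling until a Running pod carries the patched label.
# foo=bar is a placeholder for whatever label the test patches in.
for _ in $(seq 1 90); do
  kubectl get pods -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset,foo=bar \
    --no-headers 2>/dev/null | grep -q Running && exit 0
  sleep 10
done
echo "timed out waiting for relabeled driver pod"
exit 1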

Contributor Author

OK, I see. I'll look at the logs and verify why it isn't being deleted.
