Delete driver pod in update-clusterpolicy#2136

Open
JunAr7112 wants to merge 1 commit into NVIDIA:main from JunAr7112:update_driver_delete

Conversation

@JunAr7112
Contributor

Description

Fix flaky failures in the update-clusterpolicy test. The issue here is that when the NVIDIADriver CR is patched with new labels, the pod template in the DaemonSet is updated with the new labels, but the existing pod isn't automatically restarted.
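
As a minimal sketch of why that happens (assuming the namespace variable and pod label used in the test change below), the driver DaemonSet's OnDelete update strategy can be inspected directly:

# With an OnDelete updateStrategy, a patched pod template only takes effect
# once the existing pod is deleted, so the running pod keeps its old labels.
# TEST_NAMESPACE and the app label are assumptions taken from the test snippet below.
kubectl get daemonset -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[*].spec.updateStrategy.type}{"\n"}'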

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Signed-off-by: Arjun <agadiyar@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@JunAr7112
Contributor Author

/ok-to-test b9ace21

@rahulait
Contributor

@JunAr7112 is the description correct? I see you mentioned "The issue here is that when the NVIDIADriver CR is patched", but it looks like the test is patching the ClusterPolicy, and there is no NVIDIADriver CR here.


# Delete driver pod to force recreation with updated labels. Existing pods are
# not automatically restarted due to the DaemonSet OnDelete updateStrategy.
echo "Deleting driver pod to trigger recreation with updated labels..."
kubectl delete pod -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset --ignore-not-found
Contributor

The upgrade-controller is responsible for deleting the pod; we shouldn't be deleting it manually in the tests. When the labels for the DaemonSet change, the DaemonSet hash changes. The upgrade-controller detects that the hash has changed and labels the node as "upgrade-required". It then gracefully terminates the driver pod, and a new one gets created after that.
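
For reference, one rough way to observe that flow while the test runs (a sketch only; the node label key is an assumption about the upgrade-controller's state label, and the pod label matches the test snippet above):

# Watch the upgrade state label the upgrade-controller sets on the node, and
# the driver pod being gracefully terminated and recreated.
# The nvidia.com/gpu-driver-upgrade-state key is an assumption, not verified here.
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state --watch &
kubectl get pods -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset --watch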

If it's sometimes failing to delete, we need to gather more logs to understand why; maybe it just takes some extra time to catch up. So I would first try to understand the failure, and then see whether increasing the test timeout (e.g. from 10 minutes to 15 minutes) helps.
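
If extra time turns out to be all that's needed, a poll along these lines could replace the manual delete (a sketch; the patched label foo=bar is a hypothetical placeholder, and the 15-minute budget mirrors the suggestion above):

# Give the upgrade-controller up to 15 minutes to replace the driver pod,
# polling until a Running pod carries the patched label.
# foo=bar is a placeholder for whatever label the test patches in.
for _ in $(seq 1 90); do
  kubectl get pods -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset,foo=bar \
    --no-headers 2>/dev/null | grep -q Running && exit 0
  sleep 10
done
echo "timed out waiting for relabeled driver pod"
exit 1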

Contributor Author

OK, I see. I'll look at the logs and verify why it isn't being deleted.
