Delete driver pod in update-clusterpolicy #2136
Conversation
Signed-off-by: Arjun <agadiyar@nvidia.com>
/ok-to-test b9ace21
@JunAr7112 is the description correct? I see you had mentioned:
# Delete driver pod to force recreation with updated labels. Existing pods are not automatically restarted due to the DaemonSet OnDelete updateStrategy.
echo "Deleting driver pod to trigger recreation with updated labels..."
kubectl delete pod -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset --ignore-not-found
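For context, a manual delete like this relies on the DaemonSet controller creating a replacement pod from the updated template. A minimal sketch of how the test might then confirm the recreation, assuming the replacement pod carries the same app=nvidia-driver-daemonset label and lives in the same $TEST_NAMESPACE as above:

```bash
# Sketch only: wait for the replacement driver pod, then inspect its labels
# to verify the patched labels were actually applied to the new pod.
kubectl wait pod -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset \
  --for=condition=Ready --timeout=600s
kubectl get pod -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].metadata.labels}'
```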
The upgrade controller is responsible for deleting the pod; we shouldn't be deleting it manually in the tests. When the labels for the daemonset change, the daemonset hash changes. The upgrade controller detects that the daemonset hash has changed and labels the node as "upgrade-required". Then the upgrade controller gracefully terminates the driver pod, and a new one gets created after that.
If it's failing to delete sometimes, we need to gather more logs as to why; maybe it just takes some extra time to catch up. So I would try to understand what is happening and see whether increasing the time for the test (e.g. timing out after 15 minutes instead of 10) helps or not.
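A minimal sketch of that approach, i.e. letting the upgrade controller drive the restart and just giving it more time. The label names used here (nvidia.com/gpu-driver-upgrade-state for the node upgrade state, nvidia.com/gpu.present=true to select GPU nodes) and the "upgrade-done" terminal value are assumptions, not taken from this PR:

```bash
# Sketch only: wait for the upgrade controller to finish instead of deleting the
# driver pod from the test, with a longer overall timeout (15 min instead of 10).
TIMEOUT=900
INTERVAL=10
elapsed=0
while [ "$elapsed" -lt "$TIMEOUT" ]; do
  # Assumed label: nvidia.com/gpu-driver-upgrade-state on GPU nodes.
  state=$(kubectl get nodes -l nvidia.com/gpu.present=true \
    -o jsonpath='{.items[0].metadata.labels.nvidia\.com/gpu-driver-upgrade-state}')
  echo "driver upgrade state: ${state:-<none>}"
  [ "$state" = "upgrade-done" ] && break
  sleep "$INTERVAL"
  elapsed=$((elapsed + INTERVAL))
done
if [ "$state" != "upgrade-done" ]; then
  echo "timed out waiting for the driver upgrade to complete"
  exit 1
fi
```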
Ok I see. I'll try looking at the logs here and verifying why it isn't deleting.
Description
Fix flaky failures in the update-clusterpolicy test. The issue is that when the NVIDIADriver CR is patched with new labels, the pod template in the DaemonSet is updated with the new labels, but the existing pod isn't automatically restarted.
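A quick way to see why the existing pod stays on the old template is to inspect the DaemonSet's update strategy. A minimal sketch, assuming the driver DaemonSet is selectable by app=nvidia-driver-daemonset in $TEST_NAMESPACE as in the test snippet above:

```bash
# Sketch only: with updateStrategy "OnDelete", the DaemonSet controller applies
# the updated pod template only after the existing pod is deleted (normally by
# the upgrade controller), which is why the pod is not restarted automatically.
kubectl get daemonset -n "$TEST_NAMESPACE" -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].spec.updateStrategy.type}'
# Expected output: OnDelete
```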
Checklist
- make lint
- make validate-generated-assets
- make validate-modules

Testing