Using hybrid method to combine allocatable wait and flock wait #514

Closed
zxqlxy wants to merge 3 commits into GoogleCloudPlatform:master from zxqlxy:xinyunliu/master


zxqlxy commented Aug 19, 2025

Add a local lock to avoid a deadlock when the maxSurge rollingUpdate strategy is enabled, since two gpu-device-plugin instances run at the same time during the rolling update. When no lock is present, fall back to the remote node status; this is a little slower but still prevents the deadlock.

Comment thread pkg/gpu/nvidia/util/util.go Outdated
Comment thread pkg/gpu/nvidia/util/util.go
Comment thread cmd/nvidia_gpu/nvidia_gpu.go Outdated
zxqlxy requested a review from linxiulei August 27, 2025 21:28
glog.Infof("Failed to build kube client: %v", err)
return
}
nodeName, err := util.GetEnv(nodeNameEnv)
Review comment from a contributor:

This would require a change to the DaemonSet YAML, right? Should we fall back to getting the hostname if we fail to read this env var?

zxqlxy closed this Feb 20, 2026