[Question]: When using the GPU Operator, how do you integrate with Node Problem Detector for GPU checks (e.g., to check whether drivers are installed)? #2137

@NurullahMorshed

I migrated from running host-installed drivers to managing drivers with the GPU Operator on our cluster. In the past, we had Node Problem Detector run checks against the device and drivers and set node conditions based on the results. However, many of those checks relied on nvidia-smi.

Given that nvidia-smi is not available on the host when using the GPU Operator, what is the best approach for integrating with Node Problem Detector? Specifically, we want a node condition that tells us whether the drivers are installed. We plan to use it with https://kubernetes.io/blog/2026/02/03/introducing-node-readiness-controller/ to keep newly created nodes tainted until they are ready.
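To make the request concrete, here is a minimal sketch of the kind of check we have in mind, written as a Node Problem Detector custom plugin that does not need nvidia-smi on the host. It relies on the custom plugin exit-code convention (0 = OK, 1 = NonOK, anything else = Unknown). The marker path `/run/nvidia/validations/driver-ready` is an assumption about what the GPU Operator validator writes and should be verified against the deployed version. The function names are hypothetical, and the directory would have to be mounted into the Node Problem Detector pod.

```python
#!/usr/bin/env python3
"""Hypothetical Node Problem Detector custom-plugin check for GPU driver readiness."""
import os

# Exit codes understood by Node Problem Detector custom plugins.
OK, NONOK, UNKNOWN = 0, 1, 2

# ASSUMPTION: marker file signalling that the driver is ready.
# Verify this path against the GPU Operator version actually deployed.
DEFAULT_MARKER = "/run/nvidia/validations/driver-ready"


def check_driver_ready(marker_path):
    """Return (exit_code, message) depending on whether the marker file exists."""
    try:
        os.stat(marker_path)
    except FileNotFoundError:
        return NONOK, "GPU driver marker file not found"
    except OSError as exc:
        # For example, a permission error: the state cannot be determined.
        return UNKNOWN, f"cannot inspect marker file: {exc.strerror}"
    return OK, "GPU driver marker file present"


def main(argv):
    """Print a short status message and return the plugin exit code."""
    path = argv[1] if len(argv) > 1 else DEFAULT_MARKER
    code, message = check_driver_ready(path)
    print(message)
    return code

# When deployed as a plugin script, finish with:
#   import sys; sys.exit(main(sys.argv))
```

The exit code would then drive a custom node condition (the condition name is ours to choose) that the node readiness controller could watch. Whether a marker file like this is a supported signal is exactly the open question here.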
