docs(skills): initial conversion of GPU Operators skills #401
Open
miyoungc wants to merge 1 commit into main from gpu-operator-docs-to-skills
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| ../gpu-operator/.agents/skills |
189 changes: 189 additions & 0 deletions
gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,189 @@ | ||
| --- | ||
| name: "gpu-operator-container-device" | ||
| description: "Explains how to configure CDI and NRI support for GPU workloads. Use when enabling CDI, configuring containerd, or troubleshooting CDI-based GPU injection. Trigger keywords - NVIDIA GPU Operator, CDI, NRI, containerd, Kubernetes." | ||
| --- | ||
|
|
||
| <!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> | ||
| <!-- SPDX-License-Identifier: Apache-2.0 --> | ||
|
|
||
| # Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support | ||
|
|
||
| This page gives an overview of CDI and NRI Plugin support in the GPU Operator. | ||
|
|
||
| ## About Container Device Interface (CDI) | ||
|
|
||
| The [Container Device Interface (CDI)](https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md) | ||
| is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means, | ||
| and standardizes access across container runtimes. Popular container runtimes can read and process the specification to | ||
| ensure that a device is available in a container. CDI simplifies adding support for devices such as NVIDIA GPUs because | ||
| the specification is applicable to all container runtimes that support CDI. | ||
|
|
||
| Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes. | ||
| Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload | ||
| containers. This differs from prior GPU Operator releases where CDI was used via a CDI-enabled `nvidia` runtime class. | ||
|
|
||
| If you are upgrading from a GPU Operator version earlier than v25.10.0, where CDI was disabled by default, to v25.10.0 or later, where CDI is enabled by default, no configuration changes are required for standard workloads that use GPU allocation through the Device Plugin. | ||
| For workloads that already have `runtimeClassName: nvidia` set in their pod spec YAML, no change is necessary. | ||
|
|
||
| Use of CDI is transparent to cluster administrators and application developers. | ||
| The main benefit of CDI is that it reduces the development and support effort for runtime-specific | ||
| plugins. | ||
|
|
||
| ### CDI and GPU Management Containers | ||
|
|
||
| When CDI is enabled in GPU Operator versions v25.10.0 and later, GPU Management Containers that use the `NVIDIA_VISIBLE_DEVICES` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs, must set `runtimeClassName: nvidia` in the pod specification. | ||
| A GPU Management Container is a container that requires access to all GPUs without them being allocated by Kubernetes. | ||
| Examples of GPU Management Containers include monitoring agents and device plugins. | ||
|
|
||
| It is recommended that `NVIDIA_VISIBLE_DEVICES` only be used by GPU Management Containers. | ||
|
|
||
| **Note:** | ||
|
|
||
| Setting `runtimeClassName: nvidia` in the pod specification is not required when the NRI Plugin is enabled in GPU Operator. | ||
| Refer to About the Node Resource Interface (NRI) Plugin. | ||
|
|
||
| ## Step 1: Enabling CDI | ||
|
|
||
| CDI is enabled by default during installation in GPU Operator v25.10.0 and later. | ||
| Follow the instructions for installing the Operator with Helm on the getting-started page. | ||
|
|
||
| CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later. | ||
|
|
||
| ### Enabling CDI After Installation | ||
|
|
||
| CDI is enabled by default in GPU Operator v25.10.0 and later. | ||
| Use the following procedure to enable CDI if you disabled CDI during installation. | ||
|
|
||
| ### Procedure | ||
| 1. Enable CDI by modifying the cluster policy: | ||
|
|
||
| ```console | ||
| $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ | ||
| -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":true}]' | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ```output | ||
| clusterpolicy.nvidia.com/cluster-policy patched | ||
| ``` | ||
|
|
||
| 1. (Optional) Confirm that the container toolkit and device plugin pods restart: | ||
|
|
||
| ```console | ||
| $ kubectl get pods -n gpu-operator | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ## Step 2: Disabling CDI | ||
|
|
||
| While CDI is the default and recommended mechanism for injecting GPU support into containers, you can | ||
| disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the following procedure: | ||
|
|
||
| 1. If your nodes use the CRI-O container runtime, then temporarily disable the | ||
| GPU Operator validator: | ||
|
|
||
| ```console | ||
| $ kubectl label nodes \ | ||
| nvidia.com/gpu.deploy.operator-validator=false \ | ||
| -l nvidia.com/gpu.present=true \ | ||
| --overwrite | ||
| ``` | ||
|
|
||
| **Tip:** | ||
|
|
||
| You can run `kubectl get nodes -o wide` and view the `CONTAINER-RUNTIME` | ||
| column to determine if your nodes use CRI-O. | ||
| 1. Disable CDI by modifying the cluster policy: | ||
|
|
||
| ```console | ||
| $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ | ||
| -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":false}]' | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ```output | ||
| clusterpolicy.nvidia.com/cluster-policy patched | ||
| ``` | ||
|
|
||
| 1. If you temporarily disabled the GPU Operator validator, re-enable the validator: | ||
|
|
||
| ```console | ||
| $ kubectl label nodes \ | ||
| nvidia.com/gpu.deploy.operator-validator=true \ | ||
| -l nvidia.com/gpu.present=true \ | ||
| --overwrite | ||
| ``` | ||
|
|
||
| ## About the Node Resource Interface (NRI) Plugin | ||
|
|
||
| Node Resource Interface (NRI) is a standardized interface for plugging in extensions, called NRI Plugins, to OCI-compatible container runtimes like containerd. | ||
| NRI Plugins serve as hooks that intercept pod and container lifecycle events and perform functions such as injecting devices into a container and applying topology-aware placement strategies. For more details on NRI, refer to the [NRI overview](https://github.com/containerd/nri/tree/main?tab=readme-ov-file#background) in the containerd repository. | ||
|
|
||
| When enabled in the GPU Operator, the NVIDIA Container Toolkit daemonset will run an NRI Plugin on every GPU node. | ||
| The purpose of the NRI Plugin is to inject GPUs into GPU management containers that use the `NVIDIA_VISIBLE_DEVICES` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs. | ||
|
|
||
| In previous GPU Operator versions, device injection was handled by the `nvidia` container runtime. With CDI and the NRI Plugin enabled, the `nvidia` runtime class is no longer needed. If you enable the NRI Plugin during installation, the `nvidia` runtime class is not created. If you enable the NRI Plugin after installation, the `nvidia` runtime class is deleted. | ||
|
|
||
| Additionally, with the NRI Plugin enabled, modifications to the container runtime configuration are no longer needed. For example, no modifications are made to containerd’s config.toml file. | ||
| This means that on platforms that configure containerd in a non-standard way, like k3s, k0s, and Rancher Kubernetes Engine 2, users no longer need to configure environment variables like `CONTAINERD_CONFIG`, `CONTAINERD_SOCKET`, or `RUNTIME_CONFIG_SOURCE`. | ||
|
|
||
| ## Step 3: Enabling the NRI Plugin | ||
|
|
||
| The NRI Plugin requires the following: | ||
|
|
||
| - CDI to be enabled in the GPU Operator. | ||
|
|
||
| - containerd v1.7.30, v2.1.x, or v2.2.x. | ||
| If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator. | ||
|
|
||
| **Note:** | ||
|
|
||
| Enabling the NRI Plugin is not supported with CRI-O. | ||
|
|
||
| To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the getting-started page and include the `--set cdi.nriPluginEnabled=true` argument in your Helm command. | ||
|
|
||
| ### Enabling the NRI Plugin After Installation | ||
|
|
||
| 1. Enable NRI Plugin by modifying the cluster policy: | ||
|
|
||
| ```console | ||
| $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ | ||
| -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]' | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ```output | ||
| clusterpolicy.nvidia.com/cluster-policy patched | ||
| ``` | ||
|
|
||
| After enabling the NRI Plugin, the `nvidia` runtime class will be deleted. | ||
|
|
||
| 1. (Optional) Confirm that the container toolkit and device plugin pods restart: | ||
|
|
||
| ```console | ||
| $ kubectl get pods -n gpu-operator | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ## Step 4: Disabling the NRI Plugin | ||
|
|
||
| Disable the NRI Plugin and use the `nvidia` runtime class instead with the following procedure: | ||
|
|
||
| Disable the NRI Plugin by modifying the cluster policy: | ||
|
|
||
| ```console | ||
| $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ | ||
| -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]' | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ```output | ||
| clusterpolicy.nvidia.com/cluster-policy patched | ||
| ``` | ||
|
|
||
| After disabling the NRI Plugin, the `nvidia` runtime class will be created. | ||
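The four `kubectl patch` commands in this skill differ only in the JSON pointer path (`/spec/cdi/enabled` or `/spec/cdi/nriPluginEnabled`) and the boolean value. As a minimal sketch, a small shell function could build the patch body so the quoting is written once. The name `cluster_policy_patch` is hypothetical and introduced here only for illustration; it is not part of kubectl or the GPU Operator.

```shell
# Hypothetical helper (an assumption for illustration, not a GPU Operator tool).
# Prints a JSON patch body that replaces one boolean field of the cluster policy.
#   $1: JSON pointer path, for example /spec/cdi/enabled
#   $2: true or false
cluster_policy_patch() {
  case "$2" in
    true|false) ;;
    *)
      echo "value must be true or false" >&2
      return 1
      ;;
  esac
  printf '[{"op": "replace", "path": "%s", "value": %s}]\n' "$1" "$2"
}
```

With the function defined in the current shell, the NRI Plugin patch shown above could be issued as `kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' -p="$(cluster_policy_patch /spec/cdi/nriPluginEnabled true)"`. Rejecting anything other than `true` or `false` keeps a typo from producing an invalid patch.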
104 changes: 104 additions & 0 deletions
gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,104 @@ | ||
| --- | ||
| name: "gpu-operator-custom-driver" | ||
| description: "Shows how to provide custom NVIDIA driver parameters to GPU Operator driver containers. Use when changing driver module options or customizing driver container behavior. Trigger keywords - NVIDIA GPU Operator, driver parameters, NVIDIA driver, configuration." | ||
| --- | ||
|
|
||
| <!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> | ||
| <!-- SPDX-License-Identifier: Apache-2.0 --> | ||
|
|
||
| # Customizing NVIDIA GPU Driver Parameters during Installation | ||
|
|
||
| The NVIDIA Driver kernel modules accept a number of parameters which can be used to customize the behavior of the driver. | ||
| By default, the GPU Operator loads the kernel modules with default values. | ||
| On a machine with the driver already installed, you can list the parameter names and values with the `cat /proc/driver/nvidia/params` command. | ||
| You can pass custom parameters to the kernel modules that get loaded as part of the | ||
| NVIDIA Driver installation (`nvidia`, `nvidia-modeset`, `nvidia-uvm`, and `nvidia-peermem`). | ||
|
|
||
| ## Step 1: Configure Custom Driver Parameters | ||
|
|
||
| To pass custom parameters, execute the following steps. | ||
|
|
||
| 1. Create a configuration file named `<module>.conf`, where `<module>` is the name of the kernel module the parameters are for. | ||
| The file should contain parameters as key-value pairs -- one parameter per line. | ||
|
|
||
| The following example shows the GPU firmware logging parameter being passed to the `nvidia` module. | ||
|
|
||
| ```console | ||
| $ cat nvidia.conf | ||
| NVreg_EnableGpuFirmwareLogs=2 | ||
| ``` | ||
|
|
||
| 1. Create a `ConfigMap` for the configuration file. | ||
| If multiple modules are being configured, pass multiple files when creating the `ConfigMap`. | ||
|
|
||
| ```console | ||
| $ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf=./nvidia.conf | ||
| ``` | ||
|
|
||
| 1. Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` | ||
| containing the kernel module parameters. | ||
|
|
||
| ```console | ||
| $ helm install --wait --generate-name \ | ||
| -n gpu-operator --create-namespace \ | ||
| nvidia/gpu-operator \ | ||
| --version=${version} \ | ||
| --set driver.kernelModuleConfig.name="kernel-module-params" | ||
| ``` | ||
|
|
||
| ### Example using `nvidia-uvm` module | ||
|
|
||
| This example shows how to disable Heterogeneous Memory Management (HMM) in the `nvidia-uvm` module. | ||
| Refer to [Simplifying GPU Application Development with Heterogeneous Memory Management](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) for more information about HMM. | ||
|
|
||
| 1. Create a configuration file named `nvidia-uvm.conf`: | ||
|
|
||
| ```console | ||
| $ cat nvidia-uvm.conf | ||
| uvm_disable_hmm=1 | ||
| ``` | ||
|
|
||
| 1. Create a `ConfigMap` for the configuration file. | ||
| If multiple modules are being configured, pass multiple files when creating the `ConfigMap`. | ||
|
|
||
| ```console | ||
| $ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia-uvm.conf=./nvidia-uvm.conf | ||
| ``` | ||
|
|
||
| 1. Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` | ||
| containing the kernel module parameters. | ||
|
|
||
| ```console | ||
| $ helm install --wait --generate-name \ | ||
| -n gpu-operator --create-namespace \ | ||
| nvidia/gpu-operator \ | ||
| --version=${version} \ | ||
| --set driver.kernelModuleConfig.name="kernel-module-params" | ||
| ``` | ||
|
|
||
| 1. To verify that the parameter has been applied, list the contents of `/sys/module/nvidia_uvm/parameters/` on the node: | ||
|
|
||
| ```console | ||
| $ ls /sys/module/nvidia_uvm/parameters/ | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ```output | ||
| ... | ||
| uvm_disable_hmm uvm_perf_access_counter_migration_enable uvm_perf_prefetch_min_faults | ||
| uvm_downgrade_force_membar_sys uvm_perf_access_counter_threshold uvm_perf_prefetch_threshold | ||
| ... | ||
| ``` | ||
|
|
||
| Then check the value of the parameter: | ||
|
|
||
| ```console | ||
| $ cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm | ||
| ``` | ||
|
|
||
| *Example Output* | ||
|
|
||
| ```output | ||
| Y | ||
| ``` |
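The verification step above checks one parameter at a time. As a rough sketch, the same check could be done for every parameter of a module by printing each file under its `parameters` directory as a `name=value` pair. The function name `show_module_params` is hypothetical, and the sketch assumes it runs on the node (or in a shell that can see the node's `/sys`).

```shell
# Hypothetical helper (an assumption for illustration).
# Prints every readable kernel module parameter in a directory as name=value.
#   $1: parameters directory, for example /sys/module/nvidia_uvm/parameters
show_module_params() {
  for f in "$1"/*; do
    # Skip the unexpanded glob of an empty directory and unreadable entries.
    if [ -f "$f" ] && [ -r "$f" ]; then
      printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    fi
  done
}

# Example call on a GPU node:
#   show_module_params /sys/module/nvidia_uvm/parameters | grep uvm_disable_hmm
```

Taking the directory as an argument keeps the function usable for the other modules named in this skill, since each loaded module exposes its parameters under `/sys/module/<module>/parameters/`.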
Medium: `Trigger keywords - …` suffix in description (`root_cause:description-trigger-suffix`)

The converter appends `Trigger keywords - X, Y, Z.` to the end of every skill's `description` field. This isn't part of the Agent Skills spec: the spec describes `description` as a focused, single-purpose summary capped at 1024 chars, with separate `triggers` and `tags` arrays for keyword-style routing.

The suffix bloats the description (e.g. `gpu-operator-references/SKILL.md:3` runs to 17+ keywords), competes with the actual sentence-form description for the model's attention during routing, and duplicates information that should live under `triggers:` / `tags:`.

25 SKILLs affected (all 25 carry the suffix). Recommend dropping the `Trigger keywords - …` suffix from every `description` and instead emitting `triggers:` and `tags:` arrays in the frontmatter (the upstream RSTs already supply them via the `:tags:` and `:keywords:` fields in their `.. meta::` blocks).
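
Until the converter is changed, the suffix could be stripped from already generated files with a `sed` filter. This is only a sketch: it assumes each `description` is a single double-quoted line that ends with the suffix, as in the two SKILL.md files shown in this diff, and `strip_trigger_suffix` is a hypothetical name. It does not emit `triggers:` or `tags:` arrays; that part still belongs in the converter.

```shell
# Hypothetical cleanup filter (an assumption for illustration).
# Reads SKILL.md text on stdin and removes a trailing
# ' Trigger keywords - ...' suffix from the description line only.
strip_trigger_suffix() {
  sed -E '/^description: /s/ Trigger keywords - [^"]*"$/"/'
}
```

For example, `strip_trigger_suffix < SKILL.md > SKILL.md.new` would write a cleaned copy and leave every line other than `description:` unchanged.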